michaelneale closed 1 day ago
ahh, didn't realize about licensing. is there an automation we can add to audit/check these things? i'm not very well versed in the legalese to know
On Tue, Oct 22, 2024 at 18:32 Michael Neale @.***> wrote:
@.**** commented on this pull request.
In src/goose/synopsis/toolkit.py https://github.com/block/goose/pull/182#discussion_r1811635410:
- Args:
- url (str): url of the site to visit.
- Returns:
- (dict): A dictionary with two keys:
- 'html_file_path' (str): Path to a html file which has the content of the page. It will be very large so use rg to search it or head in chunks. Will contain meta data and links and markup.
- 'text_file_path' (str): Path to a plain text file which has some of the content of the page. It will be large so use rg to search it or head in chunks. If content isn't there, try the html variant.
- """ # noqa
- friendly_name = re.sub(r"[^a-zA-Z0-9]", "", url)[:50] # Limit length to prevent filenames from being too long
- try:
- result = httpx.get(url, follow_redirects=True).text
- with tempfile.NamedTemporaryFile(delete=False, mode="w", suffix=f"_{friendly_name}.html") as tmp_file:
- tmp_file.write(result)
- tmp_text_file_path = tmp_file.name.replace(".html", ".txt")
- plain_text = re.sub(
- r"<head.*?>.*?</head>|<script.*?>.*?</script>|<style.*?>.*?</style>|<[^>]+>",
it is GPL (v3) so a no go (already looked at that)
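For context on the regex in the hunk above: a minimal sketch of the same idea (drop `head`, `script`, and `style` content, keep the visible text) using only the Python standard library, so no third-party license is involved. This is not what the PR does; `TextExtractor` and `html_to_text` are hypothetical names made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything inside head/script/style."""

    SKIPPED = {"head", "script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._chunks)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The parser route tolerates nested or oddly spaced tags better than a single regex, at the cost of a few more lines.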
@lamchau yes! https://github.com/block/goose/pull/184 - can do it that way
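On the audit question, a rough sketch of one possible automation (not necessarily what the linked PR does): scan the metadata of installed packages and flag anything whose declared license mentions GPL. The deny-list and the function names below are assumptions for illustration, and package metadata is self-reported, so treat a clean result as a hint rather than a legal check.

```python
from importlib.metadata import distributions

# Hypothetical deny-list. The thread only names GPL v3 as a blocker; these
# markers also match LGPL and AGPL, so adjust them to the actual policy.
DENIED_MARKERS = ("GPL", "GENERAL PUBLIC LICENSE")


def is_denied(license_text: str) -> bool:
    """Return True if a license string contains any denied marker."""
    upper = license_text.upper()
    return any(marker in upper for marker in DENIED_MARKERS)


def find_denied(packages: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map package name -> the license strings that matched the deny-list."""
    flagged = {}
    for name, licenses in packages.items():
        hits = [lic for lic in licenses if is_denied(lic)]
        if hits:
            flagged[name] = hits
    return flagged


def installed_licenses() -> dict[str, list[str]]:
    """Collect declared license strings for every installed distribution."""
    result: dict[str, list[str]] = {}
    for dist in distributions():
        meta = dist.metadata
        name = meta.get("Name") or "unknown"
        found = []
        # Free-text fields; either may be absent.
        for field in ("License", "License-Expression"):
            value = meta.get(field)
            if value:
                found.append(value)
        # Trove classifiers such as "License :: OSI Approved :: MIT License".
        for classifier in meta.get_all("Classifier") or []:
            if classifier.startswith("License ::"):
                found.append(classifier)
        result[name] = found
    return result


def audit_installed() -> dict[str, list[str]]:
    """Return the installed packages whose declared license is denied."""
    return find_denied(installed_licenses())
```

A CI step could call `audit_installed()` after installing dependencies and fail the build when the returned dict is non-empty.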
This was stuff that didn't make it over yet