bellingcat / auto-archiver

Automatically archive links to videos, images, and social media content from Google Sheets (and more).
https://pypi.org/project/auto-archiver/
MIT License

feat: enricher to extract the article/text content from news articles #107

Closed: msramalho closed this issue 10 months ago

msramalho commented 10 months ago

Using a library like https://newspaper.readthedocs.io/en/latest/, implement an enricher that, given a page's HTML, extracts its textual content.

Bonus points if it can read the output of the wacz archiver and extract the HTML directly from the .wacz archive. Otherwise, it can make its own request, or be combined with a new, simple enricher that downloads the HTML from a URL.
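To make the "given a page's HTML, extract the text" step concrete, here is a minimal sketch that uses only the standard library's `html.parser` as a stand-in for a dedicated article-extraction library such as newspaper. It does no boilerplate or navigation removal, so it is cruder than what the issue asks for. The names `extract_text` and `_TextExtractor` are hypothetical and are not part of auto-archiver.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping script/style-like elements (hypothetical helper)."""

    SKIP = {"script", "style", "noscript", "template"}
    BLOCK = {"p", "div", "br", "li", "tr", "article", "section",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside an element whose text we ignore
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")  # block-level tags start a new line

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace within each line and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self._chunks).splitlines())
        return "\n".join(line for line in lines if line)


def extract_text(html: str) -> str:
    """Return the visible text of an HTML document, one block element per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

An enricher could call `extract_text` on HTML it downloaded itself or on HTML pulled out of the archive. A real implementation would likely swap this function for a library that can tell article body from page chrome.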

liliakai commented 10 months ago

See also https://github.com/alexander-matz/news3k

msramalho commented 10 months ago

It would be interesting to see how well LLMs can perform this task, especially for non-standard newspaper formats.

liliakai commented 10 months ago

Plot twist: browsertrix is already doing some text extraction for us, in a file called pages/pages.jsonl generated alongside the wacz archive. I'll look at surfacing this content from the wacz_enricher.
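As a rough illustration of surfacing that content, the sketch below reads `pages/pages.jsonl` and collects the extracted text per URL. It assumes the file is packaged inside the `.wacz` (which is a ZIP archive) and that each page record is a JSON object with a `url` field and an optional `text` field. The function name `read_wacz_page_text` is hypothetical and is not the wacz_enricher's actual code.

```python
import json
import zipfile
from typing import IO, Dict, Union


def read_wacz_page_text(
    wacz: Union[str, IO[bytes]],
    member: str = "pages/pages.jsonl",  # assumed location inside the archive
) -> Dict[str, str]:
    """Map each page URL to its extracted text, for records that carry both."""
    texts: Dict[str, str] = {}
    with zipfile.ZipFile(wacz) as archive:
        if member not in archive.namelist():
            return texts  # no pages file: nothing to surface
        with archive.open(member) as fh:
            for raw in fh:
                line = raw.decode("utf-8").strip()
                if not line:
                    continue
                record = json.loads(line)
                # A header line or a page without text has no url/text pair and is skipped.
                url, text = record.get("url"), record.get("text")
                if url and text:
                    texts[url] = text
    return texts
```

If the crawl was run without text extraction enabled, the records would have no `text` field and the function returns an empty mapping, at which point a fallback HTML-based extractor could take over.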

msramalho commented 10 months ago

Closing for now; further enrichments/improvements are still welcome if someone would benefit from them.