msramalho closed this 11 months ago
it would be interesting to see how well LLMs can perform this task, especially for non-standard newspaper formats
plot twist: browsertrix is already doing some text extraction for us, in a file called pages/pages.jsonl generated alongside the wacz archive. I'll look at surfacing this content from the wacz_enricher.
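For anyone who wants to try this before it lands in the wacz_enricher, here is a minimal sketch of reading that file. It assumes pages/pages.jsonl is stored as a member of the .wacz zip and that each page record is a JSON object with `url` and `text` keys, preceded by a header line without a `url`. The function name `iter_page_texts` is hypothetical and not part of the existing code.

```python
import json
import zipfile
from typing import Iterator, Tuple


def iter_page_texts(wacz_path, member="pages/pages.jsonl") -> Iterator[Tuple[str, str]]:
    """Yield (url, text) for every page record that carries extracted text.

    Assumption: `member` is a JSON Lines file inside the .wacz zip archive.
    `wacz_path` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(wacz_path) as archive:
        with archive.open(member) as fh:
            for raw in fh:
                line = raw.decode("utf-8").strip()
                if not line:
                    continue
                record = json.loads(line)
                # Skip the header line and any page without extracted text.
                if "url" not in record or not record.get("text"):
                    continue
                yield record["url"], record["text"]
```

Usage would look like `for url, text in iter_page_texts("archive.wacz"): ...`. If the file sits next to the archive rather than inside it, the same loop applies to a plain `open()` call.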
closing for now, further enrichments/improvements still welcome if someone benefits from them.
Using a library like https://newspaper.readthedocs.io/en/latest/, implement an enricher that, given a page's HTML, can extract the textual content from it.
Bonus points if it can receive/read the output of the wacz archiver from the .wacz archive and extract the HTML directly; otherwise it can make its own request, or even be combined with a new simple enricher that downloads the HTML from a URL.
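To make the request concrete, here is a rough sketch of the "HTML in, text out" step. It deliberately uses only the standard library's `html.parser` as a stand-in for newspaper, so it does no article detection or boilerplate removal; it just collects visible text and skips script/style content. `extract_text` and `_TextCollector` are hypothetical names, and how this would be wired into an enricher is left open.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect visible text nodes, ignoring non-content elements."""

    SKIPPED = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            # Collapse runs of whitespace and drop empty nodes.
            text = " ".join(data.split())
            if text:
                self.chunks.append(text)


def extract_text(html: str) -> str:
    """Return the visible text of an HTML document, one text node per line."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "\n".join(parser.chunks)
```

The HTML itself could come from the .wacz archive or from a separate download step, as described above; this function does not care which.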