bitextor / warc2text

Extracts plain text, language identification and more metadata from WARC records
MIT License

Skip large pdfs #17

Closed jelmervdl closed 3 years ago

jelmervdl commented 3 years ago

Ugly workaround for #16 so I can continue processing WARCs.

For ParaCrawl I think this solution is sufficient. I do not expect troves of bilingual text in these massive PDFs, just a bunch of scanned books with, at best, mediocre OCR applied to them.
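
To illustrate the idea only: a size-based skip could look roughly like the sketch below. This is not the code in this PR (warc2text itself is written in C++). The function name `should_skip`, the `MAX_PDF_BYTES` limit, and the tuple layout of the sample records are all hypothetical and chosen just for the example.

```python
# Hypothetical limit; the actual cutoff used by the workaround is not stated here.
MAX_PDF_BYTES = 20 * 1024 * 1024


def should_skip(content_type: str, payload_size: int,
                max_pdf_bytes: int = MAX_PDF_BYTES) -> bool:
    """Return True for PDF payloads larger than max_pdf_bytes."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "application/pdf" and payload_size > max_pdf_bytes


# Made-up (content type, payload size in bytes) pairs standing in for WARC records.
records = [
    ("text/html; charset=utf-8", 50_000),
    ("application/pdf", 300_000),
    ("application/pdf", 500 * 1024 * 1024),
]

# Only the oversized PDF is dropped; small PDFs and other types pass through.
kept = [record for record in records if not should_skip(*record)]
```

Filtering on the declared size before parsing means the oversized payload never reaches the PDF extractor, at the cost of losing whatever text it contained.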