Ugly workaround for #16 so I can continue processing WARCs.
For ParaCrawl, I think this solution is sufficient. I do not expect troves of bilingual text in these massive PDFs, just a bunch of scanned books with, at best, mediocre OCR applied to them.