Closed adamjanovsky closed 8 months ago
@J08nY The datasets computed with pdftotext and pymupdf are available at Aura, sitting at /var/tmp/xjanovsk/certs/sec-certs/dataset/toy_dataset_100_certs; you should have read access there.
Performance-wise, processing seems slower with pymupdf, but I guess we don't care that much.
⚠️ EDIT: There's apparently some bug in the pymupdf processing. Please don't investigate the comparison until @dmacko232 fixes that.
@dmacko232 Do we know what the internal dependencies of the pymupdf package are? Could we drop the dependency on poppler if we make the switch?
Also, we're scanning some tables in FIPS documents with a Java tool. Could we get rid of the Java dependency as well?
@adamjanovsky Poppler is not a dependency. The Java thing should not be a dependency either, I guess, in the case of pymupdf.
Closing this. The details of what it would take for pymupdf to surpass pdftotext in terms of output quality are described in #364.
This closes #364