Feat/pymupdf experiment

crocs-muni / sec-certs

Tool for analysis of security certificates and their security targets (Common Criteria, NIST FIPS140-2...).

https://sec-certs.org

MIT License

Feat/pymupdf experiment #368

Closed: adamjanovsky closed this 8 months ago

adamjanovsky commented 8 months ago

This closes #364

adamjanovsky commented 8 months ago

@J08nY The datasets computed with pdftotext and pymupdf are available on Aura at /var/tmp/xjanovsk/certs/sec-certs/dataset/toy_dataset_100_certs; you should have read access there.

Performance-wise, processing seems slower with pymupdf, but I guess we don't care that much.

⚠️ EDIT: There's apparently some bug in pymupdf processing, please don't investigate the comparison until @dmacko232 fixes that.
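
For anyone who wants to rerun the comparison once the bug is fixed, here is a minimal sketch of how the two extractors could be timed and their outputs compared. It is not the code used in this PR: the helper names (`extract_with_pdftotext`, `extract_with_pymupdf`, `time_extractor`, `similarity`) are made up for illustration, and it assumes the `pdftotext` binary is on the PATH and that pymupdf is installed and importable as `fitz`.

```python
import difflib
import subprocess
import time
from typing import Callable, Tuple


def extract_with_pdftotext(pdf_path: str) -> str:
    """Hypothetical helper: call the pdftotext CLI and read the text from stdout."""
    # "-" as the output file tells pdftotext to write to stdout.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def extract_with_pymupdf(pdf_path: str) -> str:
    """Hypothetical helper: concatenate the plain text of every page via pymupdf."""
    import fitz  # imported lazily so the module loads without pymupdf installed

    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)


def time_extractor(extract: Callable[[str], str], pdf_path: str) -> Tuple[str, float]:
    """Run one extractor and return (text, elapsed seconds)."""
    start = time.perf_counter()
    text = extract(pdf_path)
    return text, time.perf_counter() - start


def similarity(text_a: str, text_b: str) -> float:
    """Similarity ratio in [0, 1] over whitespace-separated tokens."""
    # Comparing tokens instead of characters ignores pure layout differences
    # (extra spaces, different line breaks) between the two tools.
    matcher = difflib.SequenceMatcher(None, text_a.split(), text_b.split(), autojunk=False)
    return matcher.ratio()
```

Usage would be along the lines of `text_a, secs_a = time_extractor(extract_with_pdftotext, path)` and the same for pymupdf, followed by `similarity(text_a, text_b)`. A token-level ratio is only a rough proxy for output quality; it says how much the two outputs differ, not which one is better.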

adamjanovsky commented 8 months ago

@dmacko232 Do we know what the internal dependencies of the pymupdf package are? Could we drop the dependency on poppler if we make the switch?

Also, we're scanning some tables in FIPS documents with a Java tool. Could we get rid of the Java dependency as well?

dmacko232 commented 8 months ago

@adamjanovsky Poppler is not a dependency. The Java tool should not be a dependency either, I guess, in the case of pymupdf.
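
One way to double-check the dependency question locally is sketched below. The function names are hypothetical. It only reads the Python-level requirements declared in the installed package metadata and checks whether an executable is on the PATH, so it would not reveal native libraries bundled inside a wheel.

```python
import shutil
from importlib import metadata
from typing import List, Optional


def declared_requirements(package: str) -> Optional[List[str]]:
    """Return the requirements declared in the package metadata.

    Returns None when the package is not installed, and an empty list
    when it is installed but declares no Python-level requirements.
    """
    try:
        return metadata.requires(package) or []
    except metadata.PackageNotFoundError:
        return None


def on_path(executable: str) -> bool:
    """Report whether an external executable (e.g. pdftotext) is on the PATH."""
    return shutil.which(executable) is not None


if __name__ == "__main__":
    print("PyMuPDF requires:", declared_requirements("PyMuPDF"))
    print("pdftotext on PATH:", on_path("pdftotext"))
    print("java on PATH:", on_path("java"))
```

Running this in an environment with only pymupdf installed would show whether extraction still works without the poppler and Java executables present.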

adamjanovsky commented 8 months ago

Closing this. The details of what it would take for pymupdf to surpass pdftotext in terms of output quality are described in #364.