crocs-muni / sec-certs

Tool for analysis of security certificates and their security targets (Common Criteria, NIST FIPS140-2...).
https://sec-certs.org
MIT License

Evaluate possible switch from pdftotext to PyMuPDF #364

Closed adamjanovsky closed 8 months ago

adamjanovsky commented 9 months ago

We should evaluate whether it's worth switching from pdftotext to PyMuPDF for data extraction. We should:

  1. @dmacko232 provide code that replaces the pdftotext conversion with equivalent functionality implemented using PyMuPDF (a minimal sketch of this is below)
  2. @adamjanovsky run the new pipeline on full dataset and provide results to @J08nY
  3. @J08nY compare the outputs on both datasets, decide if we should switch

Note: @adamjanovsky will handle the second step instead of @GeorgeFI, who was originally announced for it.
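
A minimal sketch of what step 1 could look like, assuming the pdftotext Python bindings and PyMuPDF (imported as fitz) are installed; the function names and the file path are illustrative, not the actual sec-certs code:

```python
import fitz  # PyMuPDF
import pdftotext


def extract_with_pdftotext(pdf_path: str) -> str:
    """Current approach: pdftotext Python bindings (poppler)."""
    with open(pdf_path, "rb") as f:
        pages = pdftotext.PDF(f)
    return "\n\n".join(pages)


def extract_with_pymupdf(pdf_path: str) -> str:
    """Candidate replacement: plain-text extraction via PyMuPDF."""
    with fitz.open(pdf_path) as doc:
        return "\n\n".join(page.get_text("text") for page in doc)


if __name__ == "__main__":
    path = "some_security_target.pdf"  # illustrative path
    old = extract_with_pdftotext(path)
    new = extract_with_pymupdf(path)
    print(len(old), len(new))  # quick sanity check before diffing the outputs
```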

J08nY commented 9 months ago

I don't think this will happen in one iteration as described in this issue. I had a quick look at the tool and there are many ways of using it that will clearly affect performance on downstream tasks (like regex matching). See, for example: https://pymupdf.readthedocs.io/en/latest/recipes-text.html and the notes on extracting text with proper ordering, no bad line breaks, etc.

So perhaps we will have to run through this a few times and compare what is better/worse.
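
To illustrate, a small sketch of a few PyMuPDF extraction variants from that recipe page that can yield noticeably different text for the same page (the selection and the file name are mine, not a recommendation):

```python
import fitz  # PyMuPDF

doc = fitz.open("example.pdf")  # illustrative file name
page = doc[0]

# 1. Default "text" mode: text in the order it appears in the PDF content stream.
plain = page.get_text("text")

# 2. Same mode, but sorted into natural reading order (top-left to bottom-right).
sorted_text = page.get_text("text", sort=True)

# 3. "blocks" mode: (x0, y0, x1, y1, text, block_no, block_type) tuples,
#    which allow custom ordering/filtering before joining into one string.
blocks = page.get_text("blocks")
joined = "\n".join(b[4] for b in sorted(blocks, key=lambda b: (b[1], b[0])))

# Downstream regexes (e.g. certificate IDs split across lines) may match
# differently depending on which variant is chosen.
```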

dmacko232 commented 8 months ago

I added this in a new branch, feat/pymupdf_experiment. However, it seems that some FIPS tests are failing, so the text quality is most likely worse with PyMuPDF.

J08nY commented 8 months ago

Is this ready? Should I have a look?

dmacko232 commented 8 months ago

@J08nY I will let you know later today or early tomorrow; pdftotext and PyMuPDF should be close to ready (I just need to check that everything is fine). Since we also want to check pdfplumber, that part is not 100% ready yet. Anyway, the current plan is that I will also check it separately on the same documents as you.

dmacko232 commented 8 months ago

@J08nY Sorry that I am a bit late, I ended up having much less time available over the last 2 days than expected. The data is on aura in /var/tmp/xmacko1/master_thesis/code/sec-certs-nlp/data/toy_dataset_100_certs. There are the following subfolders:

* pymupdf -- PDF files processed using PyMuPDF (using the "rawDict" format setting, leveraging OCR for lines of text that are parsed incorrectly due to encoding issues, and using a computationally expensive algorithm to find tables)
* pdftotext_bbox_layout -- PDF files processed using pdftotext with the -bbox-layout flag, which is suitable for the search use case; the rationale was that it allows showing bounding boxes for search results BUT may also fix some issues such as words being split by hyphenation
* pdftotext_old -- the current implementation in sec-certs
* toy_pdfs -- the PDF files I computed this on

The adobe_extract and pdfplumber subfolders should be ignored for now.

For each of these there is a subfolder for both reports and targets. Then there are three more nested subfolders for each in the experiments:

* bbox -- contains the detected bounding boxes, relevant for my evaluation of bounding boxes
* text -- contains the processed text; this is what is relevant for you
* text_postprocessing -- currently empty; I hope to add text with some improvements here, but I didn't manage to get it working yet
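
For reference, a rough sketch of how the pymupdf and pdftotext_bbox_layout variants above could be produced, assuming the poppler pdftotext binary is on the PATH and PyMuPDF is installed; the function names and paths are illustrative, not the actual experiment code:

```python
import subprocess

import fitz  # PyMuPDF


def pdftotext_bbox_layout(pdf_path: str, out_path: str) -> None:
    """Run the poppler CLI with -bbox-layout, producing XHTML with word/line bounding boxes."""
    subprocess.run(["pdftotext", "-bbox-layout", pdf_path, out_path], check=True)


def pymupdf_rawdict(pdf_path: str) -> list:
    """Extract the PyMuPDF "rawdict" structure (blocks -> lines -> spans -> chars, with bboxes)."""
    with fitz.open(pdf_path) as doc:
        return [page.get_text("rawdict") for page in doc]
```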

The possible improvements apply to both pdftotext_bbox_layout and pymupdf.

As for the evaluation: I will also try to evaluate the texts AND the bounding boxes. If possible, I would like to ask you to write down why you think one algorithm's output is better than the others. When looking for metrics to evaluate this more rigorously, I found "metrics" such as:
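
Independently of whichever metrics those turn out to be, one generic way to quantify how much two extractor outputs differ is a plain character-level similarity ratio; a minimal sketch using only the standard library (the file paths are illustrative):

```python
from difflib import SequenceMatcher
from pathlib import Path

# Illustrative paths into the toy dataset described above.
old_text = Path("pdftotext_old/targets/text/cert.txt").read_text()
new_text = Path("pymupdf/targets/text/cert.txt").read_text()

# Ratio in [0, 1]: 1.0 means identical extractions, lower means larger differences.
similarity = SequenceMatcher(None, old_text, new_text).ratio()
print(f"character-level similarity: {similarity:.3f}")
```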

Additional note: in the PyMuPDF script I have a flag that enables/disables table extraction. I would expect the tables to be more reasonable than in pdftotext. However, PyMuPDF is way slower when extracting tables, and still slower than pdftotext even when not extracting them. On the other hand, it produces more metadata (font, ...) and also extracts images in the same pass. In PyMuPDF it is also easy to detect whether some paragraph contains illegal characters resulting from "garbage" text, which means we can OCR only that part of the text to fix it. I haven't managed to get that working properly for pdftotext.
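
A rough sketch of that idea, assuming PyMuPDF with a Tesseract installation for the OCR fallback; the garbage heuristic (looking for the Unicode replacement character), the flag name, and the use of the built-in find_tables helper are my assumptions, not necessarily what the experimental script does:

```python
import fitz  # PyMuPDF

GARBAGE_CHAR = chr(0xFFFD)  # Unicode replacement character, a sign of broken encoding


def extract_page(page: fitz.Page, extract_tables: bool = False) -> str:
    """Extract text; re-run the page through OCR only when garbage characters appear."""
    text = page.get_text("text")
    if GARBAGE_CHAR in text:
        # Targeted OCR fallback: only pages with encoding problems pay the OCR cost.
        ocr_tp = page.get_textpage_ocr(full=True)
        text = page.get_text("text", textpage=ocr_tp)
    if extract_tables:
        # Table detection is the part that makes extraction noticeably slower.
        for table in page.find_tables().tables:
            text += "\n" + "\n".join(
                "\t".join(cell or "" for cell in row) for row in table.extract()
            )
    return text
```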

adamjanovsky commented 8 months ago

Closing this as wont-fix-unless-we-find-time-for-this, thanks for your effort 👍.