crocs-muni / sec-certs

Tool for analysis of security certificates and their security targets (Common Criteria, NIST FIPS140-2...).
https://sec-certs.org
MIT License

Unpredictable order of data in documents fixed by OCR #279

Closed adamjanovsky closed 1 year ago

adamjanovsky commented 1 year ago

Describe the bug

On an occasion completely unrelated to this project, I needed to run some OCR. I used the function ocr_pdf_file https://github.com/crocs-muni/sec-certs/blob/4d4a1b61f8e77ee77d17eb129086ebed7ecb94ac/sec_certs/utils/pdf.py#L39-L64

and noticed that the pages of the resulting documents can come out in a somewhat random order (though not always). I suppose that this does not matter for our current data processing. But as soon as we apply NLP, snippets close to page endings or beginnings can make no sense.

Expected behavior

Well, the OCR output should follow the page order of the source PDF.

Fix

In my code, I fixed that with

image_paths = [x for x in tmp_dir_path.iterdir() if x.is_file() and x.suffix == ".ppm"]
image_paths = [(x, int(x.stem.split("-")[1].split(".")[0])) for x in image_paths]
image_paths = sorted(image_paths, key=lambda x: x[1])
image_paths = [x[0] for x in image_paths]

The same procedure should be applied to the .txt files. Then iterate over these sorted lists.
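The sorting could be wrapped in one helper that serves both the .ppm and the .txt files. This is only a sketch: the name sorted_by_page is made up for illustration, and it assumes the files are named `<prefix>-<page number><suffix>` (for example image-3.ppm), as the snippet above implies.

```python
import re
from pathlib import Path
from typing import List


def sorted_by_page(directory: Path, suffix: str) -> List[Path]:
    """Return the files with the given suffix, ordered by page number.

    Hypothetical helper. Assumes names like ``image-3.ppm``, where the
    page number follows the last hyphen of the file stem.
    """

    def page_number(path: Path) -> int:
        # Compare page numbers as integers, so page 10 sorts after page 2.
        match = re.search(r"-(\d+)$", path.stem)
        if match is None:
            raise ValueError(f"No page number in file name: {path.name}")
        return int(match.group(1))

    # iterdir() yields entries in arbitrary order, hence the explicit sort.
    files = [p for p in directory.iterdir() if p.is_file() and p.suffix == suffix]
    return sorted(files, key=page_number)
```

With that, the page images would be iterated as `for image_path in sorted_by_page(tmp_dir_path, ".ppm")`, and the text files the same way with `".txt"`. Sorting on the integer rather than on the file name keeps the order correct whether or not the page numbers are zero-padded.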

I also notice that we don't reconstruct the OCR into the original pdf file. Maybe that would be of some benefit? For that case, from PyPDF2 import PdfMerger is fairly easy to use (example below).

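from PyPDF2 import PdfMerger

merger = PdfMerger()
for pdf_path in pdf_paths:  # single-page PDFs, already sorted by page number
    merger.append(pdf_path)
merger.write(output_path)  # output_path is a placeholder for the merged PDF's destination
merger.close()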

J08nY commented 1 year ago

> I also notice that we don't reconstruct the OCR into the original pdf file. Maybe that would be of some benefit? For that case, from PyPDF2 import PdfMerger is fairly easy to use (example below).

While tesseract can create PDFs with the OCRed text overlaid on top of the source image, we deliberately do not use this feature. If we did, and merged the single-page output PDFs back into a whole PDF, we would somehow need to track whether the report/st PDF is the original one or was replaced by the OCRed one. Alternatively, we would need to store the OCRed PDFs separately and track that, which is just added complexity. I do not see enough benefit to justify it.