Unpredictable order of data in documents fixed by OCR

Describe the bug

On an occasion completely unrelated to this project, I needed to run some OCR. I used the function ocr_pdf_file https://github.com/crocs-muni/sec-certs/blob/4d4a1b61f8e77ee77d17eb129086ebed7ecb94ac/sec_certs/utils/pdf.py#L39-L64

and noticed that the pages of the resulting documents can end up in a somewhat random order (though not always). I suppose that for plain data processing we don't mind. But as soon as we leverage NLP, the snippets close to page endings / beginnings can make no sense.

Expected behavior

Well, the data should be ordered.

Fix

In my code, I fixed that with

# Collect the page images in the temporary directory (iterdir() gives no ordering guarantee)
image_paths = [x for x in tmp_dir_path.iterdir() if x.is_file() and x.suffix == ".ppm"]
# Sort numerically by the page number that follows the first "-" in the file name
image_paths = sorted(image_paths, key=lambda x: int(x.stem.split("-")[1].split(".")[0]))

The same procedure should be applied to the .txt files. Then iterate over these sorted lists.
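Since the same sorting is needed for both the .ppm and the .txt files, it could be factored into one helper. The following is only a minimal sketch, not the project's code: the names sorted_page_files and join_pages are hypothetical, and it assumes every file name ends in "-<page number>" before the suffix (for example image-07.ppm). It also takes the trailing number with a regular expression rather than splitting on the first "-", which is a slightly different rule from the snippet above.

```python
from __future__ import annotations

import re
from pathlib import Path


def sorted_page_files(directory: Path, suffix: str) -> list[Path]:
    """Return files with the given suffix, ordered by the page number in their name."""

    def page_number(path: Path) -> int:
        # Assumption: the stem ends in "-<digits>", e.g. "image-07".
        match = re.search(r"-(\d+)$", path.stem)
        if match is None:
            raise ValueError(f"No page number in file name: {path.name}")
        return int(match.group(1))

    candidates = [p for p in directory.iterdir() if p.is_file() and p.suffix == suffix]
    # Numeric sort, so page 10 comes after page 2 even without zero padding.
    return sorted(candidates, key=page_number)


def join_pages(directory: Path) -> str:
    """Concatenate the per-page OCR text files in page order."""
    pages = sorted_page_files(directory, ".txt")
    return "\n".join(p.read_text(encoding="utf-8") for p in pages)
```

Raising an error on a file name without a page number is a deliberate choice in this sketch: silently skipping such a file would hide a missing page.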

I also noticed that we don't reassemble the OCR output into a PDF file. Maybe that would be of some benefit? If so, PdfMerger (from PyPDF2 import PdfMerger) is fairly easy to use (example below).

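from PyPDF2 import PdfMerger
merger = PdfMerger()
for pdf_path in pdf_paths:  # per-page PDFs, assumed to be sorted by page number already
    merger.append(pdf_path)
merger.write("merged.pdf")  # output path is illustrative
merger.close()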
