euske / pdfminer

Python PDF Parser (Not actively maintained). Check out pdfminer.six.
https://github.com/pdfminer/pdfminer.six
MIT License

PDF Miner returns different results every time #306

Open aleksandar-devedzic opened 3 years ago

aleksandar-devedzic commented 3 years ago

I have noticed an issue with PDFMiner: it returns different results each time I run it on the same PDF document. This is my code:

import requests
from io import BytesIO
from pdfminer import high_level

def pdf_sublink_extraction(pdf_links, sleep):

    associatedTextList = []
    for pdf_link in pdf_links:
        print("pdf link", pdf_link, '\n')
        try:
            response = requests.get(pdf_link)
            print('response', response, '\n')
            with BytesIO(response.content) as data:

                num_of_pages = len(list(high_level.extract_pages(data)))

                full_pdf_text = high_level.extract_text(data, password='', page_numbers = None, maxpages = 5, codec='utf-8', caching=True, laparams=None)
                full_pdf_text = full_pdf_text.replace('\n\n\n\n', '\n').strip()

        except:
            full_pdf_text = "PDF File: " + pdf_link + "\n\nUnable to parse PDF file!"

    return full_pdf_text

print(pdf_sublink_extraction(['https://www.buelach.ch/fileadmin/files/documents/Finanzen/2016_2020_finanzplan.pdf'], 0))
print()
print()
print(pdf_sublink_extraction(['https://www.buelach.ch/fileadmin/files/documents/Finanzen/2016_2020_finanzplan.pdf'], 0))

I checked the results with this tool: https://www.diffchecker.com/diff

The two outputs are different. The differences are in the numbers on some lines.
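For anyone who wants to reproduce the comparison locally instead of using an online tool, here is a minimal sketch based on the standard-library `difflib` module. The helper name `diff_extractions` and the sample strings are hypothetical and are not taken from the PDF in question; in practice the two arguments would be the two strings returned by `pdf_sublink_extraction`.

```python
import difflib
from typing import List


def diff_extractions(first: str, second: str) -> List[str]:
    """Return unified-diff lines between two texts (empty list if identical)."""
    return list(
        difflib.unified_diff(
            first.splitlines(),
            second.splitlines(),
            fromfile="run_1",
            tofile="run_2",
            lineterm="",
        )
    )


# Hypothetical placeholder strings standing in for two extraction results.
run_1 = "first line 100\nsecond line 200"
run_2 = "first line 100\nsecond line 250"

# Lines prefixed with "-" appear only in run_1, "+" only in run_2.
for line in diff_extractions(run_1, run_2):
    print(line)
```

An empty result means the two extractions are identical, which makes it easy to repeat the check over many runs.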

Is that a bug, or am I doing something wrong?

kriffe commented 3 years ago

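If you are running a Python version older than 3.7, you might get non-deterministic behavior, because dictionaries are only guaranteed to preserve insertion order from Python 3.7 onward. https://stackoverflow.com/questions/14956313/why-is-dictionary-ordering-non-deterministic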

Try upgrading to 3.7 or later and see whether the results become consistent.
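A quick way to check both points is sketched below, using only the standard library. The helper names `dict_order_is_guaranteed` and `text_fingerprint` are hypothetical. The first reports whether the running interpreter is new enough for guaranteed dictionary ordering; the second produces a digest of the extracted text, so that outputs from separate runs can be compared without diffing the full text. This only helps narrow the problem down; it does not confirm that dictionary ordering is the actual cause here.

```python
import hashlib
import sys


def dict_order_is_guaranteed(version_info=sys.version_info) -> bool:
    """True when the language guarantees insertion-ordered dicts (3.7+)."""
    return tuple(version_info[:2]) >= (3, 7)


def text_fingerprint(text: str) -> str:
    """SHA-256 hex digest of a text, for comparing output across runs."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


print("Python", sys.version.split()[0])
print("dict order guaranteed:", dict_order_is_guaranteed())
```

Printing `text_fingerprint(full_pdf_text)` on each run and comparing the digests shows at a glance whether the extraction changed.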