cycomanic / Menextract2pdf

Extract Mendeley annotations to PDF files
GNU General Public License v3.0

PyPDF2.utils.PdfReadError: EOF marker not found #8

Open marieke-woensdregt opened 6 years ago

marieke-woensdregt commented 6 years ago

Hi cycomanic,

Thank you very much for creating this tool! Very handy, especially now that Mendeley has started encrypting its local database.

I'm running menextract2pdf.py on a Mac and have installed the dependencies you listed. I'm extracting PDFs from Mendeley version 1.18. The script runs perfectly fine for a bunch of papers, but at some point it hits a PDF of a book chapter and gives me the following error:

Traceback (most recent call last):
  File "menextract2pdf.py", line 184, in <module>
    mendeley2pdf(fn, dir_pdf)
  File "menextract2pdf.py", line 168, in mendeley2pdf
    processpdf(fn, os.path.join(dir_pdf, os.path.basename(fn)), annons)
  File "menextract2pdf.py", line 141, in processpdf
    inpdf = PyPDF2.PdfFileReader(open(fn, 'rb'), strict=False)
  File "/Users/pplsuser/anaconda2/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1084, in __init__
    self.read(stream)
  File "/Users/pplsuser/anaconda2/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1696, in read
    raise utils.PdfReadError("EOF marker not found")
PyPDF2.utils.PdfReadError: EOF marker not found

(Apologies for the messed-up formatting.)

Thanks, Marieke

cycomanic commented 6 years ago

Hi Marieke,

I suspect this is a PyPDF2 issue. Could you make the PDF available to me so that I can try to debug it?

Regards Jochen
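
In the meantime, a quick way to narrow this down: "EOF marker not found" usually points to a file that does not end with the %%EOF marker a PDF is expected to have, for example because it was truncated. The sketch below only looks for that marker near the end of a file and does not use PyPDF2 at all. The function name has_eof_marker and the 1024-byte window are assumptions made for illustration, not anything from menextract2pdf.py.

```python
import os


def has_eof_marker(path, tail_size=1024):
    """Return True if b'%%EOF' occurs in the last tail_size bytes of the file.

    Hypothetical helper: the name and the default window size are
    illustrative assumptions, not part of PyPDF2 or menextract2pdf.py.
    """
    with open(path, "rb") as f:
        # Find the file size, then read only the tail of the file.
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(0, size - tail_size))
        return b"%%EOF" in f.read()
```

A file for which this returns False is a likely candidate for the error above. A file for which it returns True may still be damaged in other ways.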

ludzeller commented 6 years ago

I have a similar problem. I can process about 30 of my 1300 PDFs from Mendeley, but then your code (or rather the PyPDF2 library) chokes on one particular PDF. I assume it fails while reading the file, but I'm just guessing.

Traceback (most recent call last):
  File "menextract2pdf.py", line 184, in <module>
    mendeley2pdf(fn, dir_pdf)
  File "menextract2pdf.py", line 168, in mendeley2pdf
    processpdf(fn, os.path.join(dir_pdf, os.path.basename(fn)), annons)
  File "menextract2pdf.py", line 147, in processpdf
    inpdf._flatten()
  File "/Users/ludwig/miniconda2/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1570, in _flatten
    pages = catalog["/Pages"].getObject()
  File "/Users/ludwig/miniconda2/lib/python2.7/site-packages/PyPDF2/generic.py", line 518, in __getitem__
    return dict.__getitem__(self, key).getObject()
  File "/Users/ludwig/miniconda2/lib/python2.7/site-packages/PyPDF2/generic.py", line 179, in getObject
    return self.pdf.getObject(self).getObject()
  File "/Users/ludwig/miniconda2/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1657, in getObject
    retval = self._getObjectFromStream(indirectReference)
  File "/Users/ludwig/miniconda2/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1607, in _getObjectFromStream
    streamData = BytesIO(b_(objStm.getData()))
  File "/Users/ludwig/miniconda2/lib/python2.7/site-packages/PyPDF2/generic.py", line 843, in getData
    decoded._data = filters.decodeStreamData(self)
  File "/Users/ludwig/miniconda2/lib/python2.7/site-packages/PyPDF2/filters.py", line 360, in decodeStreamData
    data = FlateDecode.decode(data, stream.get("/DecodeParms"))
  File "/Users/ludwig/miniconda2/lib/python2.7/site-packages/PyPDF2/filters.py", line 113, in decode
    data = decompress(data)
  File "/Users/ludwig/miniconda2/lib/python2.7/site-packages/PyPDF2/filters.py", line 51, in decompress
    return zlib.decompress(data)
zlib.error: Error -3 while decompressing data: incorrect header check

Unfortunately, your logging doesn't say which PDF file it is trying to open. I have no knowledge of Python, otherwise I would log/print it myself...

Could you maybe add a quick verbose mode or something else so that we can identify the problematic PDFs and send them to you for inspection?

Thanks a lot! Let's leave Mendeley! :)

cycomanic commented 6 years ago

Hi Ludzeller,

You can try putting print(fn) just before the call processpdf(fn, os.path.join(dir_pdf, os.path.basename(fn)), annons) on line 177 of menextract2pdf.py. The last filename printed before the traceback is the PDF that causes the problem.
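
If you would rather not stop at the first bad file, the same idea can be taken one step further: print each filename, catch the error, and carry on. This is only a minimal sketch and is not code from menextract2pdf.py. The names process_all and process_one are hypothetical, and process_one stands in for whatever is called per file (in the script, that would be the processpdf call).

```python
def process_all(filenames, process_one):
    """Run process_one(fn) on every file and return the files that failed.

    Both names are hypothetical stand-ins used for illustration only.
    """
    failed = []
    for fn in filenames:
        print(fn)  # show which PDF is being processed right now
        try:
            process_one(fn)
        except Exception as exc:
            # Report the failing file and keep going with the rest.
            print("FAILED: {0} ({1}: {2})".format(fn, type(exc).__name__, exc))
            failed.append(fn)
    return failed
```

The returned list then contains exactly the PDFs worth sending over for inspection, while all other files are still processed.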