Open marieke-woensdregt opened 6 years ago
Hi Marieke,
I suspect this is a pypdf issue. Could you make the PDF available to me so that I can try to debug it?
Regards Jochen
I have a similar problem. I can process about 30 of my 1300 PDFs from Mendeley, but at one PDF your code (or rather the pypdf library) seems to struggle. I assume it happens while reading the file, but I am just guessing.
Traceback (most recent call last):
  File "menextract2pdf.py", line 184, in <module>
    mendeley2pdf(fn, dir_pdf)
  File "menextract2pdf.py", line 168, in mendeley2pdf
    processpdf(fn, os.path.join(dir_pdf, os.path.basename(fn)), annons)
  File "menextract2pdf.py", line 147, in processpdf
    inpdf._flatten()
  File "/Users/ludwig/miniconda2/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1570, in _flatten
    pages = catalog["/Pages"].getObject()
  File "/Users/ludwig/miniconda2/lib/python2.7/site-packages/PyPDF2/generic.py", line 518, in __getitem__
    return dict.__getitem__(self, key).getObject()
  File "/Users/ludwig/miniconda2/lib/python2.7/site-packages/PyPDF2/generic.py", line 179, in getObject
    return self.pdf.getObject(self).getObject()
  File "/Users/ludwig/miniconda2/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1657, in getObject
    retval = self._getObjectFromStream(indirectReference)
  File "/Users/ludwig/miniconda2/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1607, in _getObjectFromStream
    streamData = BytesIO(b_(objStm.getData()))
  File "/Users/ludwig/miniconda2/lib/python2.7/site-packages/PyPDF2/generic.py", line 843, in getData
    decoded._data = filters.decodeStreamData(self)
  File "/Users/ludwig/miniconda2/lib/python2.7/site-packages/PyPDF2/filters.py", line 360, in decodeStreamData
    data = FlateDecode.decode(data, stream.get("/DecodeParms"))
  File "/Users/ludwig/miniconda2/lib/python2.7/site-packages/PyPDF2/filters.py", line 113, in decode
    data = decompress(data)
  File "/Users/ludwig/miniconda2/lib/python2.7/site-packages/PyPDF2/filters.py", line 51, in decompress
    return zlib.decompress(data)
zlib.error: Error -3 while decompressing data: incorrect header check
Unfortunately, your logging doesn't say which PDF file it is trying to open. I have no knowledge of Python, otherwise I would log/print it myself...
Could you maybe add a quick verbose mode or something else so that we can identify the problematic PDFs and send them to you for inspection?
Thanks a lot! Let's leave Mendeley! :)
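For readers wondering what the last line of the traceback above means: the error comes from Python's standard zlib module, which rejects input that does not begin with a valid zlib header. The following is a minimal sketch that reproduces the same kind of error with made-up bytes. It says nothing about why the stream in that particular PDF is unreadable (it could be corrupted, or encoded in a way the library does not expect); it only illustrates where the message originates.

```python
import zlib

# A well-formed zlib stream round-trips without error.
good = zlib.compress(b"example stream contents")
assert zlib.decompress(good) == b"example stream contents"

# Bytes that do not start with a valid zlib header are rejected.
# This payload is invented purely for illustration.
bad = b"\x00\x01 this is not zlib data"

message = None
try:
    zlib.decompress(bad)
except zlib.error as exc:
    # Keep the error text so it can be inspected or logged.
    message = str(exc)

print(message)
```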
Hi Ludzeller,
you can try putting print(fn) just before the call processpdf(fn, os.path.join(dir_pdf, os.path.basename(fn)), annons) on line 177 of menextract2pdf.py.
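If a single print is not enough, the same idea can be taken one step further: print each filename, catch the error, and keep going, so that one run lists every problematic PDF. The sketch below is not code from menextract2pdf.py. The names process_all, process_one and fake_process are hypothetical, and process_one stands in for whatever per-file call the script makes.

```python
import traceback


def process_all(filenames, process_one):
    """Call process_one on each file, printing the filename first.

    Returns a list of (filename, error message) pairs for the files that failed.
    """
    failures = []
    for fn in filenames:
        print("processing: {}".format(fn))
        try:
            process_one(fn)
        except Exception as exc:  # broad on purpose: log the failure and move on
            traceback.print_exc()
            failures.append((fn, "{}: {}".format(type(exc).__name__, exc)))
    return failures


# Hypothetical stand-in for the real per-file processing step.
def fake_process(fn):
    if fn.endswith("bad.pdf"):
        raise ValueError("EOF marker not found")


failed = process_all(["a.pdf", "bad.pdf", "c.pdf"], fake_process)
for fn, err in failed:
    print("failed: {} ({})".format(fn, err))
```

Catching every exception is a deliberate trade-off here: it hides nothing (the full traceback is still printed) but lets the remaining files be processed.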
Hi cycomanic,
Thank you very much for creating this tool! Very handy, especially now that Mendeley has started encrypting its local database.
I'm running menextract2pdf.py on a Mac and have installed the dependencies you listed. I'm extracting PDFs from Mendeley version 1.18. menextract2pdf.py runs perfectly fine for a bunch of papers, but at some point it hits a PDF of a book chapter and gives me the following error:
Traceback (most recent call last):
  File "menextract2pdf.py", line 184, in <module>
    mendeley2pdf(fn, dir_pdf)
  File "menextract2pdf.py", line 168, in mendeley2pdf
    processpdf(fn, os.path.join(dir_pdf, os.path.basename(fn)), annons)
  File "menextract2pdf.py", line 141, in processpdf
    inpdf = PyPDF2.PdfFileReader(open(fn, 'rb'), strict=False)
  File "/Users/pplsuser/anaconda2/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1084, in __init__
    self.read(stream)
  File "/Users/pplsuser/anaconda2/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1696, in read
    raise utils.PdfReadError("EOF marker not found")
PyPDF2.utils.PdfReadError: EOF marker not found
(Apologies for the messed-up formatting.)
Thanks, Marieke
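One way to narrow down an "EOF marker not found" error like the one above is to check whether the file ends with the %%EOF marker that a complete PDF is expected to carry. The sketch below is a rough pre-check using only the standard library. It is not PyPDF2's own logic, the function name has_eof_marker and the 1024-byte window are assumptions made for illustration, and a missing marker is only one possible explanation (for example an incomplete download).

```python
import os
import tempfile


def has_eof_marker(path, tail_size=1024):
    """Return True if b"%%EOF" appears in the last tail_size bytes of the file."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        # Read only the tail of the file; the window size is an arbitrary choice.
        f.seek(max(0, size - tail_size))
        return b"%%EOF" in f.read()


# Tiny demonstration with two made-up files (not real PDFs).
tmpdir = tempfile.mkdtemp()
complete = os.path.join(tmpdir, "complete.pdf")
truncated = os.path.join(tmpdir, "truncated.pdf")

with open(complete, "wb") as f:
    f.write(b"%PDF-1.4\n...body...\n%%EOF\n")
with open(truncated, "wb") as f:
    f.write(b"%PDF-1.4\n...body cut off")

print(has_eof_marker(complete), has_eof_marker(truncated))
```

Running such a check over the Mendeley PDF folder before extraction would give a list of candidate files to send along for inspection.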