cycomanic / Menextract2pdf

Extract Mendeley annotations to PDF files
GNU General Public License v3.0

zlib.error: Error -3 while decompressing data: incorrect header check #12

Open ammoniac1984 opened 5 years ago

ammoniac1984 commented 5 years ago

Hi, I am a long-time Mac Mendeley user, but I have become extremely fed up with the various bugs and limitations of Mendeley, so I have decided to try to switch to Zotero. The problem is that I have 10 years of annotated (and highlighted) PDFs that I cannot lose in the conversion process. I have tried running the .sh from my macOS Sierra terminal, but it does not work. The only command that starts some sort of process is:

python3 menextract2pdf.py mydatabase.sqlite mypdffolder/ --overwrite

The overwriting of PDFs works for a while, and about a third of my 2800 files get modified with the highlighting as they should. But then the process stops and I get the following error message:

Traceback (most recent call last):
  File "menextract2pdf.py", line 193, in <module>
    mendeley2pdf(fn, dir_pdf)
  File "menextract2pdf.py", line 177, in mendeley2pdf
    processpdf(fn, os.path.join(dir_pdf, os.path.basename(fn)), annons)
  File "menextract2pdf.py", line 156, in processpdf
    inpdf._flatten()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/PyPDF2/pdf.py", line 1506, in _flatten
    pages = catalog["/Pages"].getObject()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/PyPDF2/generic.py", line 516, in __getitem__
    return dict.__getitem__(self, key).getObject()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/PyPDF2/generic.py", line 178, in getObject
    return self.pdf.getObject(self).getObject()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/PyPDF2/pdf.py", line 1593, in getObject
    retval = self._getObjectFromStream(indirectReference)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/PyPDF2/pdf.py", line 1543, in _getObjectFromStream
    streamData = BytesIO(b_(objStm.getData()))
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/PyPDF2/generic.py", line 841, in getData
    decoded._data = filters.decodeStreamData(self)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/PyPDF2/filters.py", line 346, in decodeStreamData
    data = FlateDecode.decode(data, stream.get("/DecodeParms"))
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/PyPDF2/filters.py", line 111, in decode
    data = decompress(data)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/PyPDF2/filters.py", line 49, in decompress
    return zlib.decompress(data)
zlib.error: Error -3 while decompressing data: incorrect header check

Thanks in advance for your help! Max

cycomanic commented 5 years ago

Hi Max, this looks like either the PDF file is somehow corrupt or it is a bug in PyPDF2. To figure out which file is causing the issue, try putting

print(fn)

before the processpdf(fn, os.path.join(dir_pdf, os.path.basename(fn)), annons) on line 177 in menextract2pdf.py. If you could share the file I can try to debug. Cheers Jochen
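For context on the last line of the traceback: zlib.decompress raises this error whenever the first two bytes it is handed do not form a valid zlib header, which is what one would expect from a damaged stream or from stream data that is still encrypted. The sketch below reproduces the message with the standard library only; the helper name inflate_error and the sample bytes are made up for illustration and are not part of menextract2pdf.py or PyPDF2.

```python
import zlib


def inflate_error(data: bytes):
    """Return the zlib error message for `data`, or None if it inflates cleanly."""
    try:
        zlib.decompress(data)
    except zlib.error as exc:
        return str(exc)
    return None


# A well-formed zlib stream inflates without complaint.
good = zlib.compress(b"stream contents")

# Hypothetical stand-in for a damaged or still-encrypted stream: the first
# two bytes are overwritten so they no longer form a valid zlib header.
bad = b"\x00\x01" + good[2:]

print(inflate_error(good))
print(inflate_error(bad))
```

If the second call reports the same "incorrect header check" text, that suggests the problem lies in the bytes PyPDF2 passes to zlib for that particular file rather than in zlib itself.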

ammoniac1984 commented 5 years ago

Hi Jochen

Thank you so much for addressing my problem! I appreciate it very much. I added "print(fn)" before line 177 and ran the .py file again, and I get the same error message. I do not see a difference in the printed output in my terminal window.

Here is a link to a fairly recent backup of my database:

I had upgraded to the newest version of Mendeley which encrypted the database so I had to look for older backups from the spring before the update. This database file is the copy I had on my office computer. Not the same one with which I worked last week when I posted this query on GitHub, but running menextract2pdf on this database produces the same error as the other version I have at home. The only difference is that the script does not seem to process the bibliographic entries in the same order, so it appears as if it does not stop on the same entry (but that might not be the case and just me not understanding how the script works).

Thanks again!

Cheers!

Maxime

cycomanic commented 5 years ago

Hi Maxime,

The print statement should give us the filename of the offending PDF file, not fix the error. Can you copy and paste the full error? I suspect the filename simply got lost in all the output. Unfortunately, the database does not help, as the error is related to one of the PDF files.

ammoniac1984 commented 5 years ago

Hi Jochen,

Thank you for your reply. Here is the complete printout copy/pasted from my terminal window. Can you see something in there?

Thanks!

Maxime

...

cycomanic commented 5 years ago

So it prints out the filenames (the /Users/maxime/Library ... bits); however, what you pasted is cut off before the file in question. Could you just paste the error message at the end and the last filename that appears before it?

ammoniac1984 commented 5 years ago

Hi, This?

/Users/maxime/Library/Application Support/Mendeley Desktop/Downloaded/Gingras - 2010 - Naming without Necessity On the Genealogy and Uses of the Label ‘Historical Epistemology’.pdf
Traceback (most recent call last):
  File "/Users/maxime/Downloads/Menextract2pdf-master/menextract2pdf.py", line 185, in <module>
    mendeley2pdf(fn, dir_pdf)
  File "/Users/maxime/Downloads/Menextract2pdf-master/menextract2pdf.py", line 169, in mendeley2pdf
    processpdf(fn, os.path.join(dir_pdf, os.path.basename(fn)), annons)
  File "/Users/maxime/Downloads/Menextract2pdf-master/menextract2pdf.py", line 147, in processpdf
    inpdf._flatten()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/PyPDF2/pdf.py", line 1506, in _flatten
    pages = catalog["/Pages"].getObject()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/PyPDF2/generic.py", line 516, in __getitem__
    return dict.__getitem__(self, key).getObject()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/PyPDF2/generic.py", line 178, in getObject
    return self.pdf.getObject(self).getObject()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/PyPDF2/pdf.py", line 1593, in getObject
    retval = self._getObjectFromStream(indirectReference)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/PyPDF2/pdf.py", line 1543, in _getObjectFromStream
    streamData = BytesIO(b_(objStm.getData()))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/PyPDF2/generic.py", line 841, in getData
    decoded._data = filters.decodeStreamData(self)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/PyPDF2/filters.py", line 346, in decodeStreamData
    data = FlateDecode.decode(data, stream.get("/DecodeParms"))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/PyPDF2/filters.py", line 111, in decode
    data = decompress(data)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/PyPDF2/filters.py", line 49, in decompress
    return zlib.decompress(data)
zlib.error: Error -3 while decompressing data: incorrect header check

cycomanic commented 5 years ago

Could you share the file: Gingras - 2010 - Naming without Necessity On the Genealogy and Uses of the Label ‘Historical Epistemology’.pdf? That seems to be the one causing the issues.

ammoniac1984 commented 5 years ago

Hi, Yes, here it is: https://www.dropbox.com/s/3zsa41sxd360hjf/Gingras%20-%202010%20-%20Naming%20without%20Necessity%20On%20the%20Genealogy%20and%20Uses%20of%20the%20Label%20%E2%80%98Historical%20Epistemology%E2%80%99.pdf?dl=0

Just an observation: when I ran the script on a different version of the database (at work), it would stop at another file. I have not been able to work out in which order the script deals with the files.

Thanks again!!

Maxime

folofjc commented 4 years ago

I am getting the same error on a PDF. I can open it in Acrobat and in Mendeley just fine. I can also manually export this file with its annotations through Mendeley.

Is it possible to somehow just keep going through all the files, so that I can then manually export the ones that fail? How do I get it to skip this file on this error instead of crashing?

Thanks!
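The skip-and-continue behaviour asked about above could be sketched roughly as follows. This is only an illustration under assumptions: process_all and fake_process are hypothetical names that do not exist in menextract2pdf.py, and the real per-file call would be the processpdf(...) line shown in the traceback. The idea is to wrap that call in try/except, remember which files failed, and carry on with the rest.

```python
import zlib


def process_all(filenames, process_one):
    """Call process_one(fn) for each file and collect failures instead of stopping."""
    failed = []
    for fn in filenames:
        try:
            process_one(fn)
        except Exception as exc:  # deliberately broad: any per-file failure is skipped
            print(f"Skipping {fn}: {exc}")
            failed.append((fn, str(exc)))
    return failed


def fake_process(fn):
    """Hypothetical stand-in for the real per-file call; fails on one name."""
    if fn == "b.pdf":
        raise zlib.error("Error -3 while decompressing data: incorrect header check")


failures = process_all(["a.pdf", "b.pdf", "c.pdf"], fake_process)
print(failures)
```

The returned list would then be the set of files to export by hand from Mendeley.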

folofjc commented 4 years ago

Also, it looks like for both me and @ammoniac1984, it is happening when PyPDF2 thinks the file is encrypted. There is a comment in menextract2pdf.py that says that overriding the encryption worked in the one case you saw. Maybe that is not working for us?
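A rough way to spot such files in advance, without PyPDF2, is to look for an /Encrypt entry in the raw bytes, since an encrypted PDF refers to its encryption dictionary from the trailer. The sketch below is a heuristic only: looks_encrypted is a made-up helper name, the sample bytes are invented and are not complete PDFs, and the check can give false positives if the literal text /Encrypt happens to appear elsewhere in a file.

```python
def looks_encrypted(pdf_bytes: bytes) -> bool:
    """Heuristic: True if the raw PDF bytes mention an /Encrypt entry."""
    return b"/Encrypt" in pdf_bytes


# Invented minimal trailers for illustration; real files would be read with
# open(path, "rb").read() or pathlib.Path(path).read_bytes().
protected = b"%PDF-1.4\ntrailer\n<< /Root 1 0 R /Encrypt 5 0 R >>\n%%EOF"
plain = b"%PDF-1.4\ntrailer\n<< /Root 1 0 R >>\n%%EOF"

print(looks_encrypted(protected))
print(looks_encrypted(plain))
```

Files flagged this way could be set aside and handled manually before running the script on the rest.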

folofjc commented 4 years ago

So the PDF that it is hiccuping on for me opens in Adobe Acrobat and Evince just fine. However, when I tried to open it with pdftk, it said that it had password protection and would not open it. Here is what the security details look like in Adobe:

[Screenshot: Annotation 2019-10-08 202043]

So it says that it is encrypted, but opens it just fine. My way around it was to make a LaTeX file that simply includes this file and then writes it out. The resulting file is not encrypted. Here is what the file made from LaTeX looks like in Adobe:

[Screenshot: Annotation 2019-10-08 202736]

I then screwed up by trying to add this to Mendeley and delete the other file, but that deleted all my annotations from the database. I guess the annotations are tied to the specific file?

Luckily I had a backup. Unfortunately, the deletion sync'd to Mendeley's servers first. So I had to disconnect from the internet, copy over my backup of the database, open Mendeley, make a backup, then close Mendeley, reconnect to the internet, and open it again (at which point it sync'd and re-deleted my annotations). Then I restored (which deleted the database both locally and on the servers), which brought back my annotations (and the "encrypted" file). So then I closed Mendeley before it could sync the new files. Then I replaced the PDF with the unencrypted one, started it again, and it appears to be okay. Then it sync'd the backup (but with the unencrypted PDF) back to their servers. But I think I am okay now.

dchakro commented 3 years ago

I think this issue can be marked as closed, as the workaround suggested by @folofjc works: replacing the file that has "Password Security" with a copy that has "No Security" fixes it. What I did (on macOS) was to print the file as a PDF to the desktop (it then had "None" listed as security in the file properties in Finder). Then I overwrote the old file with this new file, ran the script again, and it worked.

folofjc commented 3 years ago

I don't know how it works on macOS, but on Windows, printing to PDF turns the pages into images, so you would lose any "text as text." The nice thing about going through LaTeX is that if it is text, it keeps it as text.