Closed irishryoon closed 3 years ago
I get the same error. As you suspected, it seems to be a problem with PyPDF2: I can read new_docket.pdf just fine with pdfquery. We could use pdfquery instead of PyPDF2, but we would probably need to redefine all the regular expressions, because pdfquery parses the text differently.
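To illustrate what switching might involve, here is a minimal sketch, not the project's actual code. It assumes pdfquery is installed and that iterating over `LTTextLineHorizontal` elements gives usable text lines for these dockets. The "Date Filed" field and the helper names are hypothetical, chosen only to show a pattern that tolerates differences in how an extractor places whitespace and line breaks.

```python
import re


def extract_lines_pdfquery(path):
    """Return the text lines of a PDF using pdfquery.

    pdfquery is a third-party package, imported lazily so the
    regex helpers below work without it.
    """
    import pdfquery  # assumption: installed via `pip install pdfquery`

    pdf = pdfquery.PDFQuery(path)
    pdf.load()
    # Each matched element is an lxml node; its text may be None.
    return [(el.text or "").strip() for el in pdf.pq("LTTextLineHorizontal")]


def normalize(text):
    """Collapse runs of whitespace so patterns do not depend on extractor spacing."""
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical field: "Date Filed" is an assumption, not taken from a real docket.
DATE_FILED = re.compile(r"Date Filed:\s*(\d{2}/\d{2}/\d{4})")


def find_date_filed(text):
    """Return the date following the hypothetical label, or None if absent."""
    match = DATE_FILED.search(normalize(text))
    return match.group(1) if match else None
```

Normalizing whitespace before matching would not remove the need to revisit the regular expressions, but it might reduce how many of them depend on one library's exact output.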
Hi @bertamb, would you be interested in taking this issue? I can also help with some of the regex stuff if needed.
How annoying! Just confirming that as I start to work on downloading the rest of the 2020 dockets and court summaries the script is indeed failing to process each new one.
BTW @bertamb's parse_court.py still works fine on the new court summaries. The only reason download.py is failing on the court summaries is that I set it to save the docket and court summary only as a pair (never just one of them if the other fails), to make it easier to keep track of which cases we have all the information for.
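For anyone unfamiliar with that behavior, the "save both or neither" rule can be sketched roughly as follows. This is an illustration under assumptions, not the code in download.py: `save_case`, its parameters, and the JSON output file names are all hypothetical.

```python
import json
from pathlib import Path


def save_case(case_id, docket_pdf, summary_pdf, parse_docket, parse_summary, out_dir):
    """Parse both PDFs and write the results only if both parses succeed.

    `parse_docket` and `parse_summary` are callables supplied by the caller
    (hypothetical stand-ins for the project's parsers).
    """
    try:
        docket = parse_docket(docket_pdf)
        summary = parse_summary(summary_pdf)
    except Exception as err:
        # One failure skips the whole case, so no partial output is written.
        print(f"{case_id}: skipped ({err})")
        return False

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / f"{case_id}_docket.json").write_text(json.dumps(docket))
    (out / f"{case_id}_summary.json").write_text(json.dumps(summary))
    return True
```

With this shape, a docket parsing failure also prevents the court summary from being saved, which matches the symptom described above even though the court summary parser itself is fine.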
I have two PDF copies of the same docket: one downloaded at the end of October ('old_docket.pdf') and one downloaded yesterday ('new_docket.pdf'). Attached: new_docket.pdf, old_docket.pdf
'parse_docket.py' runs fine on 'old_docket.pdf'. However, when I run it on 'new_docket.pdf', parsing fails and returns the following warnings:
Similarly, parsing fails on any new docket files I download.
My guess is that PyPDF2 runs fine on the old docket PDF but not on the new docket file.
Not sure if this is specific to my environment. Can anyone else reproduce the issue?
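If it helps anyone reproduce this, a quick check along these lines might separate a PyPDF2 extraction problem from a problem in the parsing code. It is a sketch under assumptions: it uses `PdfReader` and `extract_text()`, which exist in PyPDF2 2.0 and later (older releases use different names), and `compare_extraction` is a hypothetical helper, not part of parse_docket.py.

```python
def extract_text_pypdf2(path):
    """Extract the text of every page with PyPDF2.

    Assumption: PyPDF2 >= 2.0, where PdfReader and extract_text() exist.
    The import is lazy so the comparison helper works without PyPDF2.
    """
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def compare_extraction(paths, extract=extract_text_pypdf2):
    """Run `extract` on each path and report success or the error raised."""
    report = {}
    for path in paths:
        try:
            text = extract(path)
            report[path] = f"ok, {len(text)} characters"
        except Exception as err:
            report[path] = f"failed: {type(err).__name__}: {err}"
    return report


if __name__ == "__main__":
    results = compare_extraction(["old_docket.pdf", "new_docket.pdf"])
    for path, status in results.items():
        print(path, "->", status)
```

A failure, or a suspiciously small character count for the new file only, would support the guess that extraction is the problem. Passing a different `extract` function makes it possible to run the same comparison with another library.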