parse_docket.py fails on any newly downloaded docket PDF files

CodeForPhilly / pbf-scraping

Project for Philadelphia Bail Fund to scrape new criminal filings from municipal court

https://codeforphilly.github.io/pbf-scraping


parse_docket.py fails on any newly downloaded docket PDF files #63

Closed irishryoon closed 3 years ago

irishryoon commented 3 years ago

I have two PDF files of the same docket, one downloaded at the end of October ('old_docket.pdf') and one downloaded yesterday ('new_docket.pdf'). Attached: new_docket.pdf, old_docket.pdf

'parse_docket.py' runs fine on 'old_docket.pdf'. However, when I run it on 'new_docket.pdf', parsing fails with the following warnings:

Warning: could not parse docket_no
Warning: could not parse dob
Warning: could not parse arrest_date
Warning: could not parse case_status
Warning: could not parse arresting_officer
Warning: could not parse attorney
Warning: could not parse prelim hearing date/time

Similarly, parsing fails on any new docket files I download.

My guess is that PyPDF2 can read the old docket PDF correctly but not the new one.

Not sure if this is specific to my environment. Can anyone else reproduce the issue?
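One way to check that guess across environments is to look at the raw text the extractor returns before any regular expression runs. Below is a minimal sketch, assuming the text has already been extracted into a string. `describe_extraction` and the label strings are hypothetical and are not part of parse_docket.py.

```python
def describe_extraction(text, labels=("Docket Number", "Date of Birth", "Arrest Date")):
    """Summarize extracted PDF text so two extractions can be compared.

    `labels` are hypothetical marker strings; substitute whatever
    headings the parser's regular expressions actually anchor on.
    """
    # Collapse whitespace so a label split across lines still matches.
    normalized = " ".join(text.split())
    lowered = normalized.lower()
    return {
        "n_chars": len(text),
        "n_lines": len(text.splitlines()),
        "is_empty": not normalized,
        "missing_labels": [lab for lab in labels if lab.lower() not in lowered],
    }
```

Running this on the text from both files and comparing the two dictionaries should show whether the new PDF yields empty text, reordered text, or text with the expected labels missing.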

bertamb commented 3 years ago

I get the same error. It seems to be a problem with PyPDF2, as you suspected. I am able to read new_docket.pdf just fine with pdfquery. We could use pdfquery instead of PyPDF2, but we would probably need to rewrite all the regular expressions, because pdfquery extracts the text in a different way.
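If the extractor is swapped, one way to limit how fragile the regular expressions are is to normalize whitespace before matching and keep all patterns in a single table. A rough sketch follows. The field names come from the warnings above, but the patterns, the label text, and the `parse_fields` helper are illustrative assumptions, not the project's actual expressions.

```python
import re

# Hypothetical patterns keyed by field name; real docket layouts may differ.
FIELD_PATTERNS = {
    "docket_no": re.compile(r"Docket Number:\s*([A-Z]{2}-\d{2}-[A-Z]{2}-\d{7}-\d{4})"),
    "dob": re.compile(r"Date Of Birth:\s*(\d{2}/\d{2}/\d{4})", re.IGNORECASE),
    "arrest_date": re.compile(r"Arrest Date:\s*(\d{2}/\d{2}/\d{4})", re.IGNORECASE),
}


def parse_fields(text, patterns=FIELD_PATTERNS):
    """Return (parsed, failed): matched values and names of unmatched fields."""
    # Collapse runs of spaces and newlines so extractor-specific
    # line breaks do not affect matching.
    normalized = re.sub(r"\s+", " ", text).strip()
    parsed, failed = {}, []
    for name, pattern in patterns.items():
        match = pattern.search(normalized)
        if match:
            parsed[name] = match.group(1)
        else:
            failed.append(name)
    return parsed, failed
```

The `failed` list plays the same role as the "could not parse" warnings, so the output of two extractors can be compared field by field.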

irishryoon commented 3 years ago

Hi @bertamb, would you be interested in taking this issue? I can also help with some of the regex stuff if needed.

adamrlinder commented 3 years ago

How annoying! Just confirming that as I start to work on downloading the rest of the 2020 dockets and court summaries the script is indeed failing to process each new one.

notchia commented 3 years ago

> How annoying! Just confirming that as I start to work on downloading the rest of the 2020 dockets and court summaries the script is indeed failing to process each new one.

BTW, @bertamb's parse_court.py still works fine on the new court summaries. The only reason download.py is failing on the court summaries is that I set it to save the docket and court summary only as a pair (never just one of them if the other fails), to make it easier to keep track of which ones we have all the information for.
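For anyone unfamiliar with that behavior, the save-both-or-neither idea can be sketched roughly as follows. This is not the code in download.py; `save_pair`, the file-naming scheme, and the use of `None` to signal a failed step are assumptions made for illustration.

```python
import os
import tempfile


def save_pair(docket_bytes, summary_bytes, out_dir, case_id):
    """Write the docket and court summary together, or write neither.

    Either argument may be None to signal that its step failed.
    Returns True only if both files were saved.
    """
    if docket_bytes is None or summary_bytes is None:
        return False

    targets = {
        os.path.join(out_dir, f"{case_id}_docket.pdf"): docket_bytes,
        os.path.join(out_dir, f"{case_id}_summary.pdf"): summary_bytes,
    }
    staged = []
    try:
        # Stage both files under temporary names first.
        for final_path, data in targets.items():
            fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".part")
            staged.append((tmp_path, final_path))
            with os.fdopen(fd, "wb") as fh:
                fh.write(data)
        # Rename to the final names only after both writes succeeded.
        for tmp_path, final_path in staged:
            os.replace(tmp_path, final_path)
        return True
    except OSError:
        # Remove any staged files that were not renamed.
        for tmp_path, _ in staged:
            if os.path.exists(tmp_path):
                os.remove(tmp_path)
        return False
```

Note that the two renames are not atomic as a pair: if the second rename fails, the first file stays in place. A stricter version would also need to roll back the first rename.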