CodeForPhilly / pbf-scraping

Project for Philadelphia Bail Fund to scrape new criminal filings from municipal court
https://codeforphilly.github.io/pbf-scraping

Distinct charges and statutes concatenated to one string #57

Closed irishryoon closed 3 years ago

irishryoon commented 3 years ago

In some dockets, distinct offenses and statutes are concatenated into one string.

For example, when I run 'parse_docket.py' on the attached docket file, it returns the following offenses: ['Murder', 'Criminal Attempt - Murder Criminal Attempt - Murder', 'Criminal Attempt - Murder Conspiracy', 'Conspiracy Conspiracy', 'Conspiracy Conspiracy', ... ]. Note that some (but not all) entries contain two distinct offenses in the same string, such as 'Criminal Attempt - Murder Conspiracy'.

Similarly, for the same docket file it returns the following list of statutes: ['18 § 2502', '18 § 901 §§ A 18 § 901 §§ A', ... ]. Again, some statutes have been concatenated into one string, for example '18 § 901 §§ A 18 § 901 §§ A'.

13270.pdf

notchia commented 3 years ago

I can take a stab at fixing this! I've been reading through/refactoring funcs_parse.py anyway.

notchia commented 3 years ago

I'm not able to reproduce this error on your example pdf. Could you try fetching the newest version of the master branch and seeing if this error is still occurring?

notchia commented 3 years ago

FYI I did find another error though, which is that if charges appeared on multiple pages, only the last page of charges would be returned. I will fix this also.
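As a rough illustration of that multi-page bug: the usual cause is assigning each page's charges to the result instead of appending to it. The sketch below is not the code in funcs_parse.py; `collect_charges` and `parse_page` are hypothetical names, and `parse_page` stands in for whatever function extracts the charges from a single page.

```python
def collect_charges(pages, parse_page):
    """Gather charges from every page, not only the last one.

    `pages` is any iterable of per-page inputs and `parse_page` is a
    hypothetical callable returning a list of charges for one page.
    """
    all_charges = []
    for page in pages:
        # extend() keeps charges from earlier pages; writing
        # `all_charges = parse_page(page)` here would overwrite them,
        # leaving only the final page's charges.
        all_charges.extend(parse_page(page))
    return all_charges
```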

irishryoon commented 3 years ago

I just tried again with the latest master branch, and I still get the same output. I'm attaching the output CSV file here. You can see that under the 'offense' and 'statute' columns, some (but not all) items are concatenated.

For example, the third item under 'offense' currently appears as 'Criminal Attempt - Murder Conspiracy'. This is supposed to be two separate items: 'Criminal Attempt - Murder' and 'Conspiracy'.

Similarly, the second item under 'statute' currently appears as '18 § 901 §§ A 18 § 901 §§ A'. This is supposed to be two separate items: '18 § 901 §§ A' and '18 § 901 §§ A'.

Let me know if you're able to reproduce the result.

output.csv.zip
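To check an output file for this problem, the concatenated statute entries described above can be detected with a regular expression. This is only a sketch: the pattern assumes statutes look like the examples in this issue (a title number, `§`, a section number, and an optional `§§` subsection), and `split_statutes` and `find_concatenated` are hypothetical helpers, not functions from the repository. Offense names have no comparable fixed shape, so the same trick does not carry over to the 'offense' column.

```python
import re

# Assumed statute shape, based only on the examples in this issue:
# "<title> § <section>" optionally followed by "§§ <subsection>".
STATUTE_RE = re.compile(r"\d+\s*§\s*\d+(?:\.\d+)?(?:\s*§§\s*\S+)?")


def split_statutes(text):
    """Return each statute citation found in `text` as its own string."""
    return [match.strip() for match in STATUTE_RE.findall(text)]


def find_concatenated(statutes):
    """Return the entries that contain more than one statute citation."""
    return [s for s in statutes if len(split_statutes(s)) > 1]


if __name__ == "__main__":
    parsed = ["18 § 2502", "18 § 901 §§ A 18 § 901 §§ A"]
    for entry in find_concatenated(parsed):
        print(entry, "->", split_statutes(entry))
```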

notchia commented 3 years ago

Huh, you're right: it's occurring on the master branch, but not in the version I'm working with, which is a few commits ahead. So I apparently fixed the issue, even though I didn't think I'd worked on anything that would address it (and I'm still not quite sure where the issue comes from). I'll PR and merge the fix today.