Add court summary parsing to download.py; fix docket parsing issues

CodeForPhilly / pbf-scraping

Project for Philadelphia Bail Fund to scrape new criminal filings from municipal court

https://codeforphilly.github.io/pbf-scraping


Add court summary parsing to download.py; fix docket parsing issues #58

Closed: notchia closed this 3 years ago

notchia commented 3 years ago

This should resolve issues 37 and 56:

- We now have race and sex data! When download.py is run, the saved CSV includes race and sex data in the same entry as the docket file information.
- Charge descriptions and statute numbers should no longer be incorrectly concatenated.
- Fixed an error in parsing charges: if charges appeared on multiple pages, only the last page of charges was returned.
- Added test functions to download.py, parse_docket.py, and parse_court.py.
- Did some additional refactoring of download.py, funcs_parse.py, parse_docket.py, and parse_court.py.
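
For anyone reviewing the multi-page charge fix, here is a minimal sketch of the general pattern, not the actual code in parse_docket.py. The function name `collect_charges`, the per-page list structure, and the sample rows are all assumptions made for illustration. The idea is to accumulate charges across pages instead of overwriting the result on each page, while keeping the statute and description as separate fields.

```python
def collect_charges(pages):
    """Gather charge rows from every page instead of keeping only the last page.

    `pages` is assumed to be a list with one entry per PDF page, where each
    entry is a list of already-parsed charge dicts (hypothetical structure).
    """
    charges = []
    for page_rows in pages:
        # extend() accumulates rows. Reassigning `charges = page_rows` here
        # would reproduce the bug where only the final page survives.
        charges.extend(page_rows)
    return charges


# Illustrative input: one charge on each of two pages, with the statute and
# description stored in separate fields rather than concatenated.
pages = [
    [{"statute": "18 § 3921", "description": "Theft By Unlawful Taking"}],
    [{"statute": "18 § 2701", "description": "Simple Assault"}],
]
all_charges = collect_charges(pages)
```

With this input, `all_charges` holds both charges in page order, which is the behavior the fix is meant to guarantee.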

notchia commented 3 years ago

Looking again at issue 37: this PR doesn't create a separate CSV and AWS bucket for the court summary data. Instead, it returns the docket and court summary as one CSV, loaded into the original AWS bucket. I'll hold off on this for now to make sure I'm actually providing what's needed, and will probably also split this into multiple smaller pull requests.
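
To make the "one CSV" behavior concrete, here is a rough sketch of combining a parsed docket and a parsed court summary into a single row. This is not the code in download.py: `combine_records`, `to_csv`, the field names, and the sample values are hypothetical, and the real parsers may return different structures.

```python
import csv
import io


def combine_records(docket, court_summary):
    """Merge the two parsed dicts into one row; docket fields win on key clashes."""
    row = dict(court_summary)
    row.update(docket)
    return row


def to_csv(rows):
    """Serialize a list of dicts to CSV text with a stable, sorted header."""
    fieldnames = sorted({key for row in rows for key in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder values only; these are not real docket or court summary data.
docket = {"docket_no": "EXAMPLE-0001", "bail_amount": "5000"}
court_summary = {"race": "Unknown", "sex": "Unknown"}

combined_csv = to_csv([combine_records(docket, court_summary)])
```

If the two outputs need to stay separate, the same `to_csv` helper could be called once with `[docket]` and once with `[court_summary]` to produce two files.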

adamrlinder commented 3 years ago

@notchia This rocks! I do think splitting it into smaller pull requests probably makes sense.

When we've talked about architecting this, we've talked about having Court Summaries end up in a different folder in S3. I don't think it existed yet, so I've created it at pbf-pdf-dockets/court-summary-data. We may want to talk about this on Tuesday's call to make sure everyone (including me!) understands what the structure is supposed to be.