josh-chamberlain opened 10 months ago
@josh-chamberlain I talked with Max Chis and I'm interested in working through this issue. I'll message you on Discord asking for the endpoint.
@michaeldepace thanks! I shared details via DM, but putting here too:
For starters, we would want to do a scraper in the scrapers repository, which writes to CSV or JSON or SQLite or something depending on what’s convenient for you. Once it works locally and seems to do what it needs to, we can worry about scale and stuff. For this case I’m not sure how often we need it to run.
This is a complicated one—let me know if you want to chat or need anything!
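As a starting point, the output step can stay very small. A minimal sketch, assuming JSON Lines output; `scrape_cases` is a hypothetical placeholder for the actual scraping logic:

```python
# A minimal sketch of writing scraper output, assuming JSON Lines;
# scrape_cases is a hypothetical placeholder for the real scraping logic.
import json

def scrape_cases() -> list[dict]:
    """Placeholder: return one dict per scraped case."""
    return []

with open("cases.jsonl", "w") as f:
    for case in scrape_cases():
        f.write(json.dumps(case) + "\n")
```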
We got some info from a person who previously used a different endpoint to create a database with these tables:
Cases
Defendants
Case Calendar
Confinements
Offenses
Bails
Officers
Police Departments
Judges
Courts
Record counts
Here's some edited explanation:
Keep in mind that while filing date, docket number, court office, and short caption (plaintiff v. defendant) are available from the overview for all cases, and municipalities and zip codes of participants are usually (though not always) on the docket sheets, exact addresses are not publicly available.
Also keep in mind that while the fields present in the overview all fit neatly into a table with one string per field, the data from the docket sheets does not.
Each entry in the SQLite file is a Python defaultdict. This was necessary since many of the fields in the docket are nested, of variable length, and need to be incrementally modified each time the docket sheet is scraped. When most of the rest of the system, including the overview entries for each case, was converted to PostgreSQL, I was not able to figure out a workable way of putting the data in the defaultdicts into a Postgres table. Accessing the data in the mdj_docket_cases.db SQLite file would, I expect, be possible, but it would require you to use SqliteDict and defaultdict from Python 3.
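If someone does need to read that file directly, here is a minimal sketch, assuming it was written with the sqlitedict library using pickle (its default serializer) and the default table name:

```python
# A minimal sketch of reading mdj_docket_cases.db, assuming it was written
# with the sqlitedict library using pickle (its default) and the default
# table name 'unnamed'; adjust tablename= if the file uses another.
from sqlitedict import SqliteDict

with SqliteDict("mdj_docket_cases.db") as cases:
    for docket_number, case in cases.items():
        # each value should unpickle back into a (possibly nested) defaultdict
        print(docket_number, type(case), list(case.keys())[:5])
        break
```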
It may be simpler to select docket numbers of interest and then use the PAeDocket web API for whatever details you want.
In case it might be helpful, I have included a Python notebook showing, for a random case, what the overview data (docket_table) and the parsed docket sheet data (cases_table) look like compared with what the PAeDocket web API returns. Note that the docket_table entries are not updated after the overview is initially scraped, so they are only valid as of the date in the 'Scrape Date' field, and that the data in docket_table (but not cases_table) is available in the Postgres DB.
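As a sketch of that simpler route, with a hypothetical placeholder URL and parameter name (the real endpoint details are only being shared with engaged volunteers, per the Context section below):

```python
# A hedged sketch of pulling details for selected docket numbers; API_URL
# and the "docket" parameter are hypothetical placeholders, not the real
# PAeDocket endpoint.
import requests

API_URL = "https://example.invalid/paedocket"  # placeholder

def fetch_docket(docket_number: str) -> dict:
    resp = requests.get(API_URL, params={"docket": docket_number}, timeout=30)
    resp.raise_for_status()
    return resp.json()
```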
Given the nested nature of the data and the flexible nature of our questions + purposes, what if we just put the JSON, mostly unaltered, into a place where Elasticsearch could get at it? We could even use Elastic Cloud to test it out, rather than hosting our own.
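A minimal sketch of that idea, assuming the official Python client (8.x); the connection details, index name, and record are placeholders:

```python
# A minimal sketch of indexing case JSON, assuming the elasticsearch Python
# client (8.x); connection details, index name, and the record itself are
# placeholders.
from elasticsearch import Elasticsearch

# for Elastic Cloud: Elasticsearch(cloud_id=..., api_key=...)
es = Elasticsearch("http://localhost:9200")

case = {
    "docket_number": "EXAMPLE-0001",  # placeholder record
    "participants": [{"municipality": "Pittsburgh", "zip": "15219"}],
}
# nested JSON can go in mostly unaltered; fields are mapped dynamically
es.index(index="dockets", id=case["docket_number"], document=case)
```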
Context
existing case search: https://ujsportal.pacourts.us/CaseSearch
endpoint: only sharing with engaged volunteers
Initial work required
- ask for a start date and the number of cases to get
- get cases by docket number (reasonable timeout)
- ask "more cases?" (Y/n) and repeat (see the sketch below)
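A minimal sketch of that loop, with `fetch_cases_since` as a hypothetical placeholder for the real (privately shared) endpoint call:

```python
# A minimal sketch of the prompts above; fetch_cases_since is a hypothetical
# placeholder for the real (privately shared) endpoint call.
import json

def fetch_cases_since(start_date: str, count: int) -> list[dict]:
    """Placeholder: fetch `count` cases by docket number, starting at
    start_date, with a reasonable per-request timeout."""
    return []

def main() -> None:
    while True:
        start_date = input("start date (YYYY-MM-DD): ")
        count = int(input("number of cases to get: "))
        for case in fetch_cases_since(start_date, count):
            print(json.dumps(case))
        if input("more cases? (Y/n) ").strip().lower() == "n":
            break

if __name__ == "__main__":
    main()
```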
cases?" (Y/n)next milestone
Risks
How to start
Related questions
All of these concern Allegheny County; we believe all of them can only be answered with the aid of court docket analysis.