TalkAboutLocal / local-news-engine

GNU Affero General Public License v3.0

"Get a courts file" #99

Open psychemedia opened 7 years ago

psychemedia commented 7 years ago

What is the source for courts files?

I previously had access to a couple of court files from a media org and produced a very ropey scraper that generated a set of data tables from them.

Would be interesting to compare notes and uses? Also robustness of scraper? Ethics of structuring and retaining/processing information?

I wonder if creating a thread that describes the structure of info that can be extracted from PDFs, without revealing actual data, might be useful? Also production of a set of dummy PDFs structured similarly to official docs, to act as a foil for discussion about:

robredpath commented 7 years ago

Hi @psychemedia - I think I can answer a few of your points!

What is the source for courts files?

We operated LNE for a few weeks using courts data supplied to us from Talk About Local, a media organisation with whom we had a data processing agreement. They in turn received it directly in electronic form from the courts as part of the weekly distribution of the list.

scraper that generated a set of data tables from them

The scraper that's part of LNE ultimately creates JSON rather than tabular data, but it would be trivial to flatten the JSON as there's nothing overly complex about it.
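As a rough illustration of the flattening mentioned above, here is a minimal sketch using only the Python standard library. The field names (`case_number`, `defendant`, `offences`) and the values are invented placeholders, not the actual LNE output schema or real court data; the helper names `flatten` and `to_csv` are likewise hypothetical.

```python
import csv
import io

# Hypothetical parser output: a list of cases, each a dict of extracted fields.
# All names and values here are made-up placeholders, not real court data.
cases = [
    {
        "case_number": "000001",
        "defendant": {"name": "Example Person A", "age": 30},
        "offences": ["Offence X", "Offence Y"],
    },
    {
        "case_number": "000002",
        "defendant": {"name": "Example Person B"},
        "offences": ["Offence Z"],
    },
]


def flatten(record, prefix=""):
    """Flatten nested dicts into dotted keys; join lists into one '; '-separated string."""
    row = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=f"{name}."))
        elif isinstance(value, list):
            row[name] = "; ".join(str(item) for item in value)
        else:
            row[name] = value
    return row


def to_csv(records):
    """Render flattened records as CSV text, leaving missing fields blank."""
    rows = [flatten(record) for record in records]
    # Collect the union of columns in first-seen order, since cases may lack fields.
    columns = []
    for row in rows:
        for column in row:
            if column not in columns:
                columns.append(column)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Calling `to_csv(cases)` gives one row per case, with a blank cell wherever a field was not present in the source data.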

Would be interesting to compare notes and uses?

I'm always happy to discuss this. I think you already have my work email; failing that, you follow me on Twitter, so do feel free to DM me!

Also robustness of scraper?

It's been run on about 10 sets of courts data so far, albeit always from the same court, and the first 4 or 5 times required some tinkering to deal with edge cases. The last couple of times it went smoothly.

Ethics of structuring and retaining/processing information?

This is where it always becomes tricky! We ran a very limited trial with a small number of users and had some ethical discomfort about even that, although ultimately the limited scope assuaged our fears. There could be some horrendous ethical issues around automated services, though!

I wonder if creating a thread that describes the structure of info that can be extracted from PDFs, without revealing actual data, might be useful?

Broadly speaking, the output from the parser is a list of cases, each of which contains the fields that we were able to extract. If you take a look at https://github.com/TalkAboutLocal/local-news-engine/blob/master/courts_parse.py, any element that we call setResultsName() on is a field that we're able to extract if it's in the data.
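For anyone unfamiliar with pyparsing, here is a small sketch of how named results turn into extractable fields. It is not taken from courts_parse.py: the line layout ("Case ... Name: ... Age: ..."), the field names, and the `parse_case` helper are all assumptions made up for illustration, and it requires the third-party `pyparsing` package.

```python
from pyparsing import Optional, Suppress, Word, alphas, nums

# Hypothetical grammar for a made-up line layout such as:
#   "Case 000001 Name: Example Age: 30"
# Each element given a results name becomes a field in the parsed output.
case_number = Suppress("Case") + Word(nums).setResultsName("case_number")
name = Suppress("Name:") + Word(alphas).setResultsName("name")
age = Suppress("Age:") + Word(nums).setResultsName("age")

# The age is optional, so it only appears in the output when the line has it.
case_line = case_number + name + Optional(age)


def parse_case(line):
    """Parse one dummy line and return the named fields as a plain dict."""
    return case_line.parseString(line).asDict()
```

With this grammar, `parse_case("Case 000002 Name: Sample")` returns a dict with `case_number` and `name` but no `age` key, which mirrors the "if it's in the data" behaviour described above.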

Hope that's useful!