audiodude / court-version-scraper

A simple web app which scrapes the PACER court listing for ECF versions and displays them
http://court-version-scraper.herokuapp.com/
MIT License

Examine using the XML data available from the courts? #2

Open mlissner opened 7 years ago

mlissner commented 7 years ago

I just discovered that there is an XML version of the data that this scraper is ingesting. It's not wonderful XML, but it's decent. If we ever rewrite this code, we should see if using it is easier/more reliable.

Here's an example link:

https://ecf.cod.uscourts.gov/cgi-bin/CourtInfo.pl?output=xml

I think we could probably get these links for pretty much every jurisdiction, if we wanted.

audiodude commented 7 years ago

Probably sturdier, but I doubt it will make the scraping much faster. Might as well implement it and see, though.

mlissner commented 7 years ago

Yeah, I'd expect speed to be nearly identical, but maybe the data is more structured/easier to work with?

audiodude commented 7 years ago

So I tried implementing this just now, but the problem is that I don't have links to the individual courts' /CourtInfo.pl pages.

I only have the general PACER courtinfo URLs that look like this: https://www.pacer.gov/psco/cgi-bin/courtinfo.pl?court=E_ALMDC&output=xml

That's because I'm scraping this page to get the list of courts: https://www.pacer.gov/psco/cgi-bin/links.pl

mlissner commented 7 years ago

I think you could use the domains from each of the URLs on the page that lists the courts. For example, Alabama links to:

https://ecf.almd.uscourts.gov/

If you just tack /cgi-bin/CourtInfo.pl?output=xml on the end, it seems to work:

https://ecf.almd.uscourts.gov/cgi-bin/CourtInfo.pl?output=xml

Could that work?
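A minimal sketch of that idea in Python, assuming the listing page links to each court with an ordinary `<a href>` whose host looks like `ecf.<court>.uscourts.gov`. The helper names (`court_info_xml_url`, `CourtLinkCollector`, `xml_urls_from_listing`) and the host filter are illustrative assumptions, not part of the existing scraper:

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit, urlunsplit

# Path and query taken from the example URLs in this thread.
COURT_INFO_PATH = "/cgi-bin/CourtInfo.pl"


def court_info_xml_url(court_link):
    """Keep only the scheme and host of a court link and point it at the XML page."""
    parts = urlsplit(court_link)
    return urlunsplit((parts.scheme, parts.netloc, COURT_INFO_PATH, "output=xml", ""))


class CourtLinkCollector(HTMLParser):
    """Collect hrefs whose host looks like an ECF court domain (assumed pattern)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        host = urlsplit(href).hostname or ""
        if host.startswith("ecf.") and host.endswith(".uscourts.gov"):
            self.links.append(href)


def xml_urls_from_listing(html):
    """Turn the HTML of the court listing page into a list of CourtInfo XML URLs."""
    collector = CourtLinkCollector()
    collector.feed(html)
    return [court_info_xml_url(link) for link in collector.links]
```

For example, `court_info_xml_url("https://ecf.almd.uscourts.gov/")` gives `https://ecf.almd.uscourts.gov/cgi-bin/CourtInfo.pl?output=xml`. Fetching the listing page itself is left out here; the sketch only covers the URL rewriting.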

audiodude commented 7 years ago

That could definitely work.

Unfortunately, I don't think the XML version has any info that the web page doesn't. For example, for the U.S. Bankruptcy Court for the Western District of North Carolina, the go-live date is missing on the web page and is also missing from the XML: https://ecf.ncwb.uscourts.gov/cgi-bin/CourtInfo.pl?output=xml

mlissner commented 7 years ago

Feels like something to do later, if this is ever rewritten from scratch.

johnhawkinson commented 6 years ago

The XML is definitely missing valuable information. Today I observed:

https://ecf.dcd.uscourts.gov/cgi-bin/CourtInfo.pl

Case Number Format: O:YY-TY-#####-INI-RIN example: 1:18-cv-00374
RSS Feed: Docket entries of type: all Last 24 hours' entries - Internet

but https://ecf.dcd.uscourts.gov/cgi-bin/CourtInfo.pl?output=xml:

<opt>
  <CaseNo>O:YY-TY-#####-INI-RIN &lt;i&gt;example:&lt;/i&gt; 1:18-cv-00374
</CaseNo>
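To make the comparison concrete, here is a rough Python sketch of reading that XML and checking which fields are absent. It assumes the document is a flat `<opt>` root with one child element per field; only `CaseNo` is known from the excerpt above, the closing `</opt>` in the sample is added so it parses, and the name `HypotheticalRssFeed` is made up purely to show a missing field:

```python
import re
import xml.etree.ElementTree as ET

# The excerpt quoted above, with a closing </opt> added (assumption) so it is well-formed.
SAMPLE = """<opt>
  <CaseNo>O:YY-TY-#####-INI-RIN &lt;i&gt;example:&lt;/i&gt; 1:18-cv-00374
</CaseNo>
</opt>"""

TAG_RE = re.compile(r"<[^>]+>")


def clean_value(text):
    """Drop the escaped HTML tags embedded in a value and collapse whitespace."""
    without_tags = TAG_RE.sub("", text or "")
    return " ".join(without_tags.split())


def court_info_fields(xml_text):
    """Map each direct child of the root element to its cleaned text."""
    root = ET.fromstring(xml_text)
    return {child.tag: clean_value(child.text) for child in root}


def missing_fields(fields, expected):
    """Return the expected field names that are absent or empty."""
    return [name for name in expected if not fields.get(name)]


fields = court_info_fields(SAMPLE)
# "HypotheticalRssFeed" is a placeholder name, not a real element of the court XML.
print(missing_fields(fields, ["CaseNo", "HypotheticalRssFeed"]))
```

The parser decodes `&lt;i&gt;` into a literal `<i>` inside the text, which is why the value needs a second pass to strip the markup.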

Also, just to record my note from Twitter this afternoon: https://twitter.com/johnhawkinson/status/965717770120855552

I guess I sort of wish @audiodude's scraper https://court-version-scraper.herokuapp.com/ tracked historical versions and also presented the entire XML. Was this the change from rss_outside.pl to readyDockets?

Although I guess I should have written s/XML/HTML/