freelawproject / juriscraper

An API to scrape American court websites for metadata.
https://free.law/juriscraper/
BSD 2-Clause "Simplified" License

Fill `coloctapp` gaps #979

Open grossir opened 6 months ago

grossir commented 6 months ago

From #929, related to #974

coloctapp

We have 0 documents between September 29, 2021 and February 2, 2022. We are missing those documents, and for now the only way to get them is to go into the PDFs.
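A gap like this one can be spotted by sorting the filing dates we already hold and flagging any long jump between neighbours. The sketch below is only an illustration: `find_gaps`, the 30-day threshold, and the sample dates are all made up for the example and are not taken from the database or from juriscraper.

```python
from datetime import date

def find_gaps(dates, min_gap_days=30):
    """Return (earlier, later) pairs of neighbouring dates more than min_gap_days apart."""
    ordered = sorted(set(dates))
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        if (later - earlier).days > min_gap_days:
            gaps.append((earlier, later))
    return gaps

# Hypothetical filing dates, chosen only to mimic the gap described above.
filed = [
    date(2021, 9, 16),
    date(2021, 9, 23),
    date(2021, 9, 29),
    date(2022, 2, 2),
    date(2022, 2, 10),
]
gaps = find_gaps(filed)
print(gaps)
```

The threshold is a judgment call: a court that publishes weekly would want something much smaller than 30 days.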

grossir commented 5 months ago

@flooie can you please check this source? From what I can see, we may need to parse the PDFs, since old case information is not available in HTML as it is for colo.

flooie commented 5 months ago

Hmm. Is it possible the HTML changed?

grossir commented 5 months ago

Some news about coloctapp: the Colorado Courts have just (well, on March 1, 2024) launched a new site for Appellate Opinions, and it actually has past opinions in HTML. We could implement the backscraper from there instead of dealing with PDFs.

Check it out here

mlissner commented 5 months ago

CO was one of the worst states. Does this mean it's finally not so terrible?

grossir commented 5 months ago

This new Colorado site seems to have no search filters except for "court". Getting the document URL requires more steps/requests, and it uses vlex as the backend. The downloaded opinion PDF comes in a zip, and the document has a vlex link in it.

So, I don't know if it qualifies as not being terrible, but at least it will let us look for past opinions without going into PDFs
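Since the opinion PDF arrives inside a zip, the download step would need to unpack it before anything else. This is a minimal sketch using only the standard library. `extract_pdfs` is a hypothetical helper, not part of juriscraper, and the archive layout of the real downloads (member names, how many files there are) has not been checked, so the demo builds its own archive.

```python
import io
import zipfile
from typing import Dict

def extract_pdfs(zip_bytes: bytes) -> Dict[str, bytes]:
    """Return {member name: content} for every .pdf member of an in-memory zip."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {
            name: archive.read(name)
            for name in archive.namelist()
            if name.lower().endswith(".pdf")
        }

# Build a small archive in memory so the example does not depend on the site.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("opinion.pdf", b"%PDF-1.4 placeholder")
    archive.writestr("readme.txt", b"not a pdf")

pdfs = extract_pdfs(buffer.getvalue())
print(sorted(pdfs))
```

Working on bytes instead of a file path keeps everything in memory, so a response body could be passed straight in.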

grossir commented 5 months ago

So, we also have a more recent gap. We are missing every Opinion announced on:

I don't know why the scraper has been failing...

grossir commented 4 months ago

Command to fill the gaps

docker exec -it cl-django python /opt/courtlistener/manage.py cl_back_scrape_opinions --courts juriscraper.opinions.united_states.state.coloctapp --backscrape-start=09/28/2021 --backscrape-end=02/01/2022
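
If one long range proves too heavy for a single run, it could be split into smaller windows and the command run once per window. The sketch below assumes the flag values are `MM/DD/YYYY`, which is inferred from the command above and not from the command's source. `date_windows` and the 30-day window size are hypothetical.

```python
from datetime import date, datetime, timedelta

DATE_FORMAT = "%m/%d/%Y"  # assumed from the flag values in the command above

def date_windows(start: str, end: str, days: int = 30):
    """Split the inclusive range [start, end] into consecutive windows of at most `days` days."""
    start_date = datetime.strptime(start, DATE_FORMAT).date()
    end_date = datetime.strptime(end, DATE_FORMAT).date()
    if start_date > end_date:
        raise ValueError("start must not be after end")
    windows = []
    cursor = start_date
    while cursor <= end_date:
        window_end = min(cursor + timedelta(days=days - 1), end_date)
        windows.append((cursor, window_end))
        cursor = window_end + timedelta(days=1)
    return windows

windows = date_windows("09/28/2021", "02/01/2022")
for window_start, window_end in windows:
    # Print each window in the same format the flags use.
    print(window_start.strftime(DATE_FORMAT), window_end.strftime(DATE_FORMAT))
```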

grossir commented 2 months ago

The old scraper went down some months ago. The most recent coloctapp opinion is from March 7th, 2024, so this is a new gap.