Gather the biggest documents from PACER

freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.

https://www.courtlistener.com


Gather the biggest documents from PACER #4356

Open mlissner opened 3 months ago

mlissner commented 3 months ago

A client, https://github.com/freelawproject/crm/issues/121, would like to gather the biggest documents available in PACER.

The goal is to buy about 50k documents of more than 1,000 pages each. We'll do this in two stages:

1. We'll buy the documents we already know about in our database. Currently that's 3,227 documents.
2. We'll identify other documents above this size and buy them. The best way I can think of to do this is to iterate over receipt pages one by one, round-robining across courts to reduce the impact on each court.

Recent docs are more useful than older ones, so we'll want to start at the highest pacer_doc_id we have for each court in our DB and work our way backwards.

We'll start with 5k docs, then pause for analysis with the client.
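To make the crawl order concrete, here is a minimal sketch of the round-robin, newest-first iteration described above. It is not CourtListener code: the function name `round_robin_descending`, the `floor` parameter, and the treatment of `pacer_doc_id` as a plain integer that can be decremented are all assumptions for illustration. Real PACER document IDs are strings and may need more careful handling.

```python
from itertools import islice
from typing import Dict, Iterator, Tuple


def round_robin_descending(
    max_ids: Dict[str, int], floor: int = 1
) -> Iterator[Tuple[str, int]]:
    """Yield (court_id, doc_id) pairs, visiting one court at a time.

    Each court starts at its highest known doc ID and walks backwards.
    A court drops out of the rotation once it reaches `floor`.
    Treating doc IDs as decrementable integers is an assumption.
    """
    cursors = dict(max_ids)  # copy so the caller's mapping is untouched
    while cursors:
        # Iterate over a snapshot of the keys so courts can be removed safely.
        for court in list(cursors):
            doc_id = cursors[court]
            yield court, doc_id
            if doc_id <= floor:
                del cursors[court]
            else:
                cursors[court] = doc_id - 1


# Hypothetical starting points: highest doc ID per court in the DB.
starts = {"court_a": 3, "court_b": 2}
first_batch = list(islice(round_robin_descending(starts), 4))
```

Capping the generator with `islice` is one way to stop after an initial batch (for example, the first 5k documents) and pause for analysis before continuing.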

mlissner commented 3 months ago

One thought: Many courts won't allow big or long documents, so we should only crawl those that do. We know which do based on this query (or a similar one in the DB):

https://www.courtlistener.com/?type=r&q=page_count%3A%5B1000%20TO%2010000%5D&type=r&order_by=score%20desc
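As a rough sketch of how the results of such a query could be turned into a crawl list, the snippet below picks out courts that have at least one known document at or above the page threshold. The input shape (pairs of court ID and page count) and the function name `courts_allowing_big_docs` are hypothetical, not part of the CourtListener schema or API.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple


def courts_allowing_big_docs(
    rows: Iterable[Tuple[str, Optional[int]]], min_pages: int = 1000
) -> List[str]:
    """Return courts with at least one document of `min_pages` or more.

    `rows` is an iterable of (court_id, page_count) pairs; a page_count
    of None (unknown) is ignored. Courts with the most big documents
    come first.
    """
    counts = Counter(
        court
        for court, pages in rows
        if pages is not None and pages >= min_pages
    )
    return [court for court, _ in counts.most_common()]


# Hypothetical sample rows, as if pulled from the database.
sample = [
    ("nysd", 1200),
    ("cand", 50),
    ("nysd", 3000),
    ("txed", 1000),
    ("cand", None),
]
crawl_list = courts_allowing_big_docs(sample)
```

Ordering by the number of known big documents is a design choice here: it puts the courts most likely to yield hits at the front of the rotation.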

ttys0dev commented 3 months ago

I think we also parse page counts from attachment pages, right?

mlissner commented 3 months ago

We do get counts from attachment pages, yeah, and they'd certainly be preferable when we can get them.

I've noticed that attachment pages are always fast, but document receipt pages are not. My hypothesis is that on PACER's back end, the document is retrieved and put in a temporary location when the receipt page is loaded, so crawling all receipt pages would move files around inside PACER's software, while crawling attachment pages would not.

What I forget (if I ever knew) is whether you can request attachment pages for entries that don't have attachments.