mlissner opened 3 months ago
One thought: Many courts won't allow big or long documents, so we should only crawl the ones that do. We know which ones do based on this query (or a similar one in the DB):
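The query isn't reproduced here, but roughly something like this Django sketch (the model and field names are approximations and would need checking against our schema):

```python
from django.db.models import Count, Max

# Sketch only: the import path and model/field names below are approximate.
from cl.search.models import RECAPDocument

# Courts where we've already seen at least one 1,000+ page document,
# i.e. courts that apparently accept very long filings.
big_doc_courts = (
    RECAPDocument.objects.filter(page_count__gte=1000)
    .values("docket_entry__docket__court_id")
    .annotate(big_docs=Count("id"), max_pages=Max("page_count"))
    .order_by("-big_docs")
)
```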
I think we also parse page counts from attachment pages, right?
We do get counts from attachment pages, yeah, and they'd certainly be preferable when we can get them.
I've noticed that attachment pages are always fast, but that document receipt pages are not. I hypothesize that on PACER's back end, the document is retrieved and put in a temporary location when the receipt page is loaded, so crawling all the receipt pages will move files around inside PACER's software, while crawling attachment pages will not.
What I forget (if I ever knew) is whether you can request attachment pages for entries that don't have attachments.
A client, https://github.com/freelawproject/crm/issues/121, would like to gather the biggest documents available in PACER.
The goal is to buy about 50k docs that are over 1,000 pages each. We'll do this in two stages:

1. We'll buy the documents we already know about in our database. Currently that's 3,227 documents.
2. We'll identify documents above this size and buy them. The best way I can think of to do this is to iterate over receipt pages court by court, one document at a time (round-robining the courts to reduce impact). Recent docs are more useful than older ones, so we'll want to start at the highest `pacer_doc_id` we have for each court in our DB and work our way backwards (see the sketch below). We'll start with 5k docs, then pause for analysis with the client.
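To make the round-robin idea concrete, here's a minimal sketch of the loop. The PACER-facing pieces (listing doc IDs newest-first, reading page counts from receipt pages, buying) are passed in as callables; those names and signatures are hypothetical, not existing code:

```python
from itertools import cycle
from typing import Callable, Iterable


def crawl_big_docs(
    courts: Iterable[str],
    doc_ids_desc: Callable[[str], Iterable[str]],     # pacer_doc_ids, newest first
    get_page_count: Callable[[str, str], int | None],  # via the receipt page
    buy_document: Callable[[str, str], None],
    page_threshold: int = 1000,
    limit: int = 5_000,
) -> int:
    """Round-robin the courts, newest documents first, until `limit` purchases."""
    iterators = {court: iter(doc_ids_desc(court)) for court in courts}
    bought = 0
    for court in cycle(list(iterators)):
        if bought >= limit or not iterators:
            break
        if court not in iterators:
            continue  # this court was exhausted on an earlier pass
        try:
            doc_id = next(iterators[court])
        except StopIteration:
            del iterators[court]
            continue
        pages = get_page_count(court, doc_id)
        if pages is not None and pages >= page_threshold:
            buy_document(court, doc_id)
            bought += 1
    return bought
```

The round-robin keeps us from hammering any single court, and the 5k `limit` gives us the natural pause point for the analysis with the client.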