freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Some text only entries have "Buy on pacer" messages. #2177

Open hughbe opened 2 years ago

hughbe commented 2 years ago

For example, on https://www.courtlistener.com/docket/63395241/patterson-v-terraform-labs-pte-ltd/, docket entry 10 says "This is a text-only entry generated by the court. There is no document associated with this entry": image

Clicking on the link results in the following: image

mlissner commented 2 years ago

Yeah, we should do better here. I'd have to check the RSS feeds to see if they provide any indicator when entries are text-only. Dockets certainly do, and we could probably do a good job of figuring that out from them.

hughbe commented 2 years ago

Wondering if we'd be able to do something like removing the "buy on pacer" button if pacer_doc_id is empty?

For example, document 26 in the case https://www.courtlistener.com/docket/60985772/us-securities-and-exchange-commission-v-terraform-labs-pte-ltd/:

image

In the API, its pacer_doc_id field is empty:

{
   "resource_uri":"https://www.courtlistener.com/api/rest/v3/docket-entries/187873690/",
   "id":187873690,
   "docket":"https://www.courtlistener.com/api/rest/v3/dockets/60985772/",
   "recap_documents":[
      {
         "resource_uri":"https://www.courtlistener.com/api/rest/v3/recap-documents/193416106/",
         "id":193416106,
         "tags":[

         ],
         "absolute_url":"/docket/60985772/26/us-securities-and-exchange-commission-v-terraform-labs-pte-ltd/",
         "date_created":"2022-02-17T16:03:10.350536-08:00",
         "date_modified":"2022-02-18T10:16:33.335640-08:00",
         "sha1":"",
         "page_count":null,
         "file_size":null,
         "filepath_local":null,
         "filepath_ia":"",
         "ia_upload_failure_count":null,
         "thumbnail":null,
         "thumbnail_status":0,
         "plain_text":"",
         "ocr_status":null,
         "date_upload":null,
         "document_number":"26",
         "attachment_number":null,
         "pacer_doc_id":"",
         "is_available":false,
         "is_free_on_pacer":null,
         "is_sealed":null,
         "document_type":1,
         "description":"Order on Motion for Leave to File Document"
      }
   ],
   "date_created":"2022-02-17T16:03:10.325997-08:00",
   "date_modified":"2022-02-18T10:16:33.322343-08:00",
   "date_filed":"2022-02-09",
   "entry_number":26,
   "recap_sequence_number":"2022-02-09.001",
   "pacer_sequence_number":null,
   "description":"ORDER denying 23 Letter Motion for Leave to File Document: THE REQUEST TO STRIKE REPLY BRIEF OR FILE A SUR-REPLY IS DENIED. HOWEVER, THE COURT WILL SCHEDULE ORAL ARGUMENT SHORTLY. (HEREBY ORDERED by Judge J. Paul Oetken)(Text Only Order) (Oetken, J.) (Entered: 02/09/2022)",
   "tags":[

   ]
}

Another potential heuristic is removing the button if the description contains phrases such as "Text Only Order", "No Document", or "There is no document associated with this entry" (but this could end up removing the button for documents that are filed with an incorrect description).
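The two heuristics discussed so far (an empty pacer_doc_id and a telltale phrase in the description) could be combined along these lines. This is only a minimal sketch: the function name, the pattern list, and the decision to require both signals are assumptions for illustration, not CourtListener code, and the input is a plain dict shaped like the API response above.

```python
import re

# Hypothetical phrase list based on the examples in this thread; not a vetted set.
TEXT_ONLY_PATTERNS = [
    r"text[\s-]only",
    r"no document",
]
_TEXT_ONLY_RE = re.compile("|".join(TEXT_ONLY_PATTERNS), re.IGNORECASE)


def looks_text_only(entry: dict) -> bool:
    """Guess whether a docket entry (shaped like the API JSON) is text-only.

    Both signals must agree: no recap document has a pacer_doc_id AND the
    entry description contains one of the phrases above.
    """
    description = entry.get("description") or ""
    docs = entry.get("recap_documents") or []
    no_doc_ids = all(not doc.get("pacer_doc_id") for doc in docs)
    return no_doc_ids and bool(_TEXT_ONLY_RE.search(description))
```

Requiring both signals trades some recall for fewer false positives, which matches the concern about entries filed with an incorrect description.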

mlissner commented 2 years ago

I wish we could do it based on pacer_doc_id, but for whatever reason that field is often missing. I think for a time PACER did this differently and looked up documents based on the document number and pacer_docket_id, or something like that. If you dig in the code, you'll see that when we're missing the pacer_doc_id, we make the URL like this:

https://ecf.nysd.uscourts.gov/cgi-bin/show_case_doc?26,569722,,,
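The fallback URL above looks like it is composed of the court's ECF subdomain, the document number, and a second numeric identifier. A rough sketch of that composition follows, assuming the second number is the PACER case ID; the function and parameter names are hypothetical and not taken from the CourtListener codebase.

```python
def fallback_doc_url(court_id: str, document_number: int, pacer_case_id: int) -> str:
    """Build a show_case_doc URL from the document number and case ID.

    Assumption: the second query value (569722 in the example) is the
    PACER case ID. The trailing ",,," is copied from the example URL.
    """
    return (
        f"https://ecf.{court_id}.uscourts.gov/cgi-bin/show_case_doc"
        f"?{document_number},{pacer_case_id},,,"
    )
```

Calling `fallback_doc_url("nysd", 26, 569722)` reproduces the example URL.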

We could, I suppose, try to use heuristics based on the text, but like you say, it'll have false positives. Seems better to err on the side of false negatives (aka links that don't work) than to be missing links when we need them. That said, maybe the heuristics are better than I realize.

@johnhawkinson do you have a sense of whether that'd work to figure out which are text-only entries?

Something we could probably do going forward is have a boolean field in our DB for these types of entries, and set it correctly during ingestion from reliable sources (an RSS feed entry without a pacer_doc_id is a pretty sure bet to be text-only). That'd separate the problematic older entries from the new ones, where we'd know for sure one way or another.
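One way the flag idea might look, as a sketch only: the dataclass, the field name `is_text_only`, and the helper functions are all hypothetical, and the field is modeled as a nullable boolean (an assumption beyond the plain boolean mentioned above) so that older entries can stay "unknown".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DocketEntry:
    description: str
    pacer_doc_id: str = ""
    # None = unknown (older data); True/False = set by a reliable source.
    is_text_only: Optional[bool] = None


def ingest_from_rss(description: str, pacer_doc_id: str) -> DocketEntry:
    """Create an entry from an RSS item, treating a missing doc ID as text-only."""
    return DocketEntry(
        description=description,
        pacer_doc_id=pacer_doc_id,
        is_text_only=not pacer_doc_id,
    )


def show_buy_button(entry: DocketEntry) -> bool:
    """Hide the button only when the entry is known to be text-only."""
    return entry.is_text_only is not True
```

With this shape, entries ingested before the flag existed keep the button (possibly with a dead link), while newly ingested text-only entries drop it.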