freelawproject / recap

This repository is for filing issues on any RECAP-related effort.
https://free.law/recap/

Attachment pages aren't getting pacer_case_id values #226

Closed by mlissner 6 years ago

mlissner commented 6 years ago

This seems bad. Instead of a real ID, they're getting sent with the literal string "undefined" or "null" as their pacer_case_id.

I'll be able to fix this for old stuff by yanking the ID out of the links or the button, and https://github.com/freelawproject/juriscraper/releases/tag/1.12.3 does exactly that, but something needs to be done about the RECAP client.
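As a rough illustration of how a missing ID can end up as one of those literal strings, and how a receiver could guard against it, here is a minimal sketch. The names buildPayload and cleanCaseId are hypothetical and are not taken from the RECAP client or from juriscraper. It assumes only that values get coerced to strings when the upload is assembled.

```javascript
// Hypothetical helpers for illustration; not code from the RECAP extension.

// Simulates assembling an upload where every value is coerced to a string.
function buildPayload(fields) {
  const payload = {};
  for (const [key, value] of Object.entries(fields)) {
    payload[key] = String(value); // undefined -> "undefined", null -> "null"
  }
  return payload;
}

// Guard: accept only all-digit IDs; anything else counts as "no ID".
function cleanCaseId(value) {
  if (value === undefined || value === null) return null;
  const text = String(value).trim();
  return /^\d+$/.test(text) ? text : null;
}

const payload = buildPayload({ pacer_case_id: undefined });
console.log(payload.pacer_case_id);              // the string "undefined"
console.log(cleanCaseId(payload.pacer_case_id)); // null
```

A guard like this only keeps junk values from being stored as IDs. It does not recover the real ID, which still has to come from the page itself.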

mlissner commented 6 years ago

Another wrinkle. Of the ~9k attachments we've received, only about 2k have HTML files associated with them. The remainder do not. This can be either because the HTML file was pulled off during an earlier processing run that crashed (which I've seen) or because it wasn't uploaded in the first place.

mlissner commented 6 years ago

OK, this is actually not too bad. First, most of these didn't have documents associated with them because they were processed successfully. I assumed that if they didn't have a valid case ID they'd fail, but that was wrong. These were getting processed just fine even with bad case ID values, because the value just isn't used.

The 2k items that still have documents associated with them are ones that are for dockets that we don't have for some reason. So that's a lot, but it's not a problem unique to attachment pages.

The fix here, then, is:

The rest of the old data is fine, more or less.

mlissner commented 6 years ago

Looking at the client side of this, we find the following fun comment:

// pacer_case_id is not currently used by backend, but send anyway...

So...we knew it wasn't used by the backend. Regardless, it wasn't getting set because the only source we had for it was the current URL, which is typically of this form:

https://ecf.dcd.uscourts.gov/doc1/04506319222

Or the referrer URL, which is of this form:

https://ecf.dcd.uscourts.gov/cgi-bin/DktRpt.pl?415844360161432-L_1_0-1

Ain't no case numbers there, so we would just send "undefined". In the changes that just landed (above), I've added a new function that will suck the pacer_case_id value from the forms on the page, if that's possible.
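The fallback described here could look roughly like the sketch below. It is not the function that landed in the client. The function names, the sample markup, and the case ID 178502 are all made up for illustration, and it assumes the page's form markup carries a caseid=<digits> parameter somewhere (for example in a button's onclick). It also works on plain strings with regexes rather than the DOM so it can run outside a browser.

```javascript
// Hypothetical sketch; names and sample markup are invented for illustration.

// Look for a caseid=<digits> query parameter in a URL string.
function caseIdFromUrl(url) {
  const match = /[?&]caseid=(\d+)/i.exec(url || "");
  return match ? match[1] : null;
}

// Scan each <form>...</form> block in the markup for a caseid=<digits> parameter.
function caseIdFromForms(html) {
  const formPattern = /<form\b[\s\S]*?<\/form>/gi;
  for (const form of (html || "").match(formPattern) || []) {
    const match = /[?&;]caseid=(\d+)/i.exec(form);
    if (match) return match[1];
  }
  return null;
}

// Try the current URL, then the referrer, then the forms on the page.
function findPacerCaseId(currentUrl, referrer, html) {
  return (
    caseIdFromUrl(currentUrl) ||
    caseIdFromUrl(referrer) ||
    caseIdFromForms(html)
  );
}

// The two URL forms quoted above, plus made-up form markup.
const currentUrl = "https://ecf.dcd.uscourts.gov/doc1/04506319222";
const referrer =
  "https://ecf.dcd.uscourts.gov/cgi-bin/DktRpt.pl?415844360161432-L_1_0-1";
const sampleHtml =
  "<form><input type=\"button\" " +
  "onclick=\"parent.location='/cgi-bin/DktRpt.pl?caseid=178502'\"></form>";

console.log(findPacerCaseId(currentUrl, referrer, sampleHtml));
```

Returning null when nothing is found, instead of letting an undefined value get stringified, lets the caller leave the field out of the upload entirely.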