freelawproject / recap

This repository is for filing issues on any RECAP-related effort.
https://free.law/recap/

If we don't know the case number, don't upload and don't show "undefined" in filename #177

Closed mlissner closed 6 years ago

mlissner commented 6 years ago

Right now if we don't know the case number for a case, we end up showing a download link that says, roughly:

gov.uscourts.dcd.undefined.44.0.pdf

That's lame.

And we upload the file to the RECAP Archive, where it undoubtedly gets rejected, since we don't know the pacer_case_id number. So...let's not do this?

johnhawkinson commented 6 years ago

Why reject it? Aren't document IDs unique per ECF instance? Hold it until some docket is uploaded that claims that document number.

Let's shoot for what's right and only compromise if we have to.

johnhawkinson commented 6 years ago

Also, this is the same as #174 right? Let's close that / dupe it into this one.

And there are a lot of places where a user might download a document without RECAP knowing the case number. It's an annoyance I've lived with for years, but I'd love for it to go away.

It was really nice to see the blue [R] in a written opinion report -- hrmm...I had an example of that and it seems to have gone away. I'm confused how that happened and now it has disappeared....sometimes I wish I had a browser that screenshotted every page :) Update: Oh, right, wrong extension version. Like so:

[Screenshot: 2017-11-02 at 19:31:11]

That [R] did not appear in the classic FF extension.

mlissner commented 6 years ago

#174 is about capturing the data from the written opinions report. This one is about not uploading and about having better file names.

As for doing the ideal — uploading and holding the item until a docket comes along — we could file a bug for that on CourtListener. It's been suggested before, but...it's complicated. We'd have to be careful to always avoid these items when we do things like add stuff to the search engine, or serve stuff in the API. So...to have them is fine, but we have to make sure they don't end up anywhere they shouldn't. Maybe they get their own DB table and that's that?
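To make the "own DB table" idea a bit more concrete, here is a minimal in-memory sketch. It is not how CourtListener works; the `OrphanStore` class, its method names, and the sample ID are all hypothetical. The point is only that orphans live in a separate place that search and API code never read from, and are released when a docket claims their document ID.

```python
class OrphanStore:
    """Hypothetical holding area for uploads whose case is unknown.

    Items are keyed by pacer_doc_id and kept apart from normal documents,
    so search indexing and API code would simply never look here.
    """

    def __init__(self):
        self._held = {}

    def hold(self, pacer_doc_id, pdf_bytes):
        """Park a PDF until some docket refers to its document ID."""
        self._held[pacer_doc_id] = pdf_bytes

    def claim(self, docket_doc_ids):
        """Release and return any held PDFs that a new docket refers to."""
        claimed = {}
        for doc_id in docket_doc_ids:
            if doc_id in self._held:
                claimed[doc_id] = self._held.pop(doc_id)
        return claimed

    def __len__(self):
        return len(self._held)


store = OrphanStore()
store.hold("01234567890", b"%PDF-")  # made-up document ID
released = store.claim(["01234567890", "09999999999"])
```

After the `claim` call, `released` holds the one matching PDF and the store is empty again.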

johnhawkinson commented 6 years ago

I'm not sure why it's complicated. So what if such items were searched or returned in the API? I don't see the problem. But regardless, the client should not decide it knows better and fail to upload them. (In fact, the docket might already have been uploaded. The server might already know the document ID).

OK, so #174 is about adding more client parsing support and #176 is about potentially limiting uploads (IMNSHO a v. bad idea).

mlissner commented 6 years ago

There's no point in uploading something if the server can't handle it. At the moment neither the old RECAP server nor CourtListener knows what to do with a PDF that's orphaned from its case number. So we can push them (wasting bandwidth), but either server will just reject them.

I think it's better not to push something the server can't handle. The client shouldn't make invalid requests of the server. There's just no point. If we want to open another ticket to add this support to CL and to undo this bug, that seems reasonable to me. But until CL supports orphaned documents, I don't see the point of making invalid requests.

mlissner commented 6 years ago

OK, I recant! The server will now support uploads without pacer_case_id values, and it will try to look them up by the pacer_doc_id. If that fails, all bets are off, but it's one more way we might get some content.
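Roughly, the fallback looks like the sketch below. This is an illustration only, not the actual server code: `KNOWN_DOCS`, `resolve_case_id`, and the sample IDs are hypothetical stand-ins for whatever lookup the server really does.

```python
from typing import Optional

# Hypothetical index of documents the server already knows about,
# keyed by pacer_doc_id. The IDs are made up for illustration.
KNOWN_DOCS = {
    "01234567890": {"pacer_case_id": "12345", "document_number": 44},
}


def resolve_case_id(
    pacer_case_id: Optional[str], pacer_doc_id: Optional[str]
) -> Optional[str]:
    """Return the case ID for an upload, falling back to a pacer_doc_id lookup."""
    if pacer_case_id:
        # The client knew the case, so there is nothing to look up.
        return pacer_case_id
    if pacer_doc_id:
        match = KNOWN_DOCS.get(pacer_doc_id)
        if match is not None:
            return match["pacer_case_id"]
    # Neither value identifies a case: all bets are off.
    return None
```

An upload with no pacer_case_id but a known pacer_doc_id resolves to the stored case; an upload where neither is known returns `None`.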

Next step, better file names in this instance, and this should be resolved.

johnhawkinson commented 6 years ago

I don't think this was a great fix. We went from having gov.uscourts.dcd.undefined.44.0.pdf to gov.uscourts.dcd.unknown-case-id.44.0.pdf.

To me that's actually worse -- the three words separated by hyphens look like there is more structured content to parse here (cf. Miller, George A. The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information; Psychological Review 1994, Vol. 101, No. 2, 343-352), and the 15-character static string is out of sync with the typical 6-character strings (gov.uscourts.dcd.160370.63.0.pdf), whereas undefined is a 9-char string that fits better. Just plain gov.uscourts.dcd.unknown.63.0.pdf would feel better, I think.

More significantly, this produces non-unique filenames. There are a lot of cases with a document 44. So this naming mechanism encourages filename duplication, which is bad.
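A quick sketch of the collision, assuming the name is assembled from court, case ID, document number, and attachment number in that order. The `download_name` helper is hypothetical and is not the extension's actual code.

```python
def download_name(court, case_id, doc_num, att_num=0):
    """Build a gov.uscourts.<court>.<case>.<doc>.<att>.pdf style name."""
    # Fall back to the static placeholder when the case is unknown.
    case_part = case_id if case_id else "unknown-case-id"
    return f"gov.uscourts.{court}.{case_part}.{doc_num}.{att_num}.pdf"


# Two unrelated documents, each numbered 44, both with the case unknown.
first = download_name("dcd", None, 44)
second = download_name("dcd", None, 44)

print(first)
print(first == second)
```

Both calls produce the same name, so the second download would collide with the first.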

This is also a change from recap-firefox. Historically it would name files with the doc1 number if it didn't know the caseid, e.g.: mad-09504412995.pdf or sometimes just 09504412995 (no extension?!).

I'm not sure what I really want here. My inclination is to say we should use what we know, and if we lack case information, we should record the court name and the doc1 number as well as the docket number. So that suggests gov.uscourts.mad.unknown-09504412995.44.0.pdf.
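As a sketch of that suggestion (the `proposed_name` helper and its parameters are hypothetical, just to show the shape of the name):

```python
from typing import Optional


def proposed_name(
    court: str,
    case_id: Optional[str],
    doc1_id: str,
    doc_num: int,
    att_num: int = 0,
) -> str:
    """Use the case ID when known, otherwise 'unknown-' plus the doc1 number."""
    case_part = case_id if case_id else f"unknown-{doc1_id}"
    return f"gov.uscourts.{court}.{case_part}.{doc_num}.{att_num}.pdf"


print(proposed_name("mad", None, "09504412995", 44))
# gov.uscourts.mad.unknown-09504412995.44.0.pdf
```

Because the doc1 number is part of the name, two different documents that both happen to be number 44 would no longer share a filename.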

That's not 100% consistent with all of my comments above, and I don't know that it's the best choice.