Comment by harlanyu Friday Apr 03, 2015 at 15:34 GMT
I think this is worthwhile to explore. It just depends on how much metadata comes with the orphan PDF when it's sent to the server, and whether we're able to line that up with a docket later.
This is important for 2.0
@johnhawkinson, here's the ticket for handling documents that arrive before their docket does.
We're at least capturing this data now, so it's hopefully just a matter of doing something smart with it. Two more things to do:

1. Whenever we add new docket entries, check whether we have documents lying around to associate with them (see the sketch below).
2. Get metadata from the PDF itself as in #198.
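Here's a minimal standalone sketch of item 1. The real system would query the processing queue; the in-memory dict, model, and IDs here are all illustrative stand-ins:

```python
from dataclasses import dataclass

@dataclass
class OrphanUpload:
    pacer_doc_id: str
    filepath: str

# Orphaned PDFs waiting for a docket, keyed by PACER document ID.
# Stand-in for a query against the processing queue.
orphans: dict[str, OrphanUpload] = {
    "04506336643": OrphanUpload("04506336643", "/sata/recap_processing_queue/a.pdf"),
}

def on_new_docket_entries(pacer_doc_ids: list[str]) -> list[OrphanUpload]:
    """When a docket upload introduces new entries, pull out any orphaned
    PDFs that can now be attached to them."""
    matched = []
    for doc_id in pacer_doc_ids:
        orphan = orphans.pop(doc_id, None)
        if orphan is not None:
            matched.append(orphan)
    return matched

# A docket arrives mentioning one known orphan and one unseen document:
print(on_new_docket_entries(["04506336643", "04506336699"]))
```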
Packrat that I am, I have been keeping every PACER document I've ever paid for. I do my best to always A) download with RECAP and B) upload to PlainSite, but I'd estimate that historically RECAP works 50% of the time, and I have probably on occasion forgotten to upload a document to PlainSite.
I've been meaning to build a feature that compares PACER document hashes with known hashes and fills in what's missing on the server, but I've never gotten around to it. That said, if there were an upload-random-PACER-doc feature for RECAP where one could just drag and drop, say, 5,000 documents to see if they're already on the system and index what's new, I'd use that.
Only thinking of this now because it basically requires getting metadata from the PDF, which is (2) above.
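The client side of that could be as simple as hashing each local PDF and diffing against a list fetched from the server. A rough sketch, assuming SHA-1 as the digest (substitute whatever hash the archive actually stores) and a stand-in `known_hashes` set:

```python
import hashlib
from pathlib import Path

def sha1_of(path: Path) -> str:
    """SHA-1 of a file's bytes, read in chunks to handle large PDFs."""
    h = hashlib.sha1()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def find_missing(local_dir: str, known_hashes: set[str]) -> list[Path]:
    """Return local PDFs whose hashes aren't already in the archive."""
    return [
        p for p in Path(local_dir).expanduser().glob("*.pdf")
        if sha1_of(p) not in known_hashes
    ]

# known_hashes would come from the server; an empty set is a placeholder.
missing = find_missing("~/pacer-docs", known_hashes=set())
print(f"{len(missing)} local documents not yet in the archive")
```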
I don't think the docket headers alone are enough. Here's an example:
```
Case 1:16-cv-00745-ESH Document 1-1 Filed 04/21/16 Page 1 of 2
```
We need something more than this, since the docket number can be repeated across courts. Where I think this feature becomes handy is when we have the pacer_doc_id and the PDF headers. I think with that information, we could merge things in with the rest of the collection.
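The stamp itself is easy to parse; the problem is exactly the one above, i.e. nothing in it names the court. A rough sketch with an illustrative pattern (real stamps vary by court and CM/ECF version):

```python
import re

# Matches the receipt stamp PACER prints across the top of each page.
# Illustrative only; it won't cover every court's variant.
STAMP_RE = re.compile(
    r"Case (?P<docket_number>\S+)\s+"
    r"Document (?P<document_number>[\d-]+)\s+"
    r"Filed (?P<date_filed>\d{2}/\d{2}/\d{2,4})"
)

m = STAMP_RE.search("Case 1:16-cv-00745-ESH Document 1-1 Filed 04/21/16 Page 1 of 2")
if m:
    print(m.groupdict())
    # {'docket_number': '1:16-cv-00745-ESH', 'document_number': '1-1',
    #  'date_filed': '04/21/16'}
```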
But we pretty much always know the court.
Maybe @PlainSite does via their file names, but if he just dumped a bunch of files in an upload box, we wouldn't know the court just from their headers.
I'm sorry, I thought we were talking about uploads from the extension exclusively.
The choice to not permit manual uploads was a deliberate architectural decision to reduce fraud and mischief. Probably you should talk to @sjschultze about it. Maybe there should be a few one-time exceptions. This is a deep philosophical question.
Don't worry, we've talked about this and turned down this kind of data dozens of times. The general rule is no, but when people get into the thousands of documents, the value/trust propositions can change.
OK, but shouldn't we avoid confusing the two cases? I think they are separate Issues?
Agree, yep.
Do we need a separate Issue to track creation of a docket entry where a docket already exists?
Docket entries will be taken care of in any solution to this issue.
At the risk of belaboring the obvious, over in #220 I mentioned:
> Which reminds me of another question: when the docket parser runs and identifies new doc1 IDs, is there a reason it cannot trigger a search of the processing queue for those documents to see if they've already been uploaded? Maybe this is pointless if #61 is resolved.
And Mike replied:
> I think that, or something similar, is the solution to #61, yes.
I…hope not? Or at least, not only that? That is, the uploaded documents should be available regardless of whether the docket info is ever sent.
I think that was implicit, but I just want to make sure it actually is explicit.
> I…hope not? Or at least, not only that? That is, the uploaded documents should be available regardless of whether the docket info is ever sent.
I guess there's a question here as to how we want these documents to be available:

- Do they have an HTML URL in addition to a PDF URL?
- Are they searchable?
- Do they show up via the extensions as an [R] icon?
- Are they in the regular APIs?
I think we could do all of the above if we wanted, but none of this is easy. For example:

- There's a lot of super-important metadata that we'll be lacking if we only have the document. For example, if you're taken to a docket page and it doesn't have the name of the case, how useful is that, even if it lists 10 docs for the case?
- One approach is to just not have docket objects associated with the documents in the DB. That'd give the documents a PDF URL, would allow them to be searchable (probably), and would allow them (probably) to appear as [R] icons. BUT....the HTML URLs for documents rely on the ID of the docket object, which we wouldn't have. For example, this is the HTML URL for a document, and it relies on the docket ID of 4214664 and the slug of national-veterans-legal... to work: https://www.courtlistener.com/docket/4214664/1/national-veterans-legal-services-program-v-united-states/

So having documents without docket associations means they'd need their own URL scheme, OR they'd not be able to get an HTML URL (which is maybe OK, but it will be tricky).
All this thinking aloud to say, it's easy to do the first part of this — save PDFs that don't have dockets yet, and add them once a docket becomes available. The second part — creating virtual dockets — seems difficult and risky to me. Hopefully it doesn't happen too frequently either. If that's the case, it might not be worth it.
OK, the first part of this is done in https://github.com/freelawproject/courtlistener/commit/a338c9d8966a29e5680f82d0feab3062495716bd. That'll fix a lot of this issue. The remaining question is whether/how we want to create "virtual dockets".
> I guess there's a question here as to how we want these documents to be available:
>
> - Do they have an HTML URL in addition to a PDF URL?
> - Are they searchable?
> - Do they show up via the extensions as an [R] icon?
> - Are they in the regular APIs?
1: Whatever / 2: Whatever / 3: Yes / 4: Whatever.
What's really important to me is that I not pay money to download a document that someone else has already downloaded. Everything else is nice-to-have.
> I think we could do all of the above if we wanted, but none of this is easy. For example:
I suppose I don't follow why any of these issues are hard. They might require making decisions, but the decisions don't seem hard.
> - There's a lot of super-important metadata that we'll be lacking if we only have the document.
So what? We'll get it later. (And if we don't, well, we were never going to get it.)
> For example, if you're taken to a docket page and it doesn't have the name of the case, how useful is that, even if it lists 10 docs for the case?
I'm not super worried about this. I find that docket page because I have one of these documents. Maybe I got there via a NEF. Anyhow, I have the document, so I know the name of the case. Or I could look inside the document to find it.
Some of this can be improved by improving the client scraping though.
E.g. if I go to an `iquery.pl` page and search a case and hit "View a Document" without running the docket report, the extension could grab the case name [and last update date?] from the `iquery.pl` result.
> - One approach is to just not have docket objects associated with the documents in the DB. That'd give the documents a PDF URL, would allow them to be searchable (probably), and would allow them (probably) to appear as [R] icons. BUT....the HTML URLs for documents rely on the ID of the docket object, which we wouldn't have. For example, this is the HTML URL for a document, and it relies on the docket ID of 4214664 and the slug of national-veterans-legal... to work: https://www.courtlistener.com/docket/4214664/1/national-veterans-legal-services-program-v-united-states/
Yeah. So obviously HTML URLs would have to adopt a different scheme. It'd be tempting to just use `https://www.courtlistener.com/docket/0/{doc1number}/` and also make `https://www.courtlistener.com/docket/0` nonindexable (or not!).

On the other hand, there's a part of me that says that scheme isn't so great anyhow. It doesn't allow for easily truncating the tail element to get to the parent thing. E.g., you have to rip the `1` out of the middle to get to the case: https://www.courtlistener.com/docket/4214664/national-veterans-legal-services-program-v-united-states/. If instead the scheme were `https://www.courtlistener.com/docket/4214664/national-veterans-legal-services-program-v-united-states/1/` or just plain `https://www.courtlistener.com/docket/4214664/1`, then we wouldn't have those issues. (Although we'd still have this issue's problem.) And also that scheme with a slug is non-optimal when a case caption changes (seems to happen more frequently than I realized...).
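A toy illustration of the truncation point (the `parent_of` helper is hypothetical; the first URL is the real scheme from the example above, the second is the suggested alternative):

```python
current = "https://www.courtlistener.com/docket/4214664/1/national-veterans-legal-services-program-v-united-states/"
proposed = "https://www.courtlistener.com/docket/4214664/national-veterans-legal-services-program-v-united-states/1/"

def parent_of(url: str, scheme: str) -> str:
    """Return the docket URL for a document URL under each scheme."""
    parts = url.rstrip("/").split("/")
    if scheme == "current":
        del parts[-2]  # entry number sits in the middle; rip it out
    else:
        del parts[-1]  # entry number is the tail; just truncate
    return "/".join(parts) + "/"

# Both print the same docket URL, but only the second is a pure truncation:
print(parent_of(current, "current"))
print(parent_of(proposed, "proposed"))
```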
> So having documents without docket associations means they'd need their own URL scheme, OR they'd not be able to get an HTML URL (which is maybe OK, but it will be tricky).
I tend to think it's OK, I'm not a big user of the HTML document pages.
> All this thinking aloud to say, it's easy to do the first part of this — save PDFs that don't have dockets yet, and add them once a docket becomes available. The second part — creating virtual dockets — seems difficult and risky to me. Hopefully it doesn't happen too frequently either. If that's the case, it might not be worth it.
What is the risk?
(Just a reminder, we're in the zone now where we're going beyond what RECAP has ever done. This has been an issue since the very first days of RECAP. The first fix I put in this evening should be a big improvement already. I think the improved parsing in the client will be the next big fix.)
Anyway, I'm hesitating because I predict:
- Breakage. I can totally see weird things happening as a result of this. XML sitemaps, search results, the interaction between the task database (ProcessingQueue objects) and the RECAPDocuments, HTML pages, PDF links, etc. Lots of stuff that interplays.
- Users being confused. If you wind up on a docket, it has no metadata, and just a bunch of documents, you'll be pretty confused. Even if you load it in a tab, then switch to that tab later, it'd be very hard to even know how the tab was created or what case it represented. I think this is the best argument against having stubbed out docket pages for these documents.
- Work. It's probably a good amount of work to do this and get it right.
I think if this is limited to only making the RECAP APIs functional and only making PDF URLs, we could probably mitigate most of my concerns. That'd mean no docket URLs, no search, no HTML document pages, and probably some other similar things.
One nice thing though: Getting this working would allow @bdheath's big law cases to work more easily.
> - Users being confused. If you wind up on a docket, it has no metadata, and just a bunch of documents, you'll be pretty confused. Even if you load it in a tab, then switch to that tab later, it'd be very hard to even know how the tab was created or what case it represented. I think this is the best argument against having stubbed out docket pages for these documents.
I'm not seeing it. Suppose such pages display a box at the top:

> "This docket is empty because documents have been uploaded but a docket page has not, so RECAP doesn't know what case this is. To correct this problem, go and purchase the docket report for this case. Here's a friendly link: $link"
and perhaps brand the page differently, like a different background color or something.
I tend to think none of that is actually necessary (people will figure it out!), but certainly a little exposition addresses it.
Some quick stats after having reprocessed all failed PDFs today:
So...the code that I landed yesterday means that another 369 items would have been incorporated into dockets that otherwise weren't, and we have something like 9,000 (7688 + 1240) PDFs that aren't available now, but that could be (if we can figure out how to do it right).
Also worth mentioning, somehow about 26% of uploads aren't working for folks due to this. That's fairly insane.
Digging some more. Of the 7688 that couldn't find dockets, 5202 are because they don't have a `pacer_case_id` value. Instead they have either `null` or `undefined`. I'm not sure why those are happening, but they could certainly be from the CourtListener "Buy Now" buttons (https://github.com/freelawproject/courtlistener/issues/768). With those out of the way, it's only 2486 items that are orphaned, which is more like 8%. That's still high, but not so crazy.
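That triage amounts to bucketing the failed rows by whether their `pacer_case_id` is usable. A toy sketch with stand-in data (the literal strings "null" and "undefined" are what the uploads apparently contained):

```python
from collections import Counter

failed_uploads = [
    {"pacer_case_id": "null"},
    {"pacer_case_id": "undefined"},
    {"pacer_case_id": "178502"},  # plausible-looking ID; docket just not uploaded yet
]

reasons = Counter(
    "missing case id"
    if row["pacer_case_id"] in (None, "", "null", "undefined")
    else "docket not uploaded yet"
    for row in failed_uploads
)
print(reasons)  # Counter({'missing case id': 2, 'docket not uploaded yet': 1})
```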
Minor update. There are now about 200k PACER docs that we haven't been able to associate with a docket:
```
01:54:16::mlissner@new-courtlistener::/sata/recap_processing_queue
↪ find . -type f | wc -l
193197
```
That's lame.
(Though I hasten to add that this is a problem RECAP has had in one form or another since 2009.)
Would it be possible to make this list available somehow such that people with spare PACER credit could look at the documents and try to concoct a search which will find the appropriate case, to then load the docket? Germ of an idea from Twitter.
I think the direction for encouraging people to spend spare PACER credit is the Pay and Pray feature that is half built: https://github.com/freelawproject/courtlistener/issues/1346
Thanks for the pointer, that feature sounds great!
Still, am I right in supposing that the list of orphan documents has only grown since 2019? Without a way to target getting dockets for the relevant cases, I would imagine that 99% of this trove of already-purchased documents will remain unviewable effectively forever.
I think it'd fit nicely into Pay and Pray. You could have a list of documents that people have prayed for, plus a prioritized list of recent orphans.
Issue by mlissner Wednesday Mar 25, 2015 at 18:22 GMT. Originally opened as https://github.com/freelawproject/recap-server/issues/56
This ticket is to explore the idea of creating virtual dockets when somebody tries to upload a document before it has a docket in RECAP. Currently this is happening when people use Operation Asymptote, but I wager that it happens fairly often outside of this too. When this happens currently, the document the person is uploading gets resoundingly rejected according to the code here.
My hunch is that creating virtual dockets is hard but doable. I don't (yet) have enough of the codebase in my head to really know the issues around this, but I'm guessing @sjschultze or @harlanyu might have thoughts. Is this something we can consider doing? As a first step, we could start tracking the frequency of this to see how big the problem is.
cc: @plainsite, @audiodude