freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Documents belong to multiple cases; multiple cases belong to one docket (the doppelganger bug) #2185

Open johnhawkinson opened 7 years ago

johnhawkinson commented 7 years ago

Overview

This is a long-standing issue but lately it comes up more and more for me.

• In CMECF, there is a many-to-one mapping between docket numbers and documents. A single document can belong to multiple docket numbers, as when an order is filed in two related cases.

• In CMECF, there is a one-to-many mapping between docket numbers and internal caseids (de_caseid): a single docket number can correspond to several caseids. This is extremely common in criminal cases, where the caseids are generally contiguous. It happens when there are multiple defendants who each get a sub-case, but also when there is a single defendant: there is a main case and a single subcase.

This throws a wrench in RECAP because different people will get to the same docket number via different caseid paths. Depending on what one searches for in PACER's iquery.pl and whether you choose All Defendants or single defendant or a combination thereof, you may get different (or multiple) caseids.

For instance, take 1:14-cr-10363-RGS USA v. Cadden et al in ecf.mad:

[screenshot, 2017-11-16]

Or in XML form, query https://ecf.mad.uscourts.gov/cgi-bin/possible_case_numbers.pl?1410363 (free) to get:

<request number="1410363">
  <case number="1:14-cr-10363" id="166116" title="1:14-cr-10363-RGS USA v. Cadden et al" defendant="0" sortable="1:2014-cr-10363-RGS"/>
  <case number="1:14-cr-10363-1" id="166117" title="1:14-cr-10363-RGS-1 Barry J. Cadden (closed 06/27/2017)" defendant="1" sortable="1:2014-cr-10363"/>
  <case number="1:14-cr-10363-2" id="166118" title="1:14-cr-10363-RGS-2 Glenn A. Chin" defendant="2" sortable="1:2014-cr-10363"/>
  <case number="1:14-cr-10363-3" id="166119" title="1:14-cr-10363-RGS-3 Gene Svirskiy" defendant="3" sortable="1:2014-cr-10363"/>
  <case number="1:14-cr-10363-4" id="166120" title="1:14-cr-10363-RGS-4 Christopher M. Leary" defendant="4" sortable="1:2014-cr-10363"/>
  <case number="1:14-cr-10363-5" id="166121" title="1:14-cr-10363-RGS-5 Joseph M. Evanosky" defendant="5" sortable="1:2014-cr-10363"/>
  <case number="1:14-cr-10363-6" id="166122" title="1:14-cr-10363-RGS-6 Scott M. Connolly" defendant="6" sortable="1:2014-cr-10363"/>
  <case number="1:14-cr-10363-7" id="166123" title="1:14-cr-10363-RGS-7 Sharon P. Carter" defendant="7" sortable="1:2014-cr-10363"/>
  <case number="1:14-cr-10363-8" id="166124" title="1:14-cr-10363-RGS-8 Alla V. Stepanets" defendant="8" sortable="1:2014-cr-10363"/>
  <case number="1:14-cr-10363-9" id="166125" title="1:14-cr-10363-RGS-9 Gregory A. Conigliaro" defendant="9" sortable="1:2014-cr-10363"/>
  <case number="1:14-cr-10363-10" id="166126" title="1:14-cr-10363-RGS-10 Robert A. Ronzio" defendant="10" sortable="1:2014-cr-10363"/>
  <case number="1:14-cr-10363-11" id="166127" title="1:14-cr-10363-RGS-11 Kathy S. Chin (closed 10/04/2016)" defendant="11" sortable="1:2014-cr-10363"/>
  <case number="1:14-cr-10363-12" id="166128" title="1:14-cr-10363-RGS-12 Michelle L. Thomas (closed 10/04/2016)" defendant="12" sortable="1:2014-cr-10363"/>
  <case number="1:14-cr-10363-13" id="166129" title="1:14-cr-10363-RGS-13 Carla R. Conigliaro (closed 11/10/2016)" defendant="13" sortable="1:2014-cr-10363"/>
  <case number="1:14-cr-10363-14" id="166130" title="1:14-cr-10363-RGS-14 Douglas A. Conigliaro (closed 11/10/2016)" defendant="14" sortable="1:2014-cr-10363"/>
  <case number="1:14-cv-10363" id="157735" title="1:14-cv-10363-DPW Spencer v. Fresenius Medical Care Holdings, Inc. et al" sortable="1:2014-cv-10363-DPW"/>
</request>

All caseids from 166116-166130 refer to the same docket number. Many (most?) documents in the case belong to multiple (all?) subcases.
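A minimal sketch of how a server-side process might group the caseids in a possible_case_numbers.pl response by docket number. The XML is an abbreviated copy of the response above. The function name `group_caseids` and the case-number regex are assumptions made for illustration, not CL or juriscraper code.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Abbreviated copy of the ecf.mad response quoted above.
SAMPLE_XML = """<request number="1410363">
  <case number="1:14-cr-10363" id="166116" title="1:14-cr-10363-RGS USA v. Cadden et al" defendant="0" sortable="1:2014-cr-10363-RGS"/>
  <case number="1:14-cr-10363-1" id="166117" title="1:14-cr-10363-RGS-1 Barry J. Cadden (closed 06/27/2017)" defendant="1" sortable="1:2014-cr-10363"/>
  <case number="1:14-cr-10363-2" id="166118" title="1:14-cr-10363-RGS-2 Glenn A. Chin" defendant="2" sortable="1:2014-cr-10363"/>
  <case number="1:14-cv-10363" id="157735" title="1:14-cv-10363-DPW Spencer v. Fresenius Medical Care Holdings, Inc. et al" sortable="1:2014-cv-10363-DPW"/>
</request>"""

# Assumed shape: office:yy-type-sequence, optionally followed by -<defendant number>.
CASE_NUMBER_RE = re.compile(r"^(\d+:\d{2}-[a-z]{2,4}-\d+)(?:-(\d+))?$")


def group_caseids(xml_text):
    """Map each base docket number to a {defendant_number: caseid} dict."""
    groups = defaultdict(dict)
    for case in ET.fromstring(xml_text).iter("case"):
        match = CASE_NUMBER_RE.match(case.get("number"))
        if match is None:
            continue  # unexpected format; skip rather than guess
        base = match.group(1)
        # Civil entries carry no defendant attribute; treat them as 0.
        defendant = int(case.get("defendant", "0"))
        groups[base][defendant] = int(case.get("id"))
    return dict(groups)


groups = group_caseids(SAMPLE_XML)
print(groups["1:14-cr-10363"])  # {0: 166116, 1: 166117, 2: 166118}
```

Keying on the base docket number keeps the civil case (157735) separate from the criminal main case and its subcases, even though they share the trailing digits 10363.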

But RECAP and CL treat them like different dockets with identical docket numbers, and don't show the subcase suffix number either.

For instance the main case is https://www.courtlistener.com/docket/4275782/united-states-v-cadden/ which has docket entries through DE514 (Jan. 2016), and was last updated 2 months ago.

But the -1 case is https://www.courtlistener.com/docket/5135835/united-states-v-cadden/ which has entries through DE1260 (Oct. 24, 2017), and was last updated 12 days ago.

But the -2 case is https://www.courtlistener.com/docket/6145187/united-states-v-cadden/ which has entries through DE1281, but was also updated 12 days ago.

Although the -2 case is more recent, it doesn't actually have the PDF for DE1260.

So this is like a huge mess.

Single-defendant criminal cases, too

The problem even occurs for single defendant criminal cases, although the path to pain is less obvious. Let's take our friend George Papadopoulos, in ecf.dcd. He's the sole defendant and it looks like there's only one case:

https://ecf.dcd.uscourts.gov/cgi-bin/possible_case_numbers.pl?17182

<request number="17182">
  <case number="1:17-cr-182" id="189898" title="1:17-cr-00182-RDM USA v. PAPADOPOULOS" sortable="1:2017-cr-00182"/>
  <case number="1:17-cv-182" id="184128" title="1:17-cv-00182-APM MITCHELL v. YELLEN" sortable="1:2017-cv-00182-APM"/>
</request>

So it looks like it's just 189898. But, surprise:

https://ecf.dcd.uscourts.gov/cgi-bin/DktRpt.pl?189897 1:17-cr-00182-RDM USA v. PAPADOPOULOS
https://ecf.dcd.uscourts.gov/cgi-bin/DktRpt.pl?189898 1:17-cr-00182-RDM-1 - PAPADOPOULOS, GEORGE

The '898 is easily found in the PACER UI, but unfortunately we can't ignore the '897, because it appears in other places. For instance, the email NEF sent to parties and "interested party" ECF users yesterday:

Notice of Electronic Filing 
The following transaction was entered  on 11/15/2017 2:42 PM EDT and filed
on 11/9/2017 

Case Name: USA v. PAPADOPOULOS                                                  

Case Number: 1:17-cr-00182-RDM
https://ecf.dcd.uscourts.gov/cgi-bin/DktRpt.pl?189897

Of course, this problem is more likely to affect people who use NEFs, who are mostly lawyers and journalists, and not too many members of the general public. But those are important RECAP constituencies.

Upshot

CL needs to track the docket number and caseid for each document independently, recognizing there can be more than one of each. For sanity's sake, CL docket pages should make the caseid visible somewhere (IA docket pages had it in the URL), even if it's small and at the bottom. Makes debugging your brain much simpler.

CL should acknowledge the concept of subdockets. I'm not sure all of what this entails. This is a nice-to-have, but not critical. If all the searches for 1:14-cr-10363 returned an amalgamation of the main docket and 14 subdockets, that would not be so bad.

Maybe

Perhaps the RECAP extension should query docket number against possible_case_numbers.pl, and report to the server associated caseids. I think this is a bad idea, because it means the extension is no longer passive, it can be identified (and blocked) by the courts, and it is using a nonpublic API. Furthermore, it would not return the second caseid in the case of a single-defendant case.

Perhaps the RECAP extension should query adjacent caseids against DktRpt.pl until it runs into a different docket number on either side. Again, for the same reasons as above, I think that's bad. Also it could be many queries. I ran into a 60-defendant case last night.

Perhaps the CL server should do these queries, maybe on a one-time basis.
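For concreteness, a rough sketch of the adjacent-caseid scan described above. It assumes a hypothetical `lookup` callable that maps a caseid to its docket number (standing in for a DktRpt.pl query), and it only finds contiguous caseids. The two neighboring docket numbers in the stub are invented for the example.

```python
def scan_adjacent_caseids(start_caseid, lookup, max_steps=100):
    """Collect contiguous caseids whose docket number matches the starting one.

    `lookup` is a hypothetical callable: caseid -> docket number (or None).
    `max_steps` caps the number of queries made in each direction.
    """
    target = lookup(start_caseid)
    if target is None:
        return []
    found = [start_caseid]
    for step in (-1, 1):  # walk downward first, then upward
        caseid = start_caseid + step
        for _ in range(max_steps):
            if lookup(caseid) != target:
                break  # hit a different docket number (or nothing at all)
            found.append(caseid)
            caseid += step
    return sorted(found)


# Stub standing in for the court; the 189896 and 189899 rows are hypothetical.
FAKE_DOCKETS = {
    189896: "1:17-cr-00181",
    189897: "1:17-cr-00182-RDM",
    189898: "1:17-cr-00182-RDM",
    189899: "1:17-cr-00183",
}

print(scan_adjacent_caseids(189898, FAKE_DOCKETS.get))  # [189897, 189898]
```

The `max_steps` cap matters because each step is one query against the court, and a large multi-defendant case would need many of them.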

Mitigation

It should be straightforward to identify, in the CL database, where there are multiple caseids for a given docket number, and then take some action to combine them. This is separate from, but related to, what the server and extension should do about this going forward.
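A sketch of what that identification query could look like, run here against an in-memory SQLite table. The table and column names (`docket`, `court_id`, `docket_number`, `pacer_case_id`) are assumptions for illustration, and the civil row's id is made up. The three criminal rows pair the CL docket ids and caseids cited above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE docket (id INTEGER PRIMARY KEY, court_id TEXT, "
    "docket_number TEXT, pacer_case_id INTEGER)"
)
conn.executemany(
    "INSERT INTO docket VALUES (?, ?, ?, ?)",
    [
        (4275782, "mad", "1:14-cr-10363", 166116),  # main case
        (5135835, "mad", "1:14-cr-10363", 166117),  # -1 subcase
        (6145187, "mad", "1:14-cr-10363", 166118),  # -2 subcase
        (1, "mad", "1:14-cv-10363", 157735),        # hypothetical CL id
    ],
)

# Docket numbers within one court that map to more than one caseid.
rows = conn.execute(
    "SELECT court_id, docket_number, COUNT(DISTINCT pacer_case_id) "
    "FROM docket "
    "GROUP BY court_id, docket_number "
    "HAVING COUNT(DISTINCT pacer_case_id) > 1"
).fetchall()

print(rows)  # [('mad', '1:14-cr-10363', 3)]
```

Grouping on court as well as docket number avoids lumping together identically numbered cases from different districts.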

Discuss!

johnhawkinson commented 7 years ago

footnote: it's also possible to run "combined docket reports" for multiple cases. I have not tried, but I assume these do the wrong thing in RECAP (I cannot imagine them possibly doing the right thing, given the lack of subdoc support).

For instance, in Cadden, checking subcases -1 and -2 from the iquery.pl form leads you to a docket page like this: https://ecf.mad.uscourts.gov/cgi-bin/DktRpt.pl?166118;166117 which, if run, gives you either both docket reports consecutively or, if you check the "Combined docket report" checkbox, the two reports merged together and sorted by the specified sort order.

Even more frightfully from the RECAP perspective, this combined report feature is not limited to subcases. You can enter arbitrary unrelated caseids separated by semicolons in the URL parameter string for DktRpt.pl and other queries.

Although probably nobody does this, so it's not a big worry. But the data model should accommodate it. And I think it screws up receipts (that is, they are not reliable indicators of the case to which the document belongs, if indeed the document belongs to a single case in the combined report).
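To make the "data model should accommodate it" point concrete, here is a small sketch that pulls every caseid out of a DktRpt.pl URL instead of assuming there is exactly one. The helper name is hypothetical. It relies only on the semicolon-separated query format shown above.

```python
from urllib.parse import urlsplit


def caseids_from_dktrpt_url(url):
    """Return every caseid found in a DktRpt.pl URL's query string, in order."""
    query = urlsplit(url).query
    # Ignore anything that is not a bare run of digits.
    return [int(part) for part in query.split(";") if part.isdigit()]


combined = caseids_from_dktrpt_url(
    "https://ecf.mad.uscourts.gov/cgi-bin/DktRpt.pl?166118;166117"
)
single = caseids_from_dktrpt_url(
    "https://ecf.dcd.uscourts.gov/cgi-bin/DktRpt.pl?189897"
)
print(combined)  # [166118, 166117]
print(single)    # [189897]
```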

mlissner commented 7 years ago

Looks like at least three issues here. The first issue here is that documents can belong to multiple cases. I've split that off into its own ticket: freelawproject/courtlistener#765

johnhawkinson commented 7 years ago

This is a better description of the "doppelganger cases" issue described in freelawproject/recap#36 and freelawproject/recap#146 [Editor's note: both now closed as dups].

johnhawkinson commented 7 years ago

For the record, subcases need not have consecutive caseids. See US v. Murgio in SDNY:

<request number="15cr769">
  <case number="1:15-cr-769" id="449632" title="1:15-cr-00769-AJN USA v. Murgio et al" defendant="0" sortable="1:2015-cr-00769-AJN"/>
  <case number="1:15-cr-769-1" id="449633" title="1:15-cr-00769-AJN-1 Anthony R. Murgio (closed 10/25/2017)" defendant="1" sortable="1:2015-cr-00769"/>
  <case number="1:15-cr-769-2" id="450676" title="1:15-cr-00769-AJN-2 Yuri Lebedev (closed 11/01/2017)" defendant="2" sortable="1:2015-cr-00769"/>
  <case number="1:15-cr-769-3" id="454366" title="1:15-cr-00769-AJN-3 Trevon Gross (closed 11/16/2017)" defendant="3" sortable="1:2015-cr-00769"/>
  <case number="1:15-cr-769-4" id="456495" title="1:15-cr-00769-AJN-4 Michael J. Murgio (closed 01/30/2017)" defendant="4" sortable="1:2015-cr-00769"/>
  <case number="1:15-cr-769-5" id="464041" title="1:15-cr-00769-AJN-5 Jose M Freundt" defendant="5" sortable="1:2015-cr-00769"/>
  <case number="1:15-cr-769-6" id="467688" title="1:15-cr-00769-AJN-6 Ricardo Hill" defendant="6" sortable="1:2015-cr-00769"/>
</request>

mlissner commented 6 years ago

Grr, automated commit message thing. This is not fixed.

mlissner commented 5 years ago

I confess I'm still not sure how to proceed here. When we have multiple pacer case IDs, are those IDs just a different view into the same docket or are they actually different dockets altogether? In some form, we need to link all these dockets together under one umbrella, like PACER does, but I don't understand what PACER is accomplishing with these well enough to understand how to do it in our UI.

danieldjewell commented 5 years ago

@mlissner I see the dilemma and yes, I think this is what I was running into with freelawproject/recap#267 ...

I've been thinking about it and, ultimately, I think @johnhawkinson hit on probably the best solution -- CL needs an additional layer that knows about the sub-dockets as that seems to be at the core. As noted, it seems that documents can belong to multiple sub-dockets simultaneously but those sub-dockets might also have their own unique items.

Right now, as described, CL treats these individual sub-dockets as separate cases in the database (e.g. they get their own CL docket ID number because the pacer docket ID is different). This results in the behavior we're seeing here. Instead, if there's a "related dockets" table, this data could be broken out and then separated.

Consider 2 situations:

Simple Docket

Use existing systems, no sub/related cases. Things function as normal.

Complex Docket

Sub-Dockets or related cases - Need to maintain information that allows for correlation of related dockets.

High Level Overview

New DB table that indicates the relationships between the dockets and, if needed, storage of any "master" information. Extend existing docket table to include a column indicating the "master" docket entry. (If null or 0, no master docket - or something similar). Keep maintaining separate CL entries for each sub-docket (because that matches how PACER works) but if there are multiple related entries, display a "master" page indicating all related dockets.

Correlating the cases could be accomplished in multiple ways - some more accurate/reliable than others:

UI Ideas

I'm not 100% sure how best to implement searching -- but for related dockets, display 1 entry in the search results leading to the master docket page that then shows the sub-related dockets. (Similar to how the PACER query looks that @johnhawkinson posted). On each sub-docket page, at a minimum, display a link back to the parent master docket - preferably a sidebar listing of related dockets.

Closing Thoughts / Caveats

I have not spent nearly enough time to fully understand the existing database structure and the inner workings (gotta pay the bills, we all know how that one goes) - there might be something I'm missing here but it seems that the only way to fix this is to give CL/RECAP the ability to know that dockets could be related to each other. I think that probably adding an additional DB layer to store that relationship information would enable this to be resolved once and for all.

Consider this code:

https://github.com/freelawproject/courtlistener/blob/19b215cfc62f91044e02bfd0a827f56cfa23ecb0/cl/recap/tasks.py#L1355-L1360

If I'm reading this correctly, (a) the current method for dealing with duplicate dockets is to update the oldest and (b) the system differentiates between CL dockets on the basis of the PACER case ID/docket number... which can be different for related cases.

johnhawkinson commented 5 years ago

> I think @johnhawkinson hit on probably the best solution

I thought that was undisputed :).

> Correlating the cases could be accomplished in multiple ways - some more accurate/reliable than others:

Well, normally speaking all the case numbers in this situation are consecutive, so that's a huge win.

> Offline/Back-end correlation between documents (e.g. if a document appears in one docket and the exact same document appears in another... SHA1 hashes of the documents are already available in the DB, could query on this)

Well. Please don't use the term "related" for multiple subdockets of the same master criminal docket. We use the term "related" to refer to a different kind of relationship between cases, like where I file a civil action a year after you did while yours is still pending and they address common issues of law but joinder may not be appropriate, so I mark my case as related to yours and they are typically assigned to the same judge for reasons of judicial economy (varies district-to-district). Or similarly in an MDL context. This usage is important because:

Court staff have the ability to file a CMECF document in multiple cases, and those cases will all refer to the same docket number. The cases need not have a subcase relationship. It is typically the case that this happens in related cases, though, using the "related" meaning that I have explained above.

> Parsing the case number more ? (also, perhaps, tracking changes to the case number? this one is complicated)

I'm not entirely sure what you mean by this.

> The RECAP extension demonstrates an interesting method -- when viewing a docket report on PACER, the extension queries the RECAP API for document availability (so that it can display that nice "R" icon next to the document links to indicate that the document exists in RECAP). From my research, this query utilizes only 2 parameters: the court abbreviation and the PACER document ID. So, in that sense, that should make a unique key: if a (court,pacer_doc_id) pair is found in multiple docket reports, it seems (??) that you could make the inference/conclusion that in whichever docket reports that is found, the cases are related?

See above. They are likely to be related (but possibly not; say a judge gets sick and the chief judge dockets a stay/postponement order in all of his active cases with calendar dates in the next week), but not necessarily with a subdocket relationship.
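A sketch of the (court, pacer_doc_id) inference discussed here, with the caveat just given: a shared document only marks dockets as candidates for linking and does not prove a subdocket relationship. The row shape and the document ids are hypothetical.

```python
from collections import defaultdict


def dockets_sharing_documents(rows):
    """Find documents that appear in more than one docket.

    `rows` is an iterable of (docket_id, court, pacer_doc_id) tuples.
    Returns {(court, pacer_doc_id): sorted docket ids} for shared documents only.
    """
    seen = defaultdict(set)
    for docket_id, court, pacer_doc_id in rows:
        seen[(court, pacer_doc_id)].add(docket_id)
    return {key: sorted(ids) for key, ids in seen.items() if len(ids) > 1}


# Docket ids are the Cadden dockets cited earlier; document ids are made up.
rows = [
    (5135835, "mad", "doc-A"),
    (6145187, "mad", "doc-A"),
    (6145187, "mad", "doc-B"),
    (4275782, "mad", "doc-C"),
]
shared = dockets_sharing_documents(rows)
print(shared)  # {('mad', 'doc-A'): [5135835, 6145187]}
```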

bishwashere commented 3 years ago

The data model at https://www.courtlistener.com/api/rest-info/ can have this change to begin with:

The RECAPDocument table cannot have a "docket_entry" field. More than one case (and therefore more than one docket) can refer to the same document. This is common not only in criminal cases but in any kind of case. Therefore, the DocketEntry table must keep the reference to the document instead.
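A toy illustration of that inversion, using plain dataclasses rather than the real CL models. The class and field names are assumptions, and the document id is a placeholder. The point is only that each docket entry references the document, so one document can sit under entries in several dockets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    court: str
    pacer_doc_id: str  # placeholder value below, not a real PACER id


@dataclass
class DocketEntry:
    docket_id: int
    entry_number: int
    document: Document  # the entry points at the document, not the reverse


order = Document(court="mad", pacer_doc_id="hypothetical-doc-id")

# The same document appears as DE1260 on two different CL dockets.
entries = [
    DocketEntry(docket_id=5135835, entry_number=1260, document=order),
    DocketEntry(docket_id=6145187, entry_number=1260, document=order),
]

dockets_for_order = sorted(e.docket_id for e in entries if e.document == order)
print(dockets_for_order)  # [5135835, 6145187]
```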

danieldjewell commented 3 years ago

I'm not sure if there's a separate issue on this but: The problem of (what appears to be) a single docket in PACER turning into multiple dockets/cases on RECAP is still a major issue. See: https://www.courtlistener.com/?type=r&q=&type=r&order_by=score%20desc&docket_number=2%3A18-cr-00422&court=azd

I need to do more digging but it appears that all 8 of these RECAP dockets will lead to the same PACER docket report (when using the "View on PACER" blue header button).

More interestingly/concerningly, documents are being uploaded and associated, but not always with the same RECAP docket. Further, the RECAP extension appears to be able to find the document availability in RECAP without an issue... (when viewing the "do you want to buy this document" page in PACER)

I remember there being a discussion about how a (supposedly single) PACER docket could somehow turn into multiple RECAP dockets. Regardless, this is becoming a bigger and bigger issue.

I need to look a bit more at the 8 different RECAP dockets in the search link above but it does appear that there are documents that are associated with only one of the RECAP dockets. (In other words, there are unique documents in each RECAP docket.)

From a data accuracy/integrity standpoint, this is kinda messy. Perhaps solving the creation of multiple dockets in RECAP is unnecessary - perhaps the solution is to make the links work in every RECAP docket? (assuming there's something in the database that would associate the multiple RECAP dockets)

johnhawkinson commented 3 years ago

I believe this is the proper issue, @danieldjewell. The case you cite, USA v. Lacey, is expected to have 8 RECAP dockets, since there are 7 criminal subcases plus the master case:

2:18-cr-00422-SMB USA v. Lacey et al -
2:18-cr-00422-SMB-1 Michael Lacey
2:18-cr-00422-SMB-2 James Larkin
2:18-cr-00422-SMB-3 Scott Spear
2:18-cr-00422-SMB-4 John Brunst
2:18-cr-00422-SMB-5 Dan Hyer
2:18-cr-00422-SMB-6 Andrew Padilla
2:18-cr-00422-SMB-7 Joye Vaught

I do think it's a correct observation that the CourtListener docket report should stop searching by case number and document ID and merely search by document ID, and that would remove some of the pain, at least where the docket report had been run.

But this problem calls out for more serious attention than it has gotten, since basically "RECAP is unusable for criminal cases" is where it shakes out, and that just sucks.

hughbe commented 2 years ago

I raised the issue #2181, cited above. I’m wondering about how this problem can be fixed. Would it be possible to merge identical dockets? Or for example to make a request to do so?

GammaGames commented 1 year ago

FYI this issue was referenced on the Law SE site: Why are there two case numbers for United States v. Trump?

vwkd commented 5 months ago

+1 For United States v. Assange there also seem to be two similar entries.

https://www.courtlistener.com/docket/14488925/united-states-v-assange/ https://www.courtlistener.com/docket/14488287/united-states-v-assange/

https://www.courtlistener.com/docket/68881226/united-states-v-assange/ https://www.courtlistener.com/docket/68881225/united-states-v-assange/

flooie commented 4 months ago

I'm not going to break any new ground with this comment, @mlissner, but criminal is a mess, and I think that may have a lot to do with why criminal is not as hot a topic as civil.

Because we do not properly link and process criminal cases we are creating numerous problems for our users and no doubt confusion. PACER creates a parent docket and child dockets for each criminal case. Every child docket goes onto the parent docket.

We need to update our model to mimic this pattern and point our users to the parent docket, while also allowing someone to find the child docket if they so choose.

To do this we need to

We may have some difficulty always identifying the parent pacer case id in single defendant cases, because the numbers are not always sequential (though they normally are). And it's going to be nerve-racking decoupling docket entries.

But right now our users could subscribe to a criminal case, that case could plead out, and they would miss the remaining 15 years of updates. They could also buy a document and not realize it was added to a child docket or the parent docket. They could also buy the same docket multiple times across dockets.

Yikes.

Once we do this, though, we will be able to look at one docket, reduce our search queries in criminal cases a lot, and reduce the number of documents stored because of duplicates.

mlissner commented 4 months ago

Some investigation about how to fix this...

  1. If you get the party information for a sub-docket, it shows only the defendant for that docket, including their number:

    image

  2. You can get the defendant name and number from docket page:

    image

    ...or the iquery page (free):

    image

    ...or the hidden API (duh, free):

    image

    ...or the docket query page (free):

    image

  3. if the docket doesn't have the defendant number, that means it's the "All defendants" main docket.

  4. Every criminal case seems to have a main docket and subdockets. For example, here are two dockets for U.S. v. Boeing:

    This is pretty lame. I'm not sure the point of this and I wonder if we should hide the child dockets for such cases entirely.
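A sketch of how a parser might split a docket number into its core and its defendant number, treating a missing suffix as the "All defendants" main docket per point 3. The regex and function name are assumptions for illustration, not the juriscraper implementation.

```python
import re

DOCKET_RE = re.compile(
    r"^(?P<core>\d+:\d{2}-[a-z]{2}-\d+)"  # office:year-type-sequence
    r"(?P<judges>(?:-[A-Za-z]+)*)"        # optional judge initials
    r"(?:-(?P<defendant>\d+))?$"          # optional defendant number
)


def split_docket_number(docket_number):
    """Return (core, defendant_number), or None if the string does not match.

    defendant_number is None for a main docket with no suffix.
    """
    match = DOCKET_RE.match(docket_number)
    if match is None:
        return None
    defendant = match.group("defendant")
    return match.group("core"), (int(defendant) if defendant else None)


print(split_docket_number("1:14-cr-10363-RGS-1"))  # ('1:14-cr-10363', 1)
print(split_docket_number("2:18-cr-00422-SMB"))    # ('2:18-cr-00422', None)
```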


So I think there are a few big pieces of this:

  1. Search: How should we show this in search so people understand what's going on? I'm not totally opposed to how it is now. Maybe just add the defendant number to the docket number, and that's enough?

    image

    And:

    image

  2. Capturing defendant numbers. I think we can upgrade the RECAP parsers to start including this in their output. That'd be step one. I think from there, we can add a defendant_number field to the docket table. We could start populating it with RECAP uploads pretty easily, and we could think about ways of backfilling it.

    It can be pulled from the massive iquery crawl we're about to do (we'll save the HTML), from existing HTML we've got, from the hidden API or from the IDB. Of these, the IDB looks the most promising. From its data dictionary handbook:

    ![image](https://github.com/user-attachments/assets/60df6109-a880-4e51-838f-a034d9591abd)

    If we use the IDB, we get all the defendant numbers in one swoop. It says it has the defendant names, but I'm doubtful that the field is populated.

  3. That'd get us all the historical defendant numbers, which is pretty great, and it'd leave a few problems:

    1. The dockets aren't linked. A few ways to fix this:

      • Via the DB: A new table called, say, "related_dockets" with two columns, parent_docket and main_docket.
      • Via the DB: A new self-join field on the docket table called "parent_docket", which we could populate for sub-dockets.
      • Via Elastic: We just search on docket number and court, and that's that. This is bad for people relying on the DB, and is probably not the way to do it.

      I think I prefer the second approach.

      Once this is fixed in the data model, we can add it to the interface. Probably the easiest way is to add a new tab to the docket page where you can find the related cases. Ideally it wouldn't show up for single-defendant cases.

    2. The documents and docket entries can't be associated with multiple dockets. This is a much harder problem to solve because it means converting our FK from the docket_entries table to a m2m table. That's doable technically, but I don't love it because it makes an already complex data model even worse, it will come with performance pain, and our replication customers, APIs, etc will be hard to upgrade.

      A part of me is tempted to just ignore this failing in our data model and copy data around instead. Storage is cheap. Maybe that's an easy way to fix this, but it's not ideal.

  4. Finally, I don't think adding a new defendant_party_name field or something like that to the docket is a good idea. We've already got it via the parties table.
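A rough sketch of the second approach (a self-join `parent_docket` field), shown with SQLite rather than the real Django model. The table and column names are assumptions. The docket ids are the Cadden dockets cited earlier in the thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE docket (
        id INTEGER PRIMARY KEY,
        docket_number TEXT,
        defendant_number INTEGER,                       -- NULL on the main docket
        parent_docket_id INTEGER REFERENCES docket(id)  -- NULL on the main docket
    )
    """
)
conn.executemany(
    "INSERT INTO docket VALUES (?, ?, ?, ?)",
    [
        (4275782, "1:14-cr-10363", None, None),  # main "All defendants" docket
        (5135835, "1:14-cr-10363", 1, 4275782),
        (6145187, "1:14-cr-10363", 2, 4275782),
    ],
)

# Everything a sub-dockets tab on the main docket page would need.
children = conn.execute(
    "SELECT id, defendant_number FROM docket "
    "WHERE parent_docket_id = ? ORDER BY defendant_number",
    (4275782,),
).fetchall()

print(children)  # [(5135835, 1), (6145187, 2)]
```

A docket with no children returns an empty list here, which is one way the tab could be hidden when there is nothing to show.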


So, summarizing, what I think we need to do is:

  1. Finish researching the IDB as a source.
    1. If IDB won't work, upgrade the iquery ingestor to save the HTML, and run it.
  2. Upgrade juriscraper parsers to include the defendant_number.
  3. Add defendant_number and parent_docket fields to the search_docket table.
  4. Upgrade our ingestion pipeline to add the defendant_number to the docket when saving new HTML.
  5. Go through all the iquery HTML or IDB data and grab the defendant number and name, if we can.
  6. Upgrade our ingestors to populate the parent_docket field.
  7. Run a script to populate past parent_docket fields (use the defendant_number fields).
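Step 7 might look something like the sketch below. It assumes each docket record already carries a court, a core docket number, and a defendant_number that is None on the main docket. The dict keys and the `assign_parents` name are hypothetical, and the last row is an invented orphan.

```python
def assign_parents(dockets):
    """Return {sub_docket_id: parent_docket_id} for sub-dockets whose main docket is known."""
    # Index main dockets (no defendant number) by court and core docket number.
    mains = {
        (d["court"], d["core_number"]): d["id"]
        for d in dockets
        if d["defendant_number"] is None
    }
    parents = {}
    for d in dockets:
        if d["defendant_number"] is None:
            continue  # main dockets have no parent
        parent_id = mains.get((d["court"], d["core_number"]))
        if parent_id is not None:
            parents[d["id"]] = parent_id
    return parents


dockets = [
    {"id": 4275782, "court": "mad", "core_number": "1:14-cr-10363", "defendant_number": None},
    {"id": 5135835, "court": "mad", "core_number": "1:14-cr-10363", "defendant_number": 1},
    {"id": 6145187, "court": "mad", "core_number": "1:14-cr-10363", "defendant_number": 2},
    # Hypothetical sub-docket whose main docket has not been ingested yet.
    {"id": 999, "court": "dcd", "core_number": "1:17-cr-00182", "defendant_number": 1},
]
parents = assign_parents(dockets)
print(parents)  # {5135835: 4275782, 6145187: 4275782}
```

Sub-dockets with no known main docket are left unassigned rather than guessed at, so they can be picked up on a later run.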

That's a solid start and leaves the documents part. I'm not sure what to do there yet.

flooie commented 4 months ago

  1. We can't use IDB to populate the defendant number because the name field isn't populated.
  2. We wouldn't be able to tell if a case was a parent case or a sub docket in those cases without checking the pacer case id against the docket query page.
  3. I disagree that we should keep the search as it is. It makes more sense to give the main docket for a result. Otherwise we just confuse a user with the same result 15 times.

mlissner commented 4 months ago

Thanks. Good point about IDB. Bummer about that.

Memorializing the conversation from our 1-on-1:

That should get us pretty far towards a solution. Other goals:

I think that'll get us close.