Circuit dockets have the docket number (xx-xxxx) in the pacer_case_id field

freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.

https://www.courtlistener.com

Circuit dockets have the docket number (xx-xxxx) in the pacer_case_id field #2506

Closed mlissner closed 1 year ago

mlissner commented 1 year ago

I noticed today that the "View on PACER" link on a bunch of circuit court dockets doesn't work. @ERosendo's explanation is that the docket number (xx-xxxx) is stored in the pacer_case_id field instead of the docket's actual PACER case ID.

I'm not sure how that's happening, but I sort of think it was (or is?) on purpose. Now that we're deeper into the circuit court RECAP world, we should take another look at this and see how it is happening, if it still is, and whether we can or should fix it.

Another problem this causes is that when you do upload that case again later, it mis-matches in the find_docket function, and you wind up creating a second one instead of merging with the original. Not great.

albertisfu commented 1 year ago

I've checked this, and it seems that the use of the docket_number instead of the pacer_case_id was in fact on purpose in the past, e.g.:

You can see this one was uploaded by RECAP, and the pacer_case_id in the ProcessingQueue is the docket_number: https://www.courtlistener.com/docket/59241642/gun-owners-of-america-inc-v-doj/ https://www.courtlistener.com/admin/recap/processingqueue/4534786/change/

Another example: https://www.courtlistener.com/docket/6150738/kiobel-v-cravath-swaine-moore-llp/ https://www.courtlistener.com/admin/recap/processingqueue/755405/change/

There are others where I couldn't find a related PQ, but the docket has a stored XML file containing the docket data. You can see the XML doesn't have a pacer_case_id field but a pacer_case_num field, and its value is the docket number: https://www.courtlistener.com/docket/6143458/curt-gilgenbach-v-state-of-illinois/ https://ia800808.us.archive.org/32/items/gov.uscourts.ca7.17-2011/gov.uscourts.ca7.17-2011.docket.xml

So, maybe in the past appellate dockets didn't have a pacer_case_id, so the pacer_case_num was assigned instead. Now that appellate dockets have a pacer_case_id and the RECAP extension uses it, this no longer seems to be happening.

mlissner commented 1 year ago

It looks like we have a lot of appellate numbers in the pacer_case_id field. We're going to remove these so the field is clean.

When we do, the pacer_docket_url property of the Docket object will no longer work, since it relies on the pacer_case_id. I think it's possible to make these URLs using the docket number instead. @ERosendo, can you please check if that's possible, and do a small PR to update the property so it works even when we only have the docket_number?

ERosendo commented 1 year ago

@mlissner It's possible to create the URL using the docket number. In fact, appellate pages use that value instead of the case ID to create the link to the General Docket, and CL was creating links that use the caseNum until we fixed this issue: https://github.com/freelawproject/courtlistener/issues/1865.

mlissner commented 1 year ago

Funny how quickly I forget things, @ERosendo! Never forget how bad my memory is. :)

So, it sounds like we can make it work using either the pacer_case_id (preferred) or the docket_number, so I'd say let's make it use both. If the pacer_case_id is there, we use it. If not, we fall back to the docket number?
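The fallback proposed here can be sketched in a few lines. This is only an illustration of the selection order, not the actual pacer_docket_url property: the function name and the tuple it returns are hypothetical, and building the real PACER URL from the chosen value is left out.

```python
from typing import Optional, Tuple


def pick_pacer_identifier(
    pacer_case_id: Optional[str], docket_number: Optional[str]
) -> Optional[Tuple[str, str]]:
    """Choose which value a PACER link should be built from.

    Hypothetical helper: prefers pacer_case_id and falls back to
    docket_number, as proposed in the comment above.
    """
    if pacer_case_id:
        return ("pacer_case_id", pacer_case_id)
    if docket_number:
        return ("docket_number", docket_number)
    # Neither value is available, so no link can be built.
    return None


preferred = pick_pacer_identifier("4534786", "17-2011")
fallback = pick_pacer_identifier("", "17-2011")
```

Empty strings and None are treated the same way, so a docket whose pacer_case_id has been blanked out would still get a link from its docket number.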

mlissner commented 1 year ago

@albertisfu we talked about a lot of things today, but remind me of the plan we had here. Is this right:

1. Any appellate case that has a dash in the pacer_case_id field gets it removed, assuming it's not merged with opinion clusters.
2. You're going to look for duplicates that we have in our system?

Was there more to this aside from Eduardo fixing the URL?

albertisfu commented 1 year ago

> Any appellate case that has a dash in the pacer_case_id field gets it removed, assuming it's not merged with opinion clusters.

Yeah, that's correct. Additionally, I'll add a validation to prevent a docket_number from being stored in the pacer_case_id field, so strings containing a - are not allowed. I think this validation should be at the RECAP Upload and RECAP Fetch serializers, right?

> You're going to look for duplicates that we have in our system?

Yeah, I'll do a query that helps us to find duplicates for this problem so we can remove them before cleaning the pacer_case_id field.

> Was there more to this aside from Eduardo fixing the URL?

Just the things above. I'll be back here once they're done, so we can find the duplicates, remove them, and clean the field.

mlissner commented 1 year ago

> I think this validation should be at the RECAP Upload and RECAP Fetch serializers, right

Yeah, we could put it at a lower level (the save or something?), but I don't think it's needed. I doubt we'd make the mistake again, so if we just prevent old RECAP API clients from making the mistake going forward, we should be good.
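For reference, the check being discussed is small. Below is a minimal sketch written as a plain function, assuming the rule is simply "reject any value containing a dash". The function name and error message are hypothetical, and the real validation would live in the RECAP Upload and RECAP Fetch serializers rather than in a standalone helper.

```python
def validate_pacer_case_id(value: str) -> str:
    """Reject values that look like a docket number (e.g. 17-2011).

    Hypothetical helper: the only rule taken from the thread is that
    strings containing a "-" are not allowed in pacer_case_id.
    """
    if "-" in value:
        raise ValueError(
            "pacer_case_id must not contain a dash; "
            "it looks like a docket number."
        )
    return value


accepted = validate_pacer_case_id("4534786")
```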

albertisfu commented 1 year ago

I'll still post the query here to look for duplicates so we can remove them before cleaning the pacer_case_id
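The query itself does not appear in this thread. As a rough illustration of the idea only, duplicates can be found by grouping dockets on court and docket number and keeping the groups with more than one member. The sketch below uses plain dictionaries instead of the Docket model, and the field names and sample rows are assumptions made for the example.

```python
from collections import defaultdict


def find_duplicate_groups(dockets):
    """Group docket ids by (court_id, docket_number); keep groups of 2+."""
    groups = defaultdict(list)
    for docket in dockets:
        key = (docket["court_id"], docket["docket_number"])
        groups[key].append(docket["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Made-up sample rows, not real CourtListener data.
sample = [
    {"id": 1, "court_id": "ca4", "docket_number": "15-1001"},
    {"id": 2, "court_id": "ca4", "docket_number": "15-1001"},
    {"id": 3, "court_id": "ca7", "docket_number": "17-2011"},
]
duplicates = find_duplicate_groups(sample)
```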

albertisfu commented 1 year ago

Ok, after discussing the duplicated dockets caused by this issue with @mlissner, we agreed that it's a bit risky to just remove the duplicates, because we could lose content that is unique to some of them. For example, there are cases where the older docket doesn't have docket entries and the newer one does (or the other way around), and cases where two dockets have the same docket entries but only one of them has Parties and Attorneys, and a third one is blank, e.g.:

https://www.courtlistener.com/docket/19085/leilonni-davis-v-tmc-restaurant-of-charlotte/ https://www.courtlistener.com/docket/23348/parties/leilonni-davis-v-tmc-restaurant-of-charlotte/ https://www.courtlistener.com/docket/59890561/leilonni-davis-v-tmc-restaurant-of-charlotte/

So, we concluded this could be fixed, but it'll require much more work to analyze and figure out the right way to merge docket content before removing duplicates.

So for now we can just clean the pacer_case_id on dockets that contain a docket_number in this field.

from cl.search.models import Court, Docket

# update() returns the number of rows changed, not a queryset.
updated_count = Docket.objects.filter(
    court__jurisdiction=Court.FEDERAL_APPELLATE,
    pacer_case_id__contains="-",
).update(pacer_case_id="")
print("Updated dockets:", updated_count)

Once the pacer_case_id is cleaned, these dockets can get a good PACER URL based on their docket_number, and they can be updated by RECAP via their docket_number_core and get the right pacer_case_id (if they don't have duplicates).
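To make the matching argument concrete, here is a simplified sketch of a two-step lookup: first by pacer_case_id, then by docket_number_core among dockets with no pacer_case_id. It is not the real find_docket function. The helper name, the dictionary fields, and the sample values (including the docket_number_core string) are all assumptions for illustration.

```python
from typing import List, Optional


def match_docket(
    existing: List[dict],
    court_id: str,
    pacer_case_id: str,
    docket_number_core: str,
) -> Optional[int]:
    """Return the index of the matching docket, or None if nothing matches."""
    # Step 1: exact match on the PACER case ID.
    for i, d in enumerate(existing):
        if d["court_id"] == court_id and d["pacer_case_id"] == pacer_case_id:
            return i
    # Step 2: fall back to docket_number_core, but only for dockets
    # that do not already claim a pacer_case_id.
    for i, d in enumerate(existing):
        if (
            d["court_id"] == court_id
            and d["docket_number_core"] == docket_number_core
            and not d["pacer_case_id"]
        ):
            return i
    return None


# A docket number stored in pacer_case_id blocks both steps.
before = [{"court_id": "ca7", "pacer_case_id": "17-2011", "docket_number_core": "1702011"}]
# After cleaning, the fallback step can find the docket.
after = [{"court_id": "ca7", "pacer_case_id": "", "docket_number_core": "1702011"}]

match_before = match_docket(before, "ca7", "40123", "1702011")
match_after = match_docket(after, "ca7", "40123", "1702011")
```

Under these assumptions, the uncleaned docket is never matched, which is how a second docket ends up being created, while the cleaned docket is matched and can then receive the correct pacer_case_id.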

Let me know if we need to check something more here.

albertisfu commented 1 year ago

@mlissner before cleaning the pacer_case_id, I'll submit a new PR that modifies a validation in the docket save method, which requires that the pacer_case_id not be null when the source is RECAP, so that an error is not raised when increasing the docket view counter.

I'll let you know when it's ready, so it can be merged before cleaning the pacer_case_id.

Thanks, @ERosendo, for the hint about this possible issue.

albertisfu commented 1 year ago

Ok, the PR that solves the issue described in the comment above is ready: #2516. Once it's merged, the pacer_case_id field can be cleaned on appellate dockets that contain a docket_number in it.

mlissner commented 1 year ago

Done. I ran the code above. Things are getting cleaner around here! It updated 25,266 items.