freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

`cl.recap.mergers.find_docket_object` sometimes matches dockets when it shouldn't #4256

Open grossir opened 1 month ago

grossir commented 1 month ago

An example:

Docket 68295573 already has a case_name Van Camp v. Van Camp, different than new value State v. Snyder

The docket in CourtListener that would have been overwritten has docket number "1 CA-CV 23-0297-FC" and case name Van Camp v. Van Camp

The docket number for the Snyder case is "1 CA-CR 23-0297", which is a different case

So, the "docket_number_core" with value "230297" matches, but it shouldn't

This is a single example for Arizona, but on Sentry there are more records.
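
To make the collision concrete, here is a minimal sketch. `naive_core` is a hypothetical, simplified stand-in for the core-number logic (it is not the actual code in `cl.recap.mergers`), and it assumes the core is built only from the two-digit year and the four-digit serial:

```python
import re


def naive_core(docket_number: str) -> str:
    """Hypothetical simplified core: two-digit year + four-digit serial only."""
    match = re.search(r"(\d{2})-(\d{4})", docket_number)
    return match.group(1) + match.group(2) if match else ""


civil = "1 CA-CV 23-0297-FC"  # Van Camp v. Van Camp
criminal = "1 CA-CR 23-0297"  # State v. Snyder

# Both docket numbers reduce to the same core, so a lookup keyed only on
# the core cannot tell the civil case from the criminal one.
assert naive_core(civil) == naive_core(criminal) == "230297"
```

The "CV" / "CR" part is what distinguishes the two cases, and it is exactly the part that this kind of reduction throws away.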

There is another example where the mismatch doesn't have a straightforward solution:

There are some cases when it is a correct match, but the case name or other data point is slightly different: ca3


The offending logic is in this function

https://github.com/freelawproject/courtlistener/blob/723b7ec84101b18fa2f0aa0dcb7ef7788dc74361/cl/recap/mergers.py#L84-L169


Sentry Issue: COURTLISTENER-7XG

Docket 68295573 already has a case_name Van Camp v. Van Camp, different than new value State v. Snyder

mlissner commented 1 month ago

So:

grossir commented 2 weeks ago

I downloaded the details for 1236 events / logs from Sentry, which map to 486 unique dockets. Then I manually inspected each court group and found two kinds of problems: some events are indeed mixing up dockets, while others match the correct docket but bring updated values that may not be better than the old ones.
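
For reference, a rough sketch of how events like these can be collapsed into unique dockets. It assumes every event message has the same shape as the Sentry message quoted in this issue; the regex and the `count_events_per_docket` helper are illustrative, not the script actually used:

```python
import re
from collections import Counter

# Assumed message shape, taken from the single example quoted in this issue.
MESSAGE = re.compile(
    r"^Docket (?P<docket_id>\d+) already has a case_name (?P<old>.+?), "
    r"different than new value (?P<new>.+)$"
)


def count_events_per_docket(messages):
    """Return a Counter mapping docket id -> number of matching events."""
    counts = Counter()
    for message in messages:
        match = MESSAGE.match(message.strip())
        if match:
            counts[int(match["docket_id"])] += 1
    return counts


events = [
    "Docket 68295573 already has a case_name Van Camp v. Van Camp, "
    "different than new value State v. Snyder",
    "Docket 68295573 already has a case_name Van Camp v. Van Camp, "
    "different than new value State v. Snyder",
]
counts = count_events_per_docket(events)
print(len(events), "events ->", len(counts), "unique dockets")
```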

Incorrect docket match

The fladistctapp and az errors come from how docket_number_core is computed: it ignores the parts of the docket number that signal differences between districts or process types

In the case of ohio, the docket number is exactly the same, so the scraper should be fixed to return a more detailed docket number

| Court domain | Dockets with error logs | Reason | Example |
|---|---|---|---|
| 1dca.flcourts.gov | 67 | Mismatch across districts | '5D2023-0888' and '2D2023-0888' |
| 4dca.flcourts.gov | 44 | | |
| 5dca.flcourts.gov | 37 | | |
| 2dca.flcourts.gov | 36 | | |
| 6dca.flcourts.gov | 25 | | |
| 3dca.flcourts.gov | 24 | | |
| www.supremecourt.ohio.gov | 20 | Mismatch across counties | Docket number is the same: '22CA15' (Doc 1, Doc 2) |
| www.azcourts.gov | 11 | Mixing up Criminal and Civil docket numbers | '1 CA-CR 23-0297' and '1 CA-CV 23-0297-FC' are matched |
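
One possible direction, sketched under heavy assumptions: build the matching key so that it keeps the district or case-type qualifier instead of dropping it. The two patterns below are inferred only from the Florida and Arizona examples in the table, and `qualified_key` is a hypothetical helper, not an existing CourtListener function:

```python
import re

# Hypothetical patterns, inferred only from the examples in the table.
FL_DCA = re.compile(r"^(?P<district>\dD)(?P<year>\d{4})-(?P<serial>\d+)$")
AZ_COA = re.compile(
    r"^(?P<division>\d) CA-(?P<kind>[A-Z]+) (?P<year>\d{2})-(?P<serial>\d+)"
)


def qualified_key(docket_number: str) -> tuple[str, ...]:
    """Return a key that keeps the district / case-type qualifier."""
    cleaned = " ".join(docket_number.upper().split())
    match = FL_DCA.match(cleaned)
    if match:
        return (match["district"], match["year"], match["serial"])
    match = AZ_COA.match(cleaned)
    if match:
        qualifier = f"{match['division']} CA-{match['kind']}"
        return (qualifier, match["year"], match["serial"])
    # Unknown format: fall back to the whole normalized docket number.
    return (cleaned,)


# The Florida and Arizona pairs from the table no longer collide.
assert qualified_key("5D2023-0888") != qualified_key("2D2023-0888")
assert qualified_key("1 CA-CR 23-0297") != qualified_key("1 CA-CV 23-0297-FC")
```

This does nothing for the ohio rows: there the docket number itself is identical ('22CA15'), so the extra information has to come from the scraper.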

There are also some one-off mix ups. This one is due to a merger with harvard and lawbox

https://www.courtlistener.com/api/rest/v3/dockets/1558364/
Original:  In Re Pauley
New     :  Travis Norwood v. Jonathan Frame, Superintendent, Mount Olive Correctional Complex and Jail

This one was caused by a typo on the scraped web page, where they put a 22 instead of a 20

https://www.courtlistener.com/api/rest/v3/dockets/67836231/
Original:  1417 Belmont Community Dev., LLC v. District of Columbia
New     :  Lynch v. Ghaida

After fixing docket matching, we should find a way to separate the clusters that were mixed by this error. Hopefully, it is limited to the courts in the table above

Correct match, updated information

Assuming the matching problem is solved, we could decide whether to update the case name based on the length of the names. Newer case names are sometimes shorter and sometimes longer, and I think longer case names carry more information because they include fuller party names
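
A minimal sketch of that length rule, assuming the docket match itself is already correct. `pick_case_name` is a hypothetical helper, not existing code:

```python
def pick_case_name(original: str, new: str) -> str:
    """Keep whichever name is longer; on a tie, keep the original."""
    old_clean = original.strip()
    new_clean = new.strip()
    return new_clean if len(new_clean) > len(old_clean) else old_clean


# A shorter new value does not overwrite a fuller original.
assert pick_case_name("Kevin Kulak v. Itshak On", "Kulak v. Itshak On") == (
    "Kevin Kulak v. Itshak On"
)
# A fuller new value replaces a shorter original.
assert pick_case_name(
    "Kalispell v. Diablo Investments",
    "City of Kalispell v. Diablo Investments",
) == "City of Kalispell v. Diablo Investments"
```

Length alone is a blunt signal: it would also accept a new value that is longer only because it has procedural notes appended to it, so it probably needs to be combined with better case name parsing.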

Examples of updates where the names are worse

https://www.courtlistener.com/api/rest/v3/dockets/68730521/
Original:  Kevin Kulak v. Itshak On
New     :  Kulak v. Itshak On

=========================
https://www.courtlistener.com/api/rest/v3/dockets/2615014/
Original:  State of Delaware v. Hobbs.
New     :  State v. Amir Fatir f/k/a Sterling Hobbs

=========================
https://www.courtlistener.com/api/rest/v3/dockets/66774469/
Original:  Sunil M. Malkani v. Gemma Cunningham
New     :  Malkani v. Cunningham

Examples of updates where the names are better:

https://www.courtlistener.com/api/rest/v3/dockets/68437417/
Original:  Overwell Harvest, Limited v. Trading Technologies Internati
New     :  Overwell Harvest, Limited v. Trading Technologies International, Inc.
=========================

https://www.courtlistener.com/api/rest/v3/dockets/68454533/
Original:  Kalispell v. Diablo Investments
New     :  City of Kalispell v. Diablo Investments

=========================
https://www.courtlistener.com/api/rest/v3/dockets/68941229/
Original:  Matter of M.N., YINC
New     :  Matter of M.N. and M.N., Youths in Need of Care.

Related, we could improve the case name parsing for these courts:

https://www.courtlistener.com/api/rest/v3/dockets/68561913/
Original:  Ex parte The Housing Authority of the City of Talladega. PETITION FOR WRIT OF CERTIORARI TO THE COURT OF CIVIL APPEALS (In re: Harold Wallace v. The Housing Authority of the City of Talladega) (Talladega Circuit Court: CV-18-900509 Civil Appeals: 2210486).
New     :  Ex parte Housing Authority of the City of Talladega. PETITION FOR WRIT OF CERTIORARI TO THE COURT OF CIVIL APPEALS (In re: Harold Wallace v. The Housing Authority of the City of Talladega) (Talladega Circuit Court: CV-18-900509 Court of Civil Appeals: 2210486).
=========================

https://www.courtlistener.com/api/rest/v3/dockets/68538816/
Original:  Ex parte Morgan Stanford and Matthew Hogue. PETITION FOR WRIT OF MANDAMUS: CIVIL (In re: Morgan Stanford and Matthew Hogue v. HCP Properties, LLC)(Jefferson Circuit Court: 22-901106).
New     :  Ex parte Morgan Stanford and Matthew Hogue. PETITION FOR WRIT OF MANDAMUS (In re: Morgan Stanford and Matthew Hogue v. HCP Properties, LLC)(Jefferson Circuit Court: 22-901106).
=========================
https://www.courtlistener.com/api/rest/v3/dockets/68206850/
Original:  State v. Yuen
New     :  State v. Yuen. Dissenting Opinion by Recktenwald, C.J., in Which Ginoza, J., Joins. ICA Order of Correction, filed 09/26/2023 [ada]. ICA s.d.o., filed 09/22/2023 [ada]. Application for Writ of Certiorari, filed 12/18/2023. S.Ct. Order Accepting Application for Writ of Certiorari, filed 01/30/2024 [ada].

Case names end with a code

https://www.courtlistener.com/api/rest/v3/dockets/68979773/
Original:  Riversiders Against Increased Taxes v. City of Riverside CA4/2
=========================

https://www.courtlistener.com/api/rest/v3/dockets/68975608/
Original:  Holguin Family Ventures v. County of Ventura CA2/6
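
A rough sketch of stripping that trailing code. The pattern is inferred from the two examples above only ("CA4/2", "CA2/6"); treating the "/N" part as optional is an assumption, and `strip_trailing_code` is a hypothetical helper:

```python
import re

# Assumed shape of the trailing code: "CA", digits, then an optional "/digits".
TRAILING_CODE = re.compile(r"\s+CA\d+(?:/\d+)?$")


def strip_trailing_code(case_name: str) -> str:
    """Remove a trailing code such as 'CA4/2' from the end of a case name."""
    return TRAILING_CODE.sub("", case_name)


assert (
    strip_trailing_code("Holguin Family Ventures v. County of Ventura CA2/6")
    == "Holguin Family Ventures v. County of Ventura"
)
# Names without a trailing code are returned unchanged.
assert strip_trailing_code("State v. Yuen") == "State v. Yuen"
```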

mlissner commented 2 weeks ago

Super helpful analysis. I don't know the solution half as well as you do, but one thing I'll note is that the shorter case names actually tend to be the better ones. This is essentially the difference between case_name and case_name_full:

There's also case_name_short, of course, which is usually just the first party: Lissner.