freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Dirty `search_citation` data #3979

Open grossir opened 5 months ago

grossir commented 5 months ago

I have identified 2 types of dirty citation data:

  1. duplicated citations that match duplicated opinions
  2. corrupt citations: the same citations for hundreds of different opinions

Duplicated citations that match opinion duplications

For example, the neutral citation 2013 IL 110810 maps to 2 opinion clusters (1, 2). The opinion text seems to be mostly the same, but the two were obtained at different times. One appears to come from the official reporter; the other was probably scraped at (pre)publication time.

I would expect these "dirty" citations to match 2 or 3 opinion clusters at most.

Citations that match hundreds of opinions

I downloaded the latest citation file from the bulk data directory, citations-2024-03-11.csv.bz2, which has the columns "id", "volume", "reporter", "page", "type", "cluster_id", matching the DB model.

If we had no duplicated citations, a DISTINCT over "volume", "reporter", "page", "type" would return the same number of rows as the whole table. That query returns 7 524 430 rows, against 9 987 094 rows in the whole dataset.

I ran a GROUP BY with COUNT over those fields and got 231 665 citations that match more than 3 opinion clusters. Some match hundreds.
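For anyone who wants to reproduce the counts without loading the file into a database, here is a minimal sketch of the same DISTINCT / GROUP BY check in plain Python. It assumes the bulk CSV has a header row with the column names listed above and parses cleanly with the standard `csv` module, which I have not verified. The function names (`read_bulk_csv`, `count_citation_keys`, `keys_over`) are just illustrative and do not exist in CourtListener.

```python
import bz2
import csv
from collections import Counter

# Columns that together identify a citation (assumed to match the CSV header).
KEY_COLUMNS = ("volume", "reporter", "page", "type")


def read_bulk_csv(path):
    """Yield one dict per row from a bz2-compressed CSV with a header row."""
    with bz2.open(path, mode="rt", newline="", encoding="utf-8") as fh:
        yield from csv.DictReader(fh)


def count_citation_keys(rows):
    """Return (total row count, Counter of rows per citation key).

    len(counter) is the equivalent of SELECT DISTINCT over KEY_COLUMNS;
    the counter values are the equivalent of GROUP BY ... COUNT(*).
    """
    counts = Counter()
    total = 0
    for row in rows:
        counts[tuple(row[col] for col in KEY_COLUMNS)] += 1
        total += 1
    return total, counts


def keys_over(counts, threshold=3):
    """List (key, row count) pairs with more than `threshold` rows, largest first."""
    return [(key, n) for key, n in counts.most_common() if n > threshold]


# Example usage (path is the bulk file named above, downloaded locally):
# total, counts = count_citation_keys(read_bulk_csv("citations-2024-03-11.csv.bz2"))
# print(total, len(counts), len(keys_over(counts)))
# print(counts.most_common(10))
```

Note that this counts rows per key, not distinct `cluster_id` values, so a key that appears twice for the same cluster would still be counted twice.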

Top ten looks like this

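| reporter | volume | page | type | row count |
| --- | --- | --- | --- | --- |
| Ill. Dec. | 307 | 312 | 2 | 400 |
| Ohio | 2018 | 365 | 2 | 193 |
| Ohio | 2018 | 1600 | 2 | 187 |
| U.S.L.W. | 82 | 3182 | 4 | 168 |
| U.S.L.W. | 82 | 3183 | 4 | 168 |
| U.S.L.W. | 82 | 3186 | 4 | 168 |
| U.S.L.W. | 82 | 3187 | 4 | 168 |
| U.S.L.W. | 82 | 3188 | 4 | 168 |
| U.S.L.W. | 82 | 3329 | 4 | 168 |
| U.S.L.W. | 82 | 3406 | 4 | 168 |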

Looking at the second one on CourtListener shows that all the results have the same date, and most of the ones I have looked at have the vlex banner.

Looking at the first one on Courtlistener

About U.S.L.W.: 51 of the top 100 citations by count are from that reporter and from volume 82 (an example). I am also seeing a lot of vlex banners for this one. Maybe a data ingestion or merging issue?

mlissner commented 5 months ago

Thanks Gianfranco. One thing to note is that citations are not unique. Because they refer to the page that something is published on, it's entirely possible for multiple decisions to be published on the same page. That said, more than 10-20 on a single page makes no sense.

@flooie can you make a plan for digging into these and seeing what we can learn and how to prioritize this against our other projects?

L4rryFisherman commented 6 days ago

@flooie Could you share more about the underlying reasons for a single (volume, reporter, page) combination having multiple cluster_ids?

So far I understand that it may be the case when:

  1. Multiple opinions are published on the same page. E.g., an opinion of a few sentences is preceded by another opinion on the same page.
  2. The reporter makes an n-th publication to the same volume and page. E.g., errata are published to the same page as the opinion.
  3. Unintentional duplicates. E.g., records 1 and 2 for (2022 WY 137) seem equivalent.

Additionally, is there any guidance you can give on how to determine which cluster_id to pick for cases (2.) and (3.)?