Open grossir opened 5 months ago
Thanks Gianfranco. One thing to note is that citations are not unique. Because they refer to the page that something is published on, it's entirely possible for multiple decisions to be published on the same page. That said, more than 10-20 on a single page makes no sense.
@flooie can you make a plan for digging into these and seeing what we can learn and how to prioritize this against our other projects?
@flooie Could you share more about the underlying reasons for a single (volume, reporter, page) combination having multiple cluster IDs?
So far I understand that it may be the case when:
(2022 WY 137)
1 and 2 seem equivalent.
Additionally, is there any guidance you can give on how to determine which `cluster_id` to pick for cases (2.) and (3.)?
I have identified 2 types of dirty citation data:
Duplicated citations that match opinion duplications
For example, the neutral citation `2013 IL 110810` maps to 2 opinion clusters: 1, 2. The opinion text seems to be mostly the same, but they were obtained at different times. One appears to come from the official reporter; the other was probably scraped at (pre)publication time.
I would expect these "dirty" citations to match 2 or 3 opinion clusters at most.
Citations that match hundreds of opinions
I downloaded the latest citation file from the bulk data directory, `citations-2024-03-11.csv.bz2`, which has these columns: `"id", "volume", "reporter", "page", "type", "cluster_id"`, matching the DB model.
If we had no duplicated citations, a DISTINCT over `"volume", "reporter", "page", "type"` would return the same number of rows as the whole table has. This query returns 7 524 430 rows, against the 9 987 094 rows in the whole dataset.
I ran a GROUP BY, COUNT over those fields and got 231 665 citations that match more than 3 opinion clusters. Some match hundreds.
Top ten looks like this
Looking at the second one on Courtlistener shows that all the results have the same date... And most I have looked at have the `vlex` banner.
Looking at the first one on Courtlistener
About USLW, 51 of the top 100 citations by count are from that reporter AND from volume 82. An example. Also seeing a lot of `vlex` banners for this one. Maybe a data ingestion / merging issue?
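For anyone who wants to reproduce the DISTINCT and GROUP BY counts described above without loading the file into a database, here is a minimal Python sketch. It assumes the bulk file is a bz2-compressed CSV with a header row containing the column names listed above. The function names (`count_rows_per_citation`, `summarize`, `read_bulk_file`) are hypothetical, not part of any CourtListener code. It counts rows per `(volume, reporter, page, type)` key, which mirrors the GROUP BY, COUNT rather than counting distinct `cluster_id` values.

```python
import bz2
import csv
from collections import Counter

# Fields that together identify one citation (assumed to match the CSV header).
KEY_FIELDS = ("volume", "reporter", "page", "type")


def count_rows_per_citation(rows):
    """Count how many rows share each (volume, reporter, page, type) key."""
    return Counter(tuple(row[field] for field in KEY_FIELDS) for row in rows)


def summarize(rows, threshold=3, top_n=10):
    """Return total rows, distinct keys, keys above the threshold, and the top keys."""
    counts = count_rows_per_citation(rows)
    return {
        "total_rows": sum(counts.values()),
        "distinct_citations": len(counts),
        "over_threshold": sum(1 for n in counts.values() if n > threshold),
        "top": counts.most_common(top_n),
    }


def read_bulk_file(path):
    """Yield one dict per row from a bz2-compressed CSV with a header row."""
    with bz2.open(path, mode="rt", newline="", encoding="utf-8") as handle:
        yield from csv.DictReader(handle)
```

Something like `summarize(read_bulk_file("citations-2024-03-11.csv.bz2"))` should then give the row total, the distinct count, and the number of citations with more than 3 matching rows in one pass. The whole key table is held in memory, which may or may not be acceptable for a file of this size.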