freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Docket/OpinionCluster Source Mixup #2717

Open flooie opened 1 year ago

flooie commented 1 year ago

While preparing a model change in a small PR, I went to update only the required sources in the Opinion Cluster source options.

To do that, I pulled all the distinct sources for the opinions that are to be merged. I ran the following code to extract the files I needed to update and then queried the distinct values.

In [1]: from cl.search.models import Docket, Opinion, OpinionCluster

In [2]: sources_without_harvard = [
   ...:             source[0]
   ...:             for source in Docket.SOURCE_CHOICES
   ...:             if "Harvard" not in source[1]
   ...:         ]

In [3]: cluster_ids = OpinionCluster.objects.filter(
   ...:             docket__source__in=sources_without_harvard,
   ...:             filepath_json_harvard__isnull=False,
   ...:         ).values_list("source", flat=True).distinct("source")

In [5]: print(cluster_ids)
<ClusterCitationQuerySet ['64', 'C', 'CR', 'D', 'L', 'LC', 'LCR', 'LR', 'M', 'R', 'Z']>

They all kind of make sense, except 64, of which there are 23,165. The first was imported in 2014, and the source on the admin page says it was merged from resource.org. Not sure what is going on here.
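
A minimal sketch of how a stray value like 64 could be flagged automatically: count the source values and keep only those that are not declared codes. The set of valid codes below is an assumption, copied from the query output above rather than from cl.search.models, and both `find_unexpected_sources` and the sample rows are hypothetical.

```python
from collections import Counter

# Assumed subset of valid cluster source codes, copied from the query
# output above; the authoritative list lives in cl.search.models.
VALID_CLUSTER_SOURCES = {"C", "CR", "D", "L", "LC", "LCR", "LR", "M", "R", "Z"}


def find_unexpected_sources(sources, valid=VALID_CLUSTER_SOURCES):
    """Return a Counter of source values that are not declared codes."""
    counts = Counter(sources)
    return Counter({s: n for s, n in counts.items() if s not in valid})


# Illustrative rows only, not real data.
sample = ["C", "64", "LR", "64", "Z"]
unexpected = find_unexpected_sources(sample)
print(unexpected)
```

Against the real table, the input would presumably come from something like `values_list("source", flat=True)` on the queryset in the snippet above.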

mlissner commented 1 year ago

I'm sorry, I really don't understand what you're trying to do, or what you did, nor if there's a problem. What's 64? What's 23165? Are you doing an update of items from one value to another?

flooie commented 1 year ago

I don't know what 64 is. It looks like it's from an old import, where anything went?

mlissner commented 1 year ago

Sorry, I was struggling to make sense of this yesterday, but it seems obvious to me now. My bad. So there are 23k items with the source of 64. Can you share a few?

flooie commented 1 year ago

4808157: O'Shea v. Commissioner
4808163: Leger v. Commissioner

It now looks obvious that the docket source values were used in the cluster source field.

mlissner commented 1 year ago

I think so, yep.

flooie commented 1 year ago

I assume we shouldn't use Q for the Anonymous dataset

ANON_2020 = "I"
(ANON_2020, "2020 anonymous database"),

mlissner commented 1 year ago

No, probably not. What source do the dockets on these show?

flooie commented 1 year ago

ANON_2020 = 64
(ANON_2020, "2020 anonymous database"),

flooie commented 1 year ago

All the letters that would make sense are taken.

mlissner commented 1 year ago

What source do the dockets on these show?

flooie commented 1 year ago

...
ANON_2020 = 64
(ANON_2020, "2020 anonymous database"),

I was proposing the new opinion cluster source above, both the name and the letter.

mlissner commented 1 year ago

Oh, got it, the dockets have the same thing as the clusters, 64. Yeah, Q seems like the right letter for it. Funny and memorable. Why not.

flooie commented 1 year ago

@mlissner, well... I think it's the right letter too, but it was also a joke because QAnon is a conspiracy theory.

mlissner commented 1 year ago

That wasn't lost on me. :D

flooie commented 1 year ago

Just wanted to make sure.

flooie commented 1 year ago

This is fixed.

ERosendo commented 4 months ago

@mlissner @flooie While I was importing some caselaw records, I found that there are still 23,161 records with source 64.

Fortunately, I found a PR https://github.com/freelawproject/courtlistener/pull/2727 that addresses this issue by adding a new source. To clean these records, we can simply run the following code:

from cl.search.models import OpinionCluster, SOURCES
OpinionCluster.objects.filter(source='64').update(source=SOURCES.ANON_2020)

ERosendo commented 4 months ago

we can simply run the following code:

I take this back.

I noticed during my review of PR #2727 that there are two sources for the anon import:

ANON_2020 = "Q"
ANON_2020_M_HARVARD = "QU"

Since ANON_2020 appears more generic, @flooie would it be safe to assume it's the appropriate replacement for the current value "64"?

mlissner commented 4 months ago

I'll answer for Bill. Yes. If the value is 64, we can replace it with Q.

mlissner commented 4 months ago

Just ran the code:

In [15]: from cl.search.models import OpinionCluster, SOURCES

In [16]: OpinionCluster.objects.filter(source='64').update(source=SOURCES.ANON_2020)
Out[16]: 23161

In [17]: OpinionCluster.objects.filter(source='64').count()
Out[17]: 0

@flooie or @quevon24, would this thing lingering around for the past year impact any of our importers?
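
For future cleanups of this kind, here is a minimal sketch of the same count, update, and verify pattern wrapped in a single transaction, so that a mismatch rolls everything back. It uses the standard-library sqlite3 module as a stand-in for the Django ORM. The `cluster` table, the `remap_source` helper, and the sample rows are all hypothetical; only the values "64" and "Q" come from this thread.

```python
import sqlite3

OLD_SOURCE = "64"  # docket-style value that ended up in the cluster field
NEW_SOURCE = "Q"   # ANON_2020, per the discussion above


def remap_source(conn, old, new):
    """Remap `old` to `new` in one transaction and return the rows changed."""
    with conn:  # commits on success, rolls back if an exception is raised
        expected = conn.execute(
            "SELECT COUNT(*) FROM cluster WHERE source = ?", (old,)
        ).fetchone()[0]
        changed = conn.execute(
            "UPDATE cluster SET source = ? WHERE source = ?", (new, old)
        ).rowcount
        if changed != expected:
            raise RuntimeError(f"expected {expected} rows, changed {changed}")
    return changed


# Hypothetical in-memory table with illustrative rows, not real data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cluster (id INTEGER PRIMARY KEY, source TEXT)")
conn.executemany(
    "INSERT INTO cluster (source) VALUES (?)", [("64",), ("C",), ("64",)]
)
conn.commit()

changed = remap_source(conn, OLD_SOURCE, NEW_SOURCE)
remaining = conn.execute(
    "SELECT COUNT(*) FROM cluster WHERE source = ?", (OLD_SOURCE,)
).fetchone()[0]
print(changed, remaining)
```

In Django, the equivalent guard would presumably sit inside `transaction.atomic()`, with the return value of `update()` compared against a prior `count()`.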

mlissner commented 1 month ago

@flooie, can you provide an update here, please?

mlissner commented 6 days ago

@flooie can you provide an update here, please?