freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Date Discrepancy in an Opinion Cluster Public Resource Opinion #4645

Open flooie opened 15 hours ago

flooie commented 15 hours ago

Date Discrepancy in an Opinion Cluster (maybe more)

Description:

A user reported that certain dates in our opinion clusters appear to be off by a day. Upon investigation, this discrepancy is confirmed. All three original data sources appear to have the correct date - so this is a curious error.

Investigation Notes:

  1. Reviewing the pghistory snapshots for the opinion cluster ID 197095 shows a history of changes, but none that clearly indicate a date modification affecting the filing date.
  2. Data Source Consistency: The discrepancy predates our pghistory tracking and the import of data from the Harvard dataset, suggesting the error may originate in the initial data source.

The fact that it is one day off suggests to me an error in some import or merge, likely around timezones, although I'm not sure how that would actually happen.
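For illustration only, here is a minimal sketch of one way a timezone conversion could move a date by a day. Nothing here is taken from the CourtListener import code; the `round_trip` helper, the fixed Pacific offset, and the sample date are all assumptions made up for the example. It shows a date-only value being promoted to midnight UTC, converted to a US timezone, and truncated back to a date.

```python
from datetime import date, datetime, time, timedelta, timezone

# Fixed UTC-8 offset standing in for US Pacific time (assumption; ignores DST).
PACIFIC = timezone(timedelta(hours=-8))


def round_trip(d: date, tz: timezone) -> date:
    """Hypothetical helper: treat a date as midnight UTC, view it in tz, keep only the date."""
    midnight_utc = datetime.combine(d, time.min, tzinfo=timezone.utc)
    return midnight_utc.astimezone(tz).date()


original = date(1998, 3, 4)              # made-up filing date
shifted = round_trip(original, PACIFIC)  # midnight UTC is still the previous evening in UTC-8
unchanged = round_trip(original, timezone.utc)

print(original, shifted, unchanged)
```

If something like this happened during an import or merge, every affected date would be off in the same direction, which is one thing an audit could look for.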

Actual Changes found:

The following changes were recorded in pghistory, though they do not align directly with the date discrepancy:

•   Judges: Changes in judge order.
•   Attorneys: Updates in attorney information.
•   Citation Count: Increment in citation count.
•   Date Modified: Updates in modification timestamps.

No specific alterations to filing dates appear in the tracked snapshots.
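The check described above can be sketched roughly as follows. This does not use the pghistory API; it assumes the snapshots have already been exported as a list of plain dicts, oldest first, and the field names (`date_filed`, `judges`, `citation_count`) and values are hypothetical stand-ins.

```python
from typing import Any


def changed_fields(snapshots: list[dict[str, Any]]) -> list[set[str]]:
    """For each consecutive pair of snapshots, return the set of field names that differ."""
    diffs = []
    for before, after in zip(snapshots, snapshots[1:]):
        keys = before.keys() | after.keys()
        diffs.append({k for k in keys if before.get(k) != after.get(k)})
    return diffs


# Hypothetical snapshot rows, oldest first (not real data from cluster 197095).
snapshots = [
    {"date_filed": "1998-03-04", "judges": "A, B", "citation_count": 3},
    {"date_filed": "1998-03-04", "judges": "B, A", "citation_count": 3},
    {"date_filed": "1998-03-04", "judges": "B, A", "citation_count": 4},
]

diffs = changed_fields(snapshots)
date_ever_changed = any("date_filed" in d for d in diffs)
print(diffs, date_ever_changed)
```

A result where `date_ever_changed` is false would match what we see here: the tracked history never touches the filing date, so the wrong value must already have been present when tracking began.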

Next Steps:

  1. Fix this case and do an audit using the original source materials. I think this started as a Public Resource cluster, so I'm going to start by reviewing all of the Public Resource opinions.

flooie commented 15 hours ago

@grossir @quevon24

If either of you has suggestions on this I would be happy to hear them.

flooie commented 15 hours ago

So far, after a cursory look at a couple hundred similar opinions from Public Resource, I am not seeing other opinions like this one.

I think we need more examples to better understand/unravel what happened here.

mlissner commented 14 hours ago

Importing the PRO content was difficult and I had some tricky solutions:

  1. Some didn't have dates and would only say something like "Spring 1854," so I had to estimate the date for these (we have a field in the DB called something like date_is_estimated).

    I don't think this factored into the case we're looking at.

  2. I used date parser to pull dates out of cases. It works quite well, but can sometimes find something that looks like a date, like a docket number, say, and interpret that as a date.

    I checked the case we're looking at and the date it has doesn't look like it could have come from a bad parse, so this theory doesn't make sense either.

  3. Some cases couldn't be parsed for dates with the skills I had at the time, so I had a script I ran for months in my spare time. It would pop up a case in my browser and would allow me to input the date (or choose from several it found). I did this for about 100k cases, I think. It took months, but, well, it got the job done?

    This could have been what happened here, but I'm doubtful of this too because the date looks pretty easy to parse from the text.
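To make the second failure mode concrete, here is a small sketch of how a naive date matcher can latch onto a docket number. It is not the parser used for the original import; the regex, the `first_numeric_date` helper, and the sample header text are all invented for the example.

```python
import re
from datetime import date
from typing import Optional

# Naive pattern: anything shaped like M-D-YYYY or M/D/YYYY counts as a date.
NUMERIC_DATE = re.compile(r"\b(\d{1,2})[-/](\d{1,2})[-/](\d{4})\b")


def first_numeric_date(text: str) -> Optional[date]:
    """Return the first substring that can be read as a month-day-year date."""
    for match in NUMERIC_DATE.finditer(text):
        month, day, year = map(int, match.groups())
        try:
            return date(year, month, day)
        except ValueError:
            continue  # e.g. month 13; keep scanning
    return None


# Made-up header: the docket number appears before the real decision date.
header = "No. 10-12-1853. Decided 4/3/1854."
found = first_numeric_date(header)
print(found)
```

Here the docket number is read as October 12, 1853 and the actual decision date is never reached. An error of this kind would usually be off by much more than one day, which fits the conclusion that it is not what happened in this case.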

Is any of this helpful? Probably not, but it's history worth sharing, I think.

What to do now? An audit makes sense to me. We're skilled enough to make a very simple parser of the first 500 characters of the HTML, for example, and see if the date found in it lines up. There are probably another dozen ways to check this, so I'll duck out. But, yeah, let's get on this.
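A rough sketch of that kind of audit check might look like the following. It is only one possible approach: the function names, the regex, and the sample HTML are assumptions, and it only recognizes dates written like "March 4, 1998". Real opinions would need more formats.

```python
import re
from datetime import date
from typing import Optional

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
MONTH_DATE = re.compile(r"\b(" + "|".join(MONTHS) + r")\s+(\d{1,2}),\s+(\d{4})\b")
TAG = re.compile(r"<[^>]+>")


def date_in_header(html: str, limit: int = 500) -> Optional[date]:
    """Find the first 'Month D, YYYY' date in the first `limit` characters of the HTML."""
    text = TAG.sub(" ", html[:limit])  # crude tag stripping; a tag cut off at the limit is left as-is
    match = MONTH_DATE.search(text)
    if match is None:
        return None
    month = MONTHS.index(match.group(1)) + 1
    try:
        return date(int(match.group(3)), month, int(match.group(2)))
    except ValueError:
        return None  # e.g. "February 31, 1998"


def is_suspect(html: str, stored: date) -> bool:
    """Flag a record when the header contains a date that disagrees with the stored one."""
    found = date_in_header(html)
    return found is not None and found != stored


# Made-up example, not the text of the opinion under discussion.
sample_html = "<p>UNITED STATES v. DOE</p><p>Decided March 4, 1998.</p>"
print(is_suspect(sample_html, date(1998, 3, 5)))
print(is_suspect(sample_html, date(1998, 3, 4)))
```

Records with no recognizable date in the header are not flagged, so this would undercount; it is meant to surface more examples, not to be a complete audit.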

mlissner commented 14 hours ago

One other note: This case was imported in 2011, so it's one of the oldest we have, a fact you can see from its ID being lower than 200k.

flooie commented 13 hours ago

@mlissner - I don't mean to impugn your import. It is a difficult data set, and this could be something else, but that is my best guess.

Also, the date created is Oct. 30, 2015, 2:57 p.m. Any idea why it is 2015? Was there a new import or some big database switch?

mlissner commented 13 hours ago

The opinion was 2011, the cluster was 2014, and the docket was 2015. Guessing these correspond to the creation of those objects, but I honestly don't recall.