freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Merge Harvard Opinions #1080

Closed flooie closed 1 year ago

flooie commented 4 years ago

We need code to merge Harvard case law data into CourtListener, including changes to the DB model.

flooie commented 1 year ago

As good a time as any to make this a larger opinion.

We need to merge a bunch of data from Harvard into the already stellar data of CourtListener.
This means we need to identify the issues we are going to face.

After reviewing a number of issues, I'm going to spotlight the major ones (by quantity, not difficulty).
Most of these are going to be easy to overcome.

quevon24 commented 1 year ago

I have some comments regarding the docket numbers.

Currently the docket number is stored exactly as it is obtained from the JSON (https://github.com/freelawproject/courtlistener/blob/main/cl/corpus_importer/management/commands/harvard_opinions.py#L463). No pre-processing is done.

Some docket numbers were omitted. For example, we have Slip Op. 18-165; Court No. 17-00031 as docket numbers, but only 17-00031 is stored in CourtListener; the other is missing.

Some docket numbers seem to contain citations, like this: 2 Div. 736, 3 Div. 474, 8 Div. 308. You can see this here https://api.case.law/v1/cases/3590929/ or here https://api.case.law/v1/cases/3589844/

There are many incorrect docket numbers. For example, in CourtListener we have a case with this docket number: 16SC916, Thompson, but the correct docket no. is: 16SC916 (https://api.case.law/v1/cases/12571899/). We also have a case with docket number: 18-1573P, but the correct docket no. is: 18-1573 (https://api.case.law/v1/cases/12519449/).

Some cases have multiple docket numbers, like Nos. 2017-1715; 2017-1716 or Nos. 13AP-133; 13AP-134, and these were normalized in CourtListener as 13AP-133 and 13AP-134 and 2017-1715 and 2017-1716. Others have this format: SJC 12440 & 12563, using an & to join the docket numbers. Are both ways OK, or should we use only one symbol as a separator? Note: I couldn't find how you did this, because the docket number is stored exactly as it is obtained in the Harvard importer.

I already identified many docket number variations:

With the variations identified, we could use a simple regex to strip them and keep only the docket number. Where multiple docket numbers are separated by semicolons, we could split them, strip the variations from each, and join the docket numbers again.
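A minimal sketch of that split, strip, and rejoin idea, in Python. The prefix pattern covers only labels that appear in examples quoted in this thread (No., Nos., Court No., Bankruptcy No., Adversary No./Nos.); the function name `clean_docket_numbers` and the `joiner` parameter are hypothetical, and the separator to standardize on is still an open question.

```python
import re

# Assumed label prefixes, taken from examples in this thread only.
# A real cleanup would need the full list of identified variations.
PREFIX_RE = re.compile(
    r"^(?:Bankruptcy|Adversary|Court)?\s*Nos?\.\s*",
    re.IGNORECASE,
)


def clean_docket_numbers(raw: str, joiner: str = "; ") -> str:
    """Split on semicolons, strip known label prefixes, and rejoin."""
    parts = [part.strip() for part in raw.split(";")]
    cleaned = [PREFIX_RE.sub("", part) for part in parts if part]
    return joiner.join(item for item in cleaned if item)


print(clean_docket_numbers("Nos. 13AP-133; 13AP-134", joiner=" and "))
print(clean_docket_numbers("Bankruptcy No. 10-00235; Adversary No. 11-90032"))
```

Anything the pattern does not recognize is passed through unchanged, so unknown variations stay visible rather than being silently mangled.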

flooie commented 1 year ago

We wrote a script to compare data from CL to Harvard. The bankruptcy court gave us the following gems.

The CL data is the first tuple item and the Harvard data is the second. The script compared the data and only showed examples where data was present in both data sources but different.

/storage/harvard_corpus/law.free.cap.br.454/133.4181942.json
----
Cluster: 2196558 Harvard_id: 4181942
judges           ('Robert J. Faris', 'Faris')
case_name                ('In Re Maui Indus. Loan & Finance Co.', 'Field v. Trust Estate of Rose Kepoikai (In re Maui Industrial Loan & Finance Co.)')
docket_number            ('19-00179', 'Bankruptcy No. 10-00235; Adversary Nos. 10-90126, 10-90130, 10-90131, 10-90137')

/storage/harvard_corpus/law.free.cap.br.463/499.3672479.json
----
Cluster: 2195441 Harvard_id: 3672479
judges           ('Robert J. Faris', 'Faris')
case_name                ('In Re Maui Indus. Loan & Finance Co.', 'Field v. Levin (In re Maui Industrial Loan & Finance Co.)')
docket_number            ('19-00180', 'Bankruptcy No. 10-00235; Adversary No. 11-90032')

In both examples, the judge information is fuller and better in CL, but the docket numbers don't appear correct.

Now I checked the first one and the data source is RECAP and SCRAPER, so I'm not sure how to argue with the docket number generated here.

This will be a challenge. Perhaps we can try to confirm the docket numbers against the original source material, the content of which seems to validate the Harvard data.
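For reference, the comparison described in this comment can be sketched roughly as follows. This is not the actual script: the function name `diff_shared_fields` and the plain-dict inputs are assumptions for illustration, and the field names are the ones shown in the output above.

```python
def diff_shared_fields(cl_data: dict, harvard_data: dict, fields: list) -> dict:
    """Return {field: (cl_value, harvard_value)} for fields that are
    populated in both sources but hold different values."""
    differences = {}
    for field in fields:
        cl_value = cl_data.get(field)
        harvard_value = harvard_data.get(field)
        # Skip fields that are missing or empty on either side.
        if cl_value and harvard_value and cl_value != harvard_value:
            differences[field] = (cl_value, harvard_value)
    return differences


cl_cluster = {"judges": "Robert J. Faris", "docket_number": "19-00180"}
harvard_case = {
    "judges": "Faris",
    "docket_number": "Bankruptcy No. 10-00235; Adversary No. 11-90032",
}
for name, pair in diff_shared_fields(
    cl_cluster, harvard_case, ["judges", "case_name", "docket_number"]
).items():
    print(name, pair)
```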

quevon24 commented 1 year ago

For the judges, I saw that many of them are not stored in the clusters, or the information in the cluster is incomplete. We can save them directly as they are obtained, but also normalize them within the import process so they can be added to panel.

For the attorneys, I found that in many cases the field is empty but we have the data in the JSON file. Also, do we expect to extract only the names, or just store the text? In many cases the attorneys come with a larger text like:

Christopher J. Velez, of Law Office of Christopher J. Velez, of Garden City, was on the briefs for appellant., Tamara S. Hicks, assistant county attorney, Susan Lynn Hillier Richmeier, county attorney, and Derek Schmidt, attorney general, were on the brief for appellee.

mlissner commented 1 year ago

I think if we start populating docket_number_core, that should help with a lot of the docket number cleanup, but it's still a really lousy field. Just in the federal district courts, the same docket number can be written in a bunch of ways.

This docket number consists of five parts:

The only parts that matter are the 19 and the 99999. The rest of the parts can be omitted, so these (and likely more) variations are all valid:

The docket_number_core field reduces these variations to just the core part of the docket number: 1999999. It works for district court dockets, but is blank for all other courts.
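A rough sketch of that reduction, based only on the description above (keep the two-digit year and the serial number, drop everything else). The regex, the zero-padding of the serial to five digits, and the function name are assumptions for illustration, not the CourtListener implementation, and the sample inputs are made up to match the 19 / 99999 example.

```python
import re

# Hypothetical pattern: optional office digit and colon, two-digit year,
# a case-type code made of letters, then a serial of up to five digits.
DISTRICT_DOCKET_RE = re.compile(r"(?:\d:)?(\d{2})-?[A-Za-z]{1,4}-?(\d{1,5})")


def docket_number_core_sketch(docket_number: str) -> str:
    """Return year + serial for a district-court-style docket number,
    or an empty string when the input does not fit the pattern."""
    match = DISTRICT_DOCKET_RE.search(docket_number)
    if match is None:
        return ""
    year, serial = match.groups()
    # Assumption: pad the serial to five digits so cores compare equal.
    return year + serial.zfill(5)


for sample in ("2:19-cv-99999-ABC", "19-cv-99999", "19cv99999", "17-00031"):
    print(repr(sample), "->", repr(docket_number_core_sketch(sample)))
```

The sketch has no notion of which court a docket belongs to, so it would only make sense to call it for district court dockets, leaving the field blank elsewhere as described above.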

flooie commented 1 year ago

New issue to investigate:

In at least one case, the opinion we have is split into the opinion and an addendum opinion.

https://www.courtlistener.com/opinion/3246772/go/

The Harvard data makes no such distinction, although it does have a tag for the addendum:

 <p id=\"ARX\">

It also has a separate <author> tag for the addendum, which may be the way to identify them.

I'm not sure how CourtListener would handle the situation if we were to add a new combined opinion to the mix. Would it take over?

https://ia803100.us.archive.org/33/items/law.free.cap.ala-app.19/380.8825727.json

Further research is needed.
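As a starting point for that research, here is a hedged sketch of how the two signals mentioned above (the ARX paragraph id and a second <author> tag) could be checked with Python's standard `html.parser`. The class and function names are hypothetical, the sample markup is invented, and it assumes the markup has already been decoded out of the JSON so the quotes are no longer escaped.

```python
from html.parser import HTMLParser


class AddendumProbe(HTMLParser):
    """Counts <author> tags and notes whether any tag has id="ARX"."""

    def __init__(self):
        super().__init__()
        self.author_count = 0
        self.has_arx = False

    def handle_starttag(self, tag, attrs):
        if tag == "author":
            self.author_count += 1
        if dict(attrs).get("id") == "ARX":
            self.has_arx = True


def looks_like_opinion_plus_addendum(markup: str) -> bool:
    """Heuristic only: an ARX-tagged element plus more than one author."""
    probe = AddendumProbe()
    probe.feed(markup)
    probe.close()
    return probe.has_arx and probe.author_count > 1


sample = (
    '<opinion><author>SMITH, Judge.</author><p id="p1">Main text.</p>'
    '<p id="ARX">Addendum text.</p><author>PER CURIAM.</author></opinion>'
)
print(looks_like_opinion_plus_addendum(sample))
```

Whether both signals are always present together is exactly what would need checking against more cases.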