freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Handle malformed, empty opinions with Harvard importer #2311

Closed quevon24 closed 2 years ago

quevon24 commented 2 years ago

After inspecting the latest Harvard importer logs, I found that many cases apparently don’t have an opinion: 45,494 cases with empty opinions, to be exact. Here is the complete list of those files:

no_opinion_errors.csv

These cases can be divided into two categories:

  1. Files that don’t have an opinion in any tag within the case body.
  2. Files that likely do have an opinion, but in some other tag within the case body.

For the first category, I can confirm that there are indeed cases with no opinion in any tag within the case body, so the question is: what should we do in this situation? At the moment the Harvard importer simply skips these cases.

Here is an example from https://cite.case.law/ga-app/129/755/:

<?xml version='1.0' encoding='utf-8'?>
<casebody firstpage="755" lastpage="755"
    xmlns="http://nrs.harvard.edu/urn-3:HLS.Libr.US_Case_Law.Schema.Case_Body:v1">
    <docketnumber id="b787-10">48546.</docketnumber>
    <parties id="Asmt">Deloach v. MAURER.</parties>
</casebody>

As you can see in the example above, there is no opinion; the case body is almost empty.
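
A minimal sketch of how a case body could be checked for an opinion, assuming Python's standard `xml.etree.ElementTree` and the casebody layout shown above. `classify_casebody` and `local_name` are hypothetical helper names for illustration, not part of the Harvard importer:

```python
import xml.etree.ElementTree as ET

# Namespace copied from the example casebody above.
NS = "http://nrs.harvard.edu/urn-3:HLS.Libr.US_Case_Law.Schema.Case_Body:v1"


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def classify_casebody(xml_text):
    """Return (category, tags_with_text) for one casebody document.

    category is 'has_opinion' when an <opinion> element containing text
    exists, otherwise 'no_opinion_tag'. The tag list shows which elements
    carry text, i.e. where an opinion could be hiding.
    """
    root = ET.fromstring(xml_text.encode("utf-8"))
    tags_with_text = []
    has_opinion = False
    for el in root.iter():
        if el is root:
            continue
        if not "".join(el.itertext()).strip():
            continue  # element has no text at all
        name = local_name(el.tag)
        tags_with_text.append(name)
        if name == "opinion":
            has_opinion = True
    category = "has_opinion" if has_opinion else "no_opinion_tag"
    return category, tags_with_text


# The Georgia example from above, rebuilt as a string.
SAMPLE = (
    "<?xml version='1.0' encoding='utf-8'?>"
    f'<casebody firstpage="755" lastpage="755" xmlns="{NS}">'
    '<docketnumber id="b787-10">48546.</docketnumber>'
    '<parties id="Asmt">Deloach v. MAURER.</parties>'
    "</casebody>"
)

print(classify_casebody(SAMPLE))
```

This only separates "has an opinion tag" from "does not"; deciding whether one of the listed tags actually contains the ruling still needs a second pass.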

The second category is more complex. An opinion can be found, but the problem is that it is not in the opinion tag; it can appear in several different tags, for example in these tags:

The problem here is that so far I couldn’t find any pattern for locating the opinion easily, because it is so spread out across the above tags.

I also ran some tests on a small sample to look for a pattern based on reporter and volume, thinking there might be a relationship between those and the tag where the opinion appears, but so far I couldn’t find anything.

Here are some examples of opinions within other tags:

In https://api.case.law/v1/cases/3550137/ <attorneys id="b696-8">W. R. Walker, for petitioner. W. H. Simpson, pro se. Per curiam. Petition dismissed.</attorneys>

In https://api.case.law/v1/cases/8662236/ <summary id="A2h">Affirmed.</summary>

Some of them are mixed, for example in https://api.case.law/v1/cases/106285/ <attorneys id="a0p-dedup-0">Frederick J. Francis, for appellees. Decree affirmed.</attorneys>

In https://api.case.law/v1/cases/1648373/ <disposition id="A1f">Affirmed.</disposition>
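
One way such text might be spotted is a phrase heuristic over the text of each non-opinion tag. The sketch below assumes a hand-picked phrase list derived only from the four examples above; the pattern and the `find_disposition_phrases` name are hypothetical, and a real list would have to come from sampling the data:

```python
import re

# Hypothetical phrase list, based only on the examples above.
DISPOSITION_RE = re.compile(
    r"\b(?:per curiam"
    r"|(?:petition|appeal|writ)\s+(?:dismissed|denied)"
    r"|(?:decree|judgment|order)\s+(?:affirmed|reversed)"
    r"|affirmed|reversed)\b",
    re.IGNORECASE,
)


def find_disposition_phrases(tag_texts):
    """Map each tag name to the ruling-like phrases found in its text.

    tag_texts is a dict of {tag name: text}; tags without a match are
    left out of the result.
    """
    hits = {}
    for tag, text in tag_texts.items():
        phrases = [m.group(0) for m in DISPOSITION_RE.finditer(text)]
        if phrases:
            hits[tag] = phrases
    return hits


example = {
    "attorneys": "W. R. Walker, for petitioner. W. H. Simpson, pro se. "
                 "Per curiam. Petition dismissed.",
    "parties": "Deloach v. MAURER.",
}
print(find_disposition_phrases(example))
```

A heuristic like this will produce false positives (for instance a party name containing "Reversed"), so matches are better treated as candidates for review than as confirmed opinions.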

The clearest solution is to fix all those cases using opinionated fixes. The problem is that this work must be done manually, which could take forever given the number of affected cases, so we can probably rule this out.

I welcome any suggestions on how to tackle this issue.

mlissner commented 2 years ago

How many lack opinion text completely vs have it elsewhere in their XML?

Looks like the example you gave of lacking the opinion body completely should have some text, as you can see here: https://law.justia.com/cases/georgia/court-of-appeals/1993/a93a2556.html

flooie commented 2 years ago

Very few have no opinions. Almost all have it, just elsewhere. I've only seen 2-3 that legitimately don't have an opinion.

mlissner commented 2 years ago

Very few meaning 5 or 500, say?

flooie commented 2 years ago

I would suspect closer to 5. That's my guess based on the data I've seen.

mlissner commented 2 years ago

So sounds like those can get fixed manually, if we can identify them. I'll butt out though, unless y'all want help brainstorming this one.

quevon24 commented 2 years ago

> How many lack opinion text completely vs have it elsewhere in their XML?
>
> Looks like the example you gave of lacking the opinion body completely should have some text, as you can see here: https://law.justia.com/cases/georgia/court-of-appeals/1993/a93a2556.html

I believe the only way to know how many completely lack opinion text is to download every single case from the list and then parse each one. The problem with that is the restriction of 500 cases per day.

As for the example you linked, I think it's a different case from the one I wrote about: mine is "DeLoach v. Maurer, 129 Ga. App. 755" and the one you shared is "Deloach v. Hewes, 211 Ga. App. 321". That said, I think my case actually does have an opinion (for some reason, the citation doesn't match the one in case.law): https://casetext.com/case/deloach-v-maurer?sort=relevance&type=case&tab=keyword&jxs=&resultsNav=false

I'm working on some ideas to figure out where the opinion might be within the case body.

flooie commented 2 years ago

@quevon24 just use the copies in the IA archive.

mlissner commented 2 years ago

> I believe the only way to know how many completely lack opinion text is to download every single case from the list, and the parse each one,

Um, but even if we did that, we still wouldn't know which ones have the opinion text elsewhere in the body, right?

quevon24 commented 2 years ago

> @quevon24 just use the copies in the IA archive.

Forgot we have those, sorry 😅

I'll let you know when I finish counting.

quevon24 commented 2 years ago

> > I believe the only way to know how many completely lack opinion text is to download every single case from the list, and the parse each one,
>
> Um, but even if we did that, we still wouldn't know which ones have the opinion text elsewhere in the body, right?

Right, because it is very inconsistent: you can find the opinion in one tag, but sometimes it's in a completely different tag. Right now I'm trying to find a way to simplify that work.

quevon24 commented 2 years ago

I accept any suggestions 👍

mlissner commented 2 years ago

I don't have any deep insights here, but I think if you started looking for obvious patterns you might sort out a big majority of these. For example, if the tags always start with J. $SOME_NAME presiding, you'd get a tranche of them that way. I guess we need some random sampling? Bill?

flooie commented 2 years ago

Yes. But this doesn't really seem like a hard problem. I think we'll be able to identify maybe... 25 permutations and just adjust for each. Test it, generate the correct XML, and upload it to opinionated.
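
To get a rough count of such permutations, the cases could be grouped by which tags carry text. This is a minimal sketch using only the standard library; `tag_signature` and `count_signatures` are hypothetical names, and the three tiny casebodies are made-up stand-ins for files from the IA archive:

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_signature(xml_text):
    """Return the sorted tuple of top-level tag names that contain text."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    names = {
        child.tag.rsplit("}", 1)[-1]  # drop any '{namespace}' prefix
        for child in root
        if "".join(child.itertext()).strip()
    }
    return tuple(sorted(names))


def count_signatures(casebodies):
    """Count how many casebodies share each tag signature."""
    return Counter(tag_signature(xml_text) for xml_text in casebodies)


# Made-up inputs for illustration only.
cases = [
    "<casebody><parties>A v. B</parties><attorneys>X. Affirmed.</attorneys></casebody>",
    "<casebody><attorneys>Y. Affirmed.</attorneys><parties>C v. D</parties></casebody>",
    "<casebody><parties>E v. F</parties><summary>Affirmed.</summary></casebody>",
]

for signature, count in count_signatures(cases).most_common():
    print(count, signature)
```

Sorting by frequency shows which layouts are worth handling first and which are rare enough to fix by hand.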

quevon24 commented 2 years ago

> I don't have any deep insights here, but I think if you started looking for obvious patterns you might sort out a big majority of these. For example, if the tags always start with J. $SOME_NAME presiding, you'd get a tranche of them that way. I guess we need some random sampling? Bill?

For the moment I'm doing this: I'm extracting the text from cases that have an opinion in the opinion tag, then using nltk to get POS tags for each word of the opinion, and with those I remove unnecessary words like adverbs, pronouns, etc. With this I can build a list of words commonly used in opinions, then take the most common ones and use them to check each tag in the case body of cases with empty opinions.
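
A rough sketch of that idea follows. To keep it self-contained it swaps the nltk POS filter for a tiny hard-coded stopword set, which is an assumption made for illustration and not what is actually being run; `build_vocabulary` and `score_tag` are hypothetical names:

```python
import re
from collections import Counter

# Stand-in for POS-based filtering: a small, hypothetical stopword set.
STOPWORDS = {"the", "of", "is", "and", "for", "a", "an", "to", "in", "it", "that"}
TOKEN_RE = re.compile(r"[a-z]+")


def content_words(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [w for w in TOKEN_RE.findall(text.lower()) if w not in STOPWORDS]


def build_vocabulary(opinion_texts, top_n=50):
    """Return the top_n most frequent content words across known opinions."""
    counts = Counter()
    for text in opinion_texts:
        counts.update(content_words(text))
    return {word for word, _ in counts.most_common(top_n)}


def score_tag(text, vocabulary):
    """Fraction of a tag's content words that appear in the vocabulary."""
    words = content_words(text)
    if not words:
        return 0.0
    return sum(w in vocabulary for w in words) / len(words)


# Made-up opinion texts standing in for cases that do have an opinion tag.
vocab = build_vocabulary([
    "The judgment of the court is affirmed.",
    "Judgment reversed and the petition is dismissed.",
])
print(score_tag("Decree affirmed.", vocab))
print(score_tag("W. R. Walker, for petitioner.", vocab))
```

A tag whose score is well above that of typical attorney or party text would be a candidate for holding the opinion; the threshold would need tuning against a labeled sample.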

quevon24 commented 2 years ago

> I don't have any deep insights here, but I think if you started looking for obvious patterns you might sort out a big majority of these. For example, if the tags always start with J. $SOME_NAME presiding, you'd get a tranche of them that way. I guess we need some random sampling? Bill?

I think this is a good solution for some specific cases. I found opinions with short texts like: