freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

RECAP into Opinions #3790

Open flooie opened 9 months ago

flooie commented 9 months ago

I'm working on automatically adding RECAP document opinions as Opinions in the Case Law database, and it dawns on me that we need a new Opinion type.

Our current set basically all refers to multiple authors, whereas a ruling from a district court judge is always going to be unanimous, combined, and lead at once. Instead, I think we should have Single Judge Opinions.

@mlissner what do you think?

class Opinion(AbstractDateTimeModel):
    COMBINED = "010combined"
    UNANIMOUS = "015unamimous"
    LEAD = "020lead"
    PLURALITY = "025plurality"
    CONCURRENCE = "030concurrence"
    CONCUR_IN_PART = "035concurrenceinpart"
    DISSENT = "040dissent"
    ADDENDUM = "050addendum"
    REMITTUR = "060remittitur"
    REHEARING = "070rehearing"
    ON_THE_MERITS = "080onthemerits"
    ON_MOTION_TO_STRIKE = "090onmotiontostrike"

mlissner commented 9 months ago

I might be missing it, but I think your code sample is what's in the system now, right? Can you show me the design you would propose instead?

flooie commented 9 months ago

I was just pasting it for reference. I think we should add some version of a "single author" or "trial opinion" type.
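A rough sketch of what such an addition could look like, written as plain Python so it runs without Django. The name `SINGLE_JUDGE`, its value `"100singlejudge"`, the `CHOICES` tuple, and the human-readable labels are all hypothetical; only the first three constants are copied from the class pasted earlier in this thread.

```python
class OpinionTypes:
    # Existing values, copied from the class pasted earlier (abbreviated).
    COMBINED = "010combined"
    UNANIMOUS = "015unamimous"
    LEAD = "020lead"
    # Hypothetical new value for an opinion written by one trial-court judge.
    SINGLE_JUDGE = "100singlejudge"

    # Illustrative (value, label) pairs in the shape Django "choices" expect.
    CHOICES = (
        (COMBINED, "Combined Opinion"),
        (UNANIMOUS, "Unanimous Opinion"),
        (LEAD, "Lead Opinion"),
        (SINGLE_JUDGE, "Single Judge Opinion"),
    )


def is_valid_type(value: str) -> bool:
    """Return True if the value is one of the known opinion types."""
    return value in dict(OpinionTypes.CHOICES)


print(is_valid_type(OpinionTypes.SINGLE_JUDGE))
```

Because the existing values start with a numeric prefix, whatever value is chosen also decides where the new type falls when the values are sorted as strings.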

mlissner commented 9 months ago

Sure, makes sense, though I suspect we'll have about 10M items that should have that value once it's available.

I'll also add that once in a while you do see panels at district courts, but it's rare.

flooie commented 9 months ago

Do you have any examples?

mlissner commented 9 months ago

At first I thought I didn't, but it turns out we have an open bug for this! https://github.com/freelawproject/courtlistener/issues/1293

flooie commented 9 months ago

Thanks

flooie commented 8 months ago

In researching how to smartly bring RECAP "Opinions" into the Case Law database, I thought this would be relatively simple: find the free-on-PACER documents and add them as OpinionClusters on the docket.

Here are the main challenges.

What is an Opinion?

The easy ones are labeled as Opinions or Opinion / Memorandums. It gets trickier when it's an Order and Memorandum, or when the string is long and contains a brief description before identifying the type of document.
After reviewing Lexis, I found that they supply a much broader range of documents, or documents with more verbose titles. Long Order names with no reference to memorandums often contain "opinions", as best I can tell. In these cases I would lean on a mix of length and content to identify them properly.

This comes with its own challenges, for example, avoiding attachments with proposed orders or memorandums in support of the plaintiff/defendant's position.
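A minimal sketch of that kind of description-based filter. Everything here is an assumption for illustration: the function name `looks_like_opinion`, the keyword lists, the use of page count as the "length" signal, and the three-page threshold. It is not the logic in the actual import command.

```python
import re

# Hypothetical keyword lists; a real filter would need tuning per court.
OPINION_WORDS = re.compile(r"\b(opinion|memorandum)\b", re.IGNORECASE)
EXCLUDE = re.compile(
    r"\bproposed\b|\bin (support|opposition)\b|\battachment\b|\bexhibit\b",
    re.IGNORECASE,
)


def looks_like_opinion(description: str, page_count: int, min_pages: int = 3) -> bool:
    """Guess whether a document description points at an opinion."""
    # Party filings and proposed orders are rejected first.
    if EXCLUDE.search(description):
        return False
    # Explicitly labeled opinions and memorandums are the easy cases.
    if OPINION_WORDS.search(description):
        return True
    # Otherwise accept only orders long enough to plausibly contain analysis.
    return "order" in description.lower() and page_count >= min_pages


print(looks_like_opinion("OPINION AND ORDER", 12))
print(looks_like_opinion("[PROPOSED] ORDER granting motion", 2))
```

Checking the exclusions before the opinion keywords matters: a "Memorandum in Support of Motion to Dismiss" contains the word "memorandum" but is a party filing.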

Corruptions

The second challenge is corrupt PDFs or mixed PDFs. Corrupt PDFs are simply corrupt; see various Social Security cases.

More frustrating are the mixed PDFs, where only the PACER page stamp on each page is extracted while the rest of the document remains un-extracted.
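One way to flag the stamp-only case might look like the sketch below. The stamp pattern is an assumption based on the usual "Case ... Document ... Filed ... Page N of M" header line and will not match every court's format; `is_stamp_only` and the leftover-character threshold are hypothetical.

```python
import re
from typing import List

# Assumed shape of the PACER header stamp; formats vary between courts.
STAMP = re.compile(
    r"Case\s+\S+\s+Document\s+\S+\s+Filed\s+\d{2}/\d{2}/\d{2,4}"
    r"\s+Page\s+\d+\s+of\s+\d+(\s+Page\s*ID\s*#:\s*\d+)?",
    re.IGNORECASE,
)


def is_stamp_only(page_texts: List[str], max_leftover_chars: int = 20) -> bool:
    """Return True if every page has almost no text besides the stamp."""
    if not page_texts:
        return False
    for text in page_texts:
        # Remove the stamp and measure what is left on the page.
        leftover = STAMP.sub("", text).strip()
        if len(leftover) > max_leftover_chars:
            return False
    return True


pages = [
    "Case 2:11-cv-00001-ABC Document 45 Filed 09/20/11 Page 1 of 2 Page ID #:100",
    "Case 2:11-cv-00001-ABC Document 45 Filed 09/20/11 Page 2 of 2 Page ID #:101",
]
print(is_stamp_only(pages))
```

Documents flagged this way would presumably need to be re-extracted (for example with OCR) before they could be judged as opinions at all.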

There is also the challenge of appended documents or attachments, which I fear would pollute the case law search tool.

Other challenges include

  1. Opinions from the Court of Appeals posted on dockets
  2. Opinions from Supreme Courts (specifically Arkansas)
  3. Transferred cases with opinions from other courts
  4. Opinions with multiple authors (district court panels)

Merging - oh dear not merging

Please don't be merging. Except yes, I have found RECAP entries that map to cases already in our database.

Model Changes

We need to add a RECAP/PACER source for Opinion Cluster. We have one for dockets but not for opinion clusters, and if we start merging in from RECAP, I think the source should be recorded as RECAP and not as the generic court website.

I also wonder, though I'm not sure, whether we should change the PRECEDENTIAL_STATUS choices to include ORDER or MEMORANDUM, and how we could accurately identify the difference.

Wrap up

The main challenge remains identifying what is an opinion. I think that, with relatively low risk, we should be able to identify what is and isn't an opinion, at least in the CACD.

I've been tweaking an ever more complex chain of if/else code to narrow down what is and isn't an opinion, but it remains a work in progress.

flooie commented 8 months ago

The Case Law Conundrum: Matching the Big Guys

We're facing a challenge in expanding our case law database to match the comprehensiveness of established platforms like LexisNexis and Westlaw. Users likely expect our case law collection to be comparable, and including what they categorize as such seems reasonable.

However, a closer look at the discrepancy between what Lexis labels as case law (our primary source for this analysis) and our own database reveals a significant difference.

The Numbers Gap: Millions of Missing Opinions?

There's a striking gap in the number of listed federal court opinions. Our database, along with Harvard's, shows roughly 700,000 opinions, while Lexis boasts a staggering 4.4 million. This roughly sixfold difference emerged around 2004 and likely stems from Lexis including PACER documents (as indicated by the chart).

Timeline

Zooming in on a specific court, Lexis lists over half a million opinions from California district courts, with 197,151 belonging solely to the Central District of California (CACD). Our database, on the other hand, reflects only 4,936 opinions from CACD. Interestingly, we do capture a broader picture by including 161,383 cases and 411,894 documents from those cases. If we can effectively identify the opinions within this collection, we can significantly close the gap with Lexis, potentially even surpassing them in some instances.
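To put the figures above side by side, here is a small arithmetic sketch using only the numbers quoted in this comment. The variable names are mine, and the last figure assumes (optimistically) that every missing opinion is somewhere in the CACD documents we already hold.

```python
# Figures quoted above.
lexis_federal = 4_400_000
ours_federal = 700_000
lexis_cacd = 197_151
ours_cacd = 4_936
cacd_cases = 161_383
cacd_documents = 411_894

# How many times larger the Lexis federal collection is.
federal_ratio = lexis_federal / ours_federal
# Share of the Lexis CACD opinion count that we currently have.
cacd_coverage = ours_cacd / lexis_cacd
# Average number of documents we hold per CACD case.
docs_per_case = cacd_documents / cacd_cases
# Share of our CACD documents that would need to be opinions to match Lexis.
share_needed = (lexis_cacd - ours_cacd) / cacd_documents

print(f"Federal ratio: {federal_ratio:.1f}x")
print(f"CACD coverage: {cacd_coverage:.1%}")
print(f"Documents per case: {docs_per_case:.2f}")
print(f"Share of documents needed: {share_needed:.1%}")
```

In other words, the federal gap is a bit over sixfold, we hold about 2.5% of the CACD opinions Lexis lists, and a little under half of our CACD documents would have to qualify as opinions to reach the Lexis count.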

Defining the Elusive "Case Law"

The core issue lies in the ambiguous definition of "case law" and how Lexis and Westlaw interpret it. It's clear that not every judicial document qualifies. While free PACER documents and a basic combination of opinions, memorandums, and orders might be part of the mix, it's not the whole picture.

These larger platforms seem to include any judge-written document that discusses legal matters, encompassing even short orders and documents labeled as minutes. However, they appear to exclude more routine trial court documents and orders, such as those issued at the beginning of a trial or other rulings.

An example is CACD case #103, which offers legal analysis but doesn't appear to be an opinion at first glance.


The opposite is true for Valencia v. Cash: the opinion cited in Lexis and Westlaw is much more basic, yet both giants have it, and have it with a citation: Valencia v. Cash, 2011 U.S. Dist. LEXIS 108624, 2011 WL 4403073.

Balancing Comprehensiveness with Accuracy

Our goal is to avoid cluttering the database with irrelevant filings like motions, briefs, or proposed orders. Notably, Lexis itself stores a significant number of these non-case-law documents: roughly 16 million miscellaneous trial court documents, 6 million motions, 5 million pleadings, and 250,000 briefs.

I've explored various methods to distinguish opinions from other documents. These include checking whether a document is linked to a judge or judges, and using the document description, the docket entry details, and the document contents themselves (if well-formatted). Unfortunately, each approach, and even combinations of them, has limitations.

My gut tells me that the best approach would be to train an ML model on all of the types of documents we have, so that it can make good distinctions, and to pair that with some good algorithms to get it right. But I know that is going to give you, @mlissner, heartburn.

mlissner commented 8 months ago

Thanks for the details. Summarizing our call, I think what we concluded is:

So we need to do more thinking about this, and one way to do that is by talking to our clients, advisors, etc, so we'll start that process and see if we can devise a good way forward.

flooie commented 5 months ago

This is a good opportunity to provide an update on this issue. @mlissner

Current Progress

Next Steps

Challenge with Criminal Opinions

One tricky challenge remains with criminal opinions: the case name on the docket does not always reflect what the case name of the opinion should be.

For example, in the case USA v. CADDEN, there are 14 defendants and 15 dockets: 1 parent docket and 14 child dockets. Every entry in the combined cases goes into the parent docket. Most docket entries appear on every docket, and every docket entry appears on the parent. However, there are often rulings or motions that apply specifically to a subset of the defendants.

Take this June 2019 ruling as an example.

This ruling is attributed to only two defendants: Gregory Conigliaro and Sharon Carter. The judge captioned it UNITED STATES OF AMERICA v. GREGORY CONIGLIARO and SHARON CARTER, which is what we would expect to use in the case law database. However, if we attempted to import this opinion, we would label it as USA v. CADDEN.

Lexis lists this document as US v. Conigliaro.

This discrepancy needs to be addressed to ensure accurate labeling in our case law database.

This leaves us with three options:

  1. Use the global case name US v. Cadden (bad I think)
  2. Attempt to suss out the predicted case name by looking at the names of the defendants identified in the document description. (maybe it could work but it seems difficult to get 100% right)
  3. Lastly we could parse out the case caption from the document.

Honestly, option 3 seems like the only logical/doable option to me. The problem only affects cases with multiple defendants, so it's possible to do all the single-defendant cases now, but that would require checking the hidden API before each docket just to be sure, which seems like something we might not want to overuse.
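A rough sketch of option 3, assuming the caption appears on a single line of the extracted text near the top of the first page. Real captions are often split across several lines, so this is only a starting point. The function names, the 40-line window, and the header lines in the sample text are hypothetical; the caption itself is the one quoted above.

```python
import re
from typing import Optional

# Matches a single line of the form "<plaintiff> v. <defendants>".
CAPTION_LINE = re.compile(r"^\s*(?P<plaintiff>.+?)\s+v\.\s+(?P<defendants>.+?)\s*$")


def extract_caption(text: str, max_lines: int = 40) -> Optional[str]:
    """Return the first 'X v. Y' line found near the top of the text."""
    for line in text.splitlines()[:max_lines]:
        match = CAPTION_LINE.match(line)
        if match:
            return f"{match['plaintiff']} v. {match['defendants']}"
    return None


def case_name_for_opinion(text: str, docket_case_name: str) -> str:
    """Prefer the caption in the document; fall back to the docket name."""
    return extract_caption(text) or docket_case_name


# Made-up first-page text; only the caption line comes from the example above.
sample = """UNITED STATES DISTRICT COURT

UNITED STATES OF AMERICA v. GREGORY CONIGLIARO and SHARON CARTER

MEMORANDUM AND ORDER
"""
print(case_name_for_opinion(sample, "USA v. CADDEN"))
```

Falling back to the docket case name keeps single-defendant cases and unparseable documents on the current behavior.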

mlissner commented 5 months ago

Thanks for the update. Have you tried using the PossibleCaseNumberAPI to get the correct case name based on the pacer_case_id, say? I was going to check myself, but your example is in Mass., and they seem to be blocking me/Canada.

flooie commented 5 months ago

Yes @mlissner. Unfortunately, it is not uncommon for a document to be assigned to someone who is not involved in that opinion. More specifically, I've seen examples where a document is attributed to 6 of 13 defendants, but is labeled globally as applying to all defendants.

I have also seen text labeling an opinion specifically to one defendant, but then listing all four in the case caption.

But using the possible case number endpoint would allow us to link documents more or less across defendants.

mlissner commented 5 months ago

Can you share some examples so we can try to distill the problem a bit? Maybe outside of MA, since their server is wonky?

grossir commented 1 month ago

@flooie asked me to check for any needed updates for the recap_into_opinions command before making a full re-run and setting it for periodic runs on new documents.

The only major question I have, which is related to the discussion in this issue, is: should we try to filter somehow for opinions? Right now the command will ingest any RecapDocument linked to the target courts, and we get documents with titles like "opinion", "opinion and order", etc., but also documents that (clearly?) are not opinions, such as plain "Orders", "REPORT AND RECOMMENDATION" and "ORDER TO PAY ATTORNEY FEES".

We use a different standard for other Case Law scrapers, trying to target only opinions and skipping orders and other material, even if that standard is not always met.

flooie commented 1 month ago

@grossir and I talked about this offline and are not overly concerned. These documents qualify as opinions and include discussions of case law, and they are marked as free on PACER and available.