Open flooie opened 9 months ago
I might be missing it, but I think your code sample is what's in the system now, right? Can you show me the design you would propose instead?
I was just pasting it for reference. I think we should have some version of "Single author" or "Trial opinion".
Sure, makes sense, though I suspect we'll have about 10M items that should have that value once it's available.
I'll also add that once in a while you do see panels at district courts, but it's rare.
Do you have any examples?
At first I thought I didn't, but it turns out we have an open bug for this! https://github.com/freelawproject/courtlistener/issues/1293
Thanks
When I started researching how to smartly bring RECAP "Opinions" into the Case Law database, I thought this would be relatively simple: find the documents that are free on PACER and add them to the docket as OpinionClusters.
Here are the main challenges.
The easy ones are labeled as Opinions or Opinion / Memorandums. It gets trickier when it's Order and Memorandum, or when the string is long and contains a brief description before identifying the type of document.
After reviewing Lexis I found them to be supplying a much broader range of documents, or documents with a more verbose title. Long Order names with no reference to memorandums often contain "opinions" as best as I can tell. In these cases I would lean on a mix of length and content to identify them properly.
This comes with its own challenges, for example, avoiding attachments with proposed orders or memorandums in support of the plaintiff/defendant's position.
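A minimal sketch of the description-based heuristic described above, not the code actually in use. The function name `looks_like_opinion`, the keyword patterns, and the 200-character length cutoff are all assumptions made for illustration and would need tuning against real docket entries.

```python
import re

# Hypothetical keyword patterns; not taken from any existing code.
OPINION_TERMS = re.compile(r"\b(opinion|memorandum)\b", re.IGNORECASE)
EXCLUDE_TERMS = re.compile(
    r"\bproposed\b|\bin\s+(support|opposition)\b", re.IGNORECASE
)


def looks_like_opinion(description: str, long_cutoff: int = 200) -> bool:
    """Guess whether a document description refers to an opinion."""
    # Proposed orders and party memorandums are filtered out first.
    if EXCLUDE_TERMS.search(description):
        return False
    # The easy cases: the description names an opinion or memorandum.
    if OPINION_TERMS.search(description):
        return True
    # Fall back on length plus content: long order descriptions
    # are treated as likely opinions (assumed cutoff).
    return "order" in description.lower() and len(description) >= long_cutoff
```

With these assumed patterns, "Order and Memorandum" is accepted, while "[Proposed] Order" and "MEMORANDUM in Support of Motion to Dismiss" are rejected.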
The second challenge is corrupt PDFs or mixed PDFs. Corrupt are corrupt; see various Social Security cases.
Then there are the more frustrating mixed PDFs, where only the PACER page stamp on each page is extracted while the rest of the document remains un-extracted.
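One way to flag those stamp-only extractions is to strip anything that looks like a PACER page stamp and see whether any real text is left. This is a rough sketch: the stamp regex is an assumption based on the common "Case ... Document ... Filed ... Page X of Y" header layout, and `is_stamp_only` and the 50-character threshold are hypothetical.

```python
import re

# Assumed shape of a PACER page stamp; real stamps vary by court.
STAMP = re.compile(
    r"Case\s+\S+\s+Document\s+\S+\s+Filed\s+\S+\s+Page\s+\d+\s+of\s+\d+"
    r"(?:\s+Page\s*ID\s*#:\s*\S+)?",
    re.IGNORECASE,
)


def is_stamp_only(text: str, min_chars: int = 50) -> bool:
    """Return True if almost nothing remains once page stamps are removed."""
    remainder = STAMP.sub("", text)
    # Count non-whitespace characters left after removing the stamps.
    return len("".join(remainder.split())) < min_chars
```

Documents flagged this way could be queued for OCR rather than imported as opinions.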
There is also the challenge of appended documents or attachments, which I fear would pollute the case law search tool.
Please don't be merging. Except yes, I have found RECAP entries that map to cases in our database.
We need to add a RECAP/PACER source for OpinionCluster. We have one for dockets but not for opinion clusters, and if we start merging in from RECAP, I think the source should stand as RECAP and not as the generic court website.
I also wonder, though I'm not sure, whether we should change the PRECEDENTIAL_STATUS's to include ORDER or MEMORANDUM, and how we could accurately identify the difference.
The main challenge remains identifying what is an opinion. I think that, with relatively low risk, we should be able to identify what is and isn't an opinion, at least in the CACD.
I've been tweaking an ever more complex chain of if/else code to narrow down what is and isn't an opinion, but it remains a work in progress.
We're facing a challenge in expanding our case law database to match the comprehensiveness of established platforms like LexisNexis and Westlaw. Users likely expect our case law collection to be comparable, and including what they categorize as such seems reasonable.
However, a closer look at the discrepancy between what Lexis labels as case law (our primary source for this analysis) and our own database reveals a significant difference.
The Numbers Gap: Millions of Missing Opinions?
There's a striking gap in the number of listed federal court opinions. Our database, along with Harvard's, shows roughly 700,000 opinions, while Lexis boasts a staggering 4.4 million. This sixfold difference emerged around 2004 and likely stems from Lexis including PACER documents (indicated by the chart/graph). 
Zooming in on a specific court, Lexis lists over half a million opinions from California District Courts, with 197,151 belonging solely to the Central District Court (CACD). Our database, on the other hand, reflects only 4,936 opinions from CACD. Interestingly, we do capture a broader picture by including 161,383 cases and 411,894 documents from those cases. If we can effectively identify the opinions within this collection, we can significantly close the gap with Lexis, potentially even surpassing them in some instances.
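As a quick sanity check on the figures quoted above, the ratios can be computed directly. All inputs come from the numbers in this comment; the variable names are just for illustration, and the last figure assumes every missing Lexis opinion is somewhere in our RECAP documents, which is not established.

```python
# Counts quoted in the comment above.
lexis_federal = 4_400_000
ours_federal = 700_000
lexis_cacd = 197_151
ours_cacd = 4_936
recap_cacd_docs = 411_894

# How many times larger Lexis's federal collection is than ours.
federal_ratio = lexis_federal / ours_federal

# Share of Lexis's CACD opinions we currently have.
cacd_coverage = ours_cacd / lexis_cacd

# Share of our CACD RECAP documents that would need to be opinions
# to match Lexis's count (assumes the missing ones are all in RECAP).
needed_share = (lexis_cacd - ours_cacd) / recap_cacd_docs

print(f"Federal gap: {federal_ratio:.1f}x")
print(f"CACD coverage: {cacd_coverage:.1%}")
print(f"RECAP docs needed as opinions: {needed_share:.1%}")
```

So the "sixfold" gap is about 6.3x, we hold roughly 2.5% of Lexis's CACD count, and a bit under half of the CACD RECAP documents would have to qualify as opinions to close that gap.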
The core issue lies in the ambiguous definition of "case law" and how Lexis and Westlaw interpret it. It's clear that not every judicial document qualifies. While free PACER documents and a basic combination of opinions, memorandums, and orders might be part of the mix, it's not the whole picture.
These larger platforms seem to include any judge-written document that discusses legal matters, encompassing even short orders and documents labeled as minutes. However, they appear to exclude more routine trial court documents and orders, such as those issued at the beginning of a trial or other rulings.
An example is CACD case #103, which offers legal analysis but doesn't appear to be an opinion at first glance.
The opposite is true for Valencia v. Cash: the opinion cited in Lexis and West is much more basic, but both giants have it, and have it with a citation: Valencia v. Cash, 2011 U.S. Dist. LEXIS 108624, 2011 WL 4403073.
Our goal is to avoid cluttering the database with irrelevant filings like motions, briefs, or proposed orders. Notably, Lexis itself stores a significant number of these non-case-law documents: roughly 16 million miscellaneous trial court documents, 6 million motions, 5 million pleadings, and 250,000 briefs.
I've explored various methods to distinguish opinions from other documents. These include checking whether a document is linked to a judge or judges, and using the document description, the docket entry details, and the document contents themselves (if well-formatted). Unfortunately, each approach, and even each combination, has limitations.
My gut tells me that the best approach would be to train an ML model on all of the types of documents we have, so that it can make good distinctions, and to pair that with some good algorithms to get it right. But I know that is going to give you @mlissner heartburn.
Thanks for the details. Summarizing our call, I think what we concluded is:
An ML approach would be hard, and is probably too hard for us for now. Just the labeling to train the model would be time consuming. It would also face the dreaded "proposed order" problem. Though that's probably solvable, there are probably another 20 such problems that'd cause over- or under-inclusion. Still, this may be the best answer, if we could find the resources to do it.
We could just dump anything that's free in PACER into the case law DB. From some early sampling, this looks both over- and under-inclusive too. Lexis had one document from the case we looked at randomly. The doc was in RECAP, but it wasn't marked as free, so we'd have missed it. OTOH, we do have three docs for the case that Lexis doesn't that are marked as free. So maybe Lexis missed those and we'd miss the one Lexis had. Not ideal for anybody.
So we need to do more thinking about this, and one way to do that is by talking to our clients, advisors, etc, so we'll start that process and see if we can devise a good way forward.
This is a good opportunity to provide an update on this issue. @mlissner
Current Progress
Next Steps
Challenge with Criminal Opinions
One tricky challenge remains with criminal opinions: the case name on the docket does not always reflect what the case name of the opinion should be.
For example, in the case USA v. CADDEN, there are 14 defendants and 15 dockets: 1 parent docket and 14 child dockets. Every entry in the combined cases goes into the parent docket. Most docket entries appear on every docket, and every docket entry appears on the parent. However, there are often rulings or motions that apply only to a subset of the defendants.
Take this June 2019 ruling as an example: June 2019 Ruling
This ruling is attributed to only two defendants: Gregory Conigliaro and Sharon Carter. The judge captioned it UNITED STATES OF AMERICA v. GREGORY CONIGLIARO and SHARON CARTER, which is what we would expect to use in the case law database. However, if we attempted to import this opinion, we would label it as USA v. CADDEN.
Lexis lists this document as US. v. Conigliaro.
This discrepancy needs to be addressed to ensure accurate labeling in our case law database.
This leaves us with three options
Honestly, 3. seems like the only logical/doable option to me. This only affects cases with multiple defendants, so it's possible to do all the single-defendant cases, but that would require checking the hidden API before each docket just to be sure, which seems like something we might not want to overuse.
Thanks for the update. Have you tried using the PossibleCaseNumberAPI to get the correct case name based on the pacer_case_id, say? I was going to check myself, but your example is in Mass., and they seem to be blocking me/Canada.
Yes @mlissner. Unfortunately, it is not uncommon to have a document assigned to someone who is not involved in that opinion. Or more specifically, I've seen examples where a document is attributed to 6 of 13 defendants, but it is labeled globally as applying to all defendants.
I have also seen text labeling an opinion specifically to one defendant, but then listing all four in the case caption.
But using the possible case number endpoint would allow us to link documents more or less across defendants.
Can you share some examples so we can try to distill the problem a bit? Maybe outside of MA, since their server is wonky?
@flooie asked me to check for any needed updates to the recap_into_opinions command before doing a full re-run and setting it up to run periodically on new documents.
The only major question I have, which is related to the discussion in this issue, is: should we try to filter somehow for opinions? Right now the command will ingest any RecapDocument linked to the target courts, so we get documents with titles like "opinion", "opinion and order", etc., but also documents that seem clearly not to be opinions, such as plain "Orders", "REPORT AND RECOMMENDATION" and "ORDER TO PAY ATTORNEY FEES".
We use a different standard for other Case Law scrapers, trying to target only opinions and skipping orders and other documents, even if that standard is not always met.
@grossir and I talked offline about this and are not overly concerned. These qualify as opinions and include discussions of case law, and they are marked as free on PACER and available.
I'm working on automatically adding RECAP document opinions as Opinions in the Case Law database, and it dawns on me that we need a new Opinion Type.
Our current set of types basically all assumes multiple authors, whereas a ruling from a district court judge is always going to be unanimous, combined, and lead at once. Instead, I think we should have Single Judge Opinions.
@mlissner what do you think?