Focus on NY State - Githubissues

flooie commented 3 years ago

The question was posed: how difficult would it be to upgrade our parsers and integrate all our NY state datasets so that we keep a very good or perfect set of New York case law?

For comparison's sake, Lexis reports 1,679,660 opinions, broken down by court:

New York                    - 1,679,660
- N.Y. Ct. of App.            - 290,653
- N.Y. App. Div. 1st Dept.    - 338,831
- N.Y. App. Div. 2nd Dept.    - 288,188
- N.Y. App. Div. 3rd Dept.    - 121,156
- N.Y. App. Div. 4th Dept.    - 135,351
- N.Y. App. Term              - 34,412
- N.Y. Sup. Ct.               - 223,510 
- N.Y. County Ct.             - 7,724
- NYC Civ. Ct.                - 11,837
- NYC Crim. Ct.               - 3,585
- N.Y. Claims Ct.             - 3,386
- N.Y. Fam. Ct.               - 4,145
- N.Y. Sur. Ct.               - 193,483 
- N.Y. Dist. Ct.              - 2,613
- N.Y. City Ct.               - 4,900
- N.Y. Justice Ct.            - 790
- N.Y. Hist. Ct.              - 15,096
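
As a quick sanity check on the figures above, the per-court counts can be summed and compared against the stated total. This is only a small sketch; the dictionary just restates the numbers quoted in this comment, and the function name is made up for illustration.

```python
# Per-court opinion counts, copied from the Lexis breakdown quoted above.
LEXIS_COUNTS = {
    "N.Y. Ct. of App.": 290_653,
    "N.Y. App. Div. 1st Dept.": 338_831,
    "N.Y. App. Div. 2nd Dept.": 288_188,
    "N.Y. App. Div. 3rd Dept.": 121_156,
    "N.Y. App. Div. 4th Dept.": 135_351,
    "N.Y. App. Term": 34_412,
    "N.Y. Sup. Ct.": 223_510,
    "N.Y. County Ct.": 7_724,
    "NYC Civ. Ct.": 11_837,
    "NYC Crim. Ct.": 3_585,
    "N.Y. Claims Ct.": 3_386,
    "N.Y. Fam. Ct.": 4_145,
    "N.Y. Sur. Ct.": 193_483,
    "N.Y. Dist. Ct.": 2_613,
    "N.Y. City Ct.": 4_900,
    "N.Y. Justice Ct.": 790,
    "N.Y. Hist. Ct.": 15_096,
}

# Statewide total as stated in the comment.
STATED_TOTAL = 1_679_660


def total_opinions(counts):
    """Sum the per-court counts."""
    return sum(counts.values())


if __name__ == "__main__":
    total = total_opinions(LEXIS_COUNTS)
    print(f"sum of courts: {total:,}")
    print(f"matches stated total: {total == STATED_TOTAL}")
```

The per-court numbers add up exactly to the stated statewide total, so the breakdown is internally consistent.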

About 1.2 million of these are reported opinions.

Harvard Case.law Reports

Meanwhile, Harvard reports 1,111,794 opinions, with recent reporters ending in these years:

Appellate Division Reports (2002-2018)
New York Reports (2003-2017)
New York Miscellaneous Reports (2002-2017)

The gap in recent years should be bridged by the vast number of cases going back to 2003 on the New York reporter website.

The main questions/todos are: updating the courts (maybe integrating courts-db), updating the scrapers to get the slip opinions and motion decisions, and working out how long it would take to ingest the million new opinions into our system.

To be continued...

mlissner commented 3 years ago

SUPER helpful. I just talked to the folks that want to use us as their data provider. A couple notes:

They've got the last ten years of stuff, so that's good.
But they suggest getting content directly from the reporter is easier and better. They're going to send me info about this.
There's also an FTP site, apparently, that has opinions in HTML form, but it's unclear how good the metadata would be on that.

More soon, I think.

flooie commented 3 years ago

I agree. I think we are getting the content directly from the reporter in some cases, but not always.

flooie commented 3 years ago

NOTES on getting NY up to speed.

Ingesting the Harvard data would make our back catalog of cases essentially 100% complete and comparable to the big guys.

I would start by focusing on the following reporters.

A.D.: Appellate Division Reports (1714-1991)
A.D.2d: Appellate Division Reports (1939-2003)
A.D.3d: Appellate Division Reports (2002-2018)
Misc.: New York Miscellaneous Reports (1808-1955)
Misc.2d: New York Miscellaneous Reports (1878-2003)
Misc.3d: New York Miscellaneous Reports (2002-2017)
N.Y.: New York Reports (1800-1997)
N.Y.2d: New York Reports (1956-2003)
N.Y.3d: New York Reports (2003-2017)
N.Y.S.: West's New York Supplement (1832-1990)
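
To get a feel for what focusing on these reporters means in practice, here is a minimal sketch of recognizing "volume reporter page" citations for the ten abbreviations listed above. The function name and the example citations are made up for illustration, and a plain regex like this is only an assumption about how one might start; a real pipeline would want a proper citation parser.

```python
import re

# Reporter abbreviations from the list above.
REPORTERS = [
    "A.D.", "A.D.2d", "A.D.3d",
    "Misc.", "Misc.2d", "Misc.3d",
    "N.Y.", "N.Y.2d", "N.Y.3d",
    "N.Y.S.",
]

# Try longer abbreviations first so "N.Y.3d" is not cut short as "N.Y.".
_alternation = "|".join(
    re.escape(r) for r in sorted(REPORTERS, key=len, reverse=True)
)

CITE_RE = re.compile(
    rf"\b(?P<volume>\d+)\s+(?P<reporter>{_alternation})\s+(?P<page>\d+)\b"
)


def find_citations(text):
    """Return (volume, reporter, page) tuples found in text."""
    return [
        (int(m.group("volume")), m.group("reporter"), int(m.group("page")))
        for m in CITE_RE.finditer(text)
    ]


if __name__ == "__main__":
    # Made-up citations, for illustration only.
    print(find_citations("See 12 N.Y.3d 345 and 99 A.D.2d 100."))
```

Sorting the alternation by length is the one design choice that matters here: without it, a first-series abbreviation would match the prefix of a later series and the page number would fail to parse.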

We would need to add a number of courts to our NY court catalog, but that would be a minor inconvenience. In the long run, I think we also need to extend the Courts object to include location-specific information about a court (i.e., Supreme Court for XXX County, not just Supreme Court). Generally, adding a couple of courts would be relatively easy.
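
One way to picture the location-specific idea is a court record with an optional county. This is a hypothetical sketch, not the existing Courts model: the class, field names, and the court ID below are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Court:
    """Hypothetical court record with optional location information."""

    court_id: str                 # illustrative identifier, not a real ID
    name: str                     # e.g. "Supreme Court"
    county: Optional[str] = None  # location-specific part, if known

    @property
    def full_name(self) -> str:
        """Name including the county when one is set."""
        if self.county:
            return f"{self.name} for {self.county} County"
        return self.name


if __name__ == "__main__":
    generic = Court(court_id="example-sup-ct", name="Supreme Court")
    located = Court(court_id="example-sup-ct", name="Supreme Court", county="Kings")
    print(generic.full_name)
    print(located.full_name)
```

Keeping the county optional means existing opinions that only know "Supreme Court" stay valid while newer data can carry the more specific location.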

After ingesting the back catalog, we would want to ingest the entire slip opinion database at the Law Reporting Bureau.

The entire back catalog has been downloaded and organized by metadata. The metadata includes (sometimes) judge name, case title, slip opinion cite, official citation, dates, etc. The slip opinions appear to be released on the same day as the opinions are released on, for example, the Court of Appeals website. The only downside is that the slip opinions do not always have the PDF version released by the courts. The plus, though, is that the HTML is nicely formatted for ingestion.

We would want to switch our scraper to the slip opinion website and, I think, do a second pass at a later date to pick up any official citations added to cases after the fact, which I think is something we are sorely missing. In fact, I think we should take our current case collection and use this site to find the official or slip opinion citations for our catalog of cases.
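
The second pass could look roughly like the sketch below: build a lookup from slip opinion cite to official citation, then fill in any cases that are still missing one. Everything here is an assumption for illustration, including the dict-based records, the field names `slip_cite` and `official_cite`, and the made-up citations.

```python
def normalize_slip_cite(cite):
    """Uppercase, drop periods, and collapse whitespace so variants compare equal."""
    return " ".join(cite.upper().replace(".", "").split())


def build_slip_index(slip_records):
    """Map normalized slip cite -> official citation, skipping records without one."""
    index = {}
    for record in slip_records:
        official = record.get("official_cite")
        if official:
            index[normalize_slip_cite(record["slip_cite"])] = official
    return index


def backfill_official_citations(cases, index):
    """Add official citations to cases that lack one; return how many were updated."""
    updated = 0
    for case in cases:
        if case.get("official_cite"):
            continue  # already has an official citation
        official = index.get(normalize_slip_cite(case.get("slip_cite", "")))
        if official:
            case["official_cite"] = official
            updated += 1
    return updated


if __name__ == "__main__":
    # Made-up records, for illustration only.
    slip_records = [
        {"slip_cite": "2019 NY Slip Op 01234", "official_cite": "170 A.D.3d 100"},
        {"slip_cite": "2019 NY Slip Op 05678", "official_cite": None},
    ]
    cases = [
        {"slip_cite": "2019 N.Y. Slip Op. 01234"},
        {"slip_cite": "2019 NY Slip Op 05678"},
    ]
    print(backfill_official_citations(cases, build_slip_index(slip_records)))
    print(cases)
```

Because the function only touches cases without an official citation, it can be rerun on a schedule as new official citations appear.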

Downsides

We would have a slightly more unwieldy court picker, but hopefully that is a temporary problem.
We would need to update opinions currently in our system to fit into the updated and expanded court selections (i.e., appellate cases may need to be slotted into appellate court 1, 2, 3, or 4).
Writing a two-part scraper to match the original PDFs to the slip opinions could occasionally run into merge problems.
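
For the appellate slotting, one possible starting point is to read the department out of whatever court text we already have on an opinion. This is a sketch under the assumption that the text contains a phrase like "2nd Dept." or "Second Department"; the function name is hypothetical, and anything it cannot classify comes back as None for manual review.

```python
import re

# Ordinal spellings mapped to the department number.
_ORDINALS = {
    "first": 1, "1st": 1,
    "second": 2, "2nd": 2, "2d": 2,
    "third": 3, "3rd": 3, "3d": 3,
    "fourth": 4, "4th": 4,
}

_DEPT_RE = re.compile(
    r"\b(" + "|".join(_ORDINALS) + r")\s+(?:dept\.?|department)\b",
    re.IGNORECASE,
)


def department_of(court_text):
    """Return the Appellate Division department (1-4), or None if not found."""
    match = _DEPT_RE.search(court_text)
    if match is None:
        return None
    return _ORDINALS[match.group(1).lower()]


if __name__ == "__main__":
    for text in (
        "N.Y. App. Div. 2nd Dept.",
        "Appellate Division, Fourth Department",
        "N.Y. Sup. Ct.",
    ):
        print(text, "->", department_of(text))
```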

It would take time to get the entire dataset from IA for ingestion, which may be the slowest part of the process. I think it would take about a week simply to get the data into the system, but I see no major obstacles besides writing code to match slip opinions and citations to the cases we already have in the system.
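
To put the one-week estimate in perspective, here is a back-of-the-envelope calculation of the sustained rate it implies. The figure of one million opinions is taken from the earlier comment as a round number; the function names are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def required_rate(n_opinions, days):
    """Opinions per second needed to ingest n_opinions in the given number of days."""
    return n_opinions / (days * SECONDS_PER_DAY)


def days_needed(n_opinions, opinions_per_second):
    """Days needed to ingest n_opinions at a sustained rate."""
    return n_opinions / opinions_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"{required_rate(1_000_000, 7):.2f} opinions/second for one week")
    print(f"{days_needed(1_000_000, 1.0):.1f} days at 1 opinion/second")
```

In other words, a week is realistic only if ingestion averages somewhat more than one and a half opinions per second around the clock.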

mlissner commented 3 years ago

This is excellent analysis. Thank you very much.

freelawproject / courtlistener

Focus on NY State #1629
