freelawproject / juriscraper

An API to scrape American court websites for metadata.
https://free.law/juriscraper/
BSD 2-Clause "Simplified" License

Enhance Juriscraper to Support Bundling of Separate Opinions #883

Open flooie opened 7 months ago

flooie commented 7 months ago

Issue Description:

A handful of courts publish separate opinions (such as concurrences and dissents) as distinct entries in their opinion lists, which neither juriscraper nor CourtListener (CL) currently supports. Without a way to bundle these separate opinions, the case information that gets scraped and processed can be incomplete or fragmented.

Suggested Enhancement:

I propose updating juriscraper to allow for the bundling of separate opinions. This enhancement would ensure that all opinions related to a case are collected and processed together, providing a more comprehensive view of the case proceedings and decisions.

Courts: (in progress list)

mlissner commented 7 months ago

To be clear here, what you're proposing is upgrading Juriscraper to return multiple opinion objects under one key, like we have with clusters/opinions in CL itself, right? Assuming so, can you provide a link or screenshot or something as an example?

flooie commented 7 months ago

Yes - I was working this thru in my head - before I laid out my vision.

flooie commented 7 months ago
{'date': '2/14/2023', 'docket': 'SC20164', 'name': 'State v. Juan A. G.-P.', 'opinion_type': '010combined'}

{'date': '1/31/2023', 'docket': 'SC20627', 'name': 'CT Freedom Alliance, LLC v. Dept. of Education', 'opinion_type': '010combined'}

{'date': '1/31/2023', 'docket': 'SC20633', 'name': 'Devine v. Fusaro', 'opinion_type': '010combined'}

{'date': '1/24/2023', 'docket': 'SC20679', 'name': 'Grant v. Commissioner of Correction', 'opinion_type': '010combined'}

{'date': '1/24/2023', 'docket': 'SC20371', 'name': 'State v. Brandon', 'opinion_type': '010combined'}
{'date': '1/24/2023', 'docket': 'SC20371', 'name': 'State v. Brandon', 'opinion_type': '030concurrence'}
{'date': '1/24/2023', 'docket': 'SC20371', 'name': 'State v. Brandon', 'opinion_type': '040dissent'}

{'date': '1/24/2023', 'docket': 'SC20597', 'name': 'Solon v. Slater', 'opinion_type': '010combined'}

{'date': '1/17/2023', 'docket': 'SC20453', 'name': 'State v. James A.', 'opinion_type': '010combined'}
{'date': '1/17/2023', 'docket': 'SC20453', 'name': 'State v. James A.', 'opinion_type': '030concurrence'}

I fixed and rewrote part of the Connecticut scraper to take advantage of the opinion_type changes. The entries above are some excerpts from self.cases.

We can take these results and call a method to combine the multiple opinions into clusters here, and then only slightly modify CL to save each opinion together with its cluster.
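A minimal sketch of what such a combining method could look like, assuming rows shaped like the self.cases excerpts above. The function name bundle_opinions, the choice of (docket, date) as the grouping key, and the output field names are all hypothetical, not part of juriscraper:

```python
def bundle_opinions(cases):
    """Group flat case rows into clusters with nested opinions.

    Assumption: rows sharing a docket number and date belong to one cluster.
    """
    clusters = {}
    for case in cases:
        key = (case["docket"], case["date"])
        cluster = clusters.setdefault(
            key,
            {
                "docket": case["docket"],
                "date": case["date"],
                "name": case["name"],
                "opinions": [],
            },
        )
        cluster["opinions"].append({"type": case["opinion_type"]})

    # The numeric prefix of opinion_type (010, 030, 040) gives a stable order.
    for cluster in clusters.values():
        cluster["opinions"].sort(key=lambda opinion: opinion["type"])
    return list(clusters.values())


# Rows copied from the excerpts above.
SAMPLE_CASES = [
    {"date": "1/24/2023", "docket": "SC20371", "name": "State v. Brandon", "opinion_type": "010combined"},
    {"date": "1/24/2023", "docket": "SC20371", "name": "State v. Brandon", "opinion_type": "030concurrence"},
    {"date": "1/24/2023", "docket": "SC20371", "name": "State v. Brandon", "opinion_type": "040dissent"},
    {"date": "1/24/2023", "docket": "SC20597", "name": "Solon v. Slater", "opinion_type": "010combined"},
]

clusters = bundle_opinions(SAMPLE_CASES)
for cluster in clusters:
    print(cluster["docket"], [opinion["type"] for opinion in cluster["opinions"]])
```

With the sample rows, State v. Brandon becomes one cluster holding three opinions and Solon v. Slater becomes a cluster holding one.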

mlissner commented 7 months ago

I'd expect this to mirror the fields in CL pretty closely. Why not do the joining in Juriscraper (JS) so that CL has a nice JSON object of clusters with nested opinions?

grossir commented 7 months ago

I checked the changes required on Courtlistener to support this new paradigm, while still supporting the legacy scrapers. I found the following:

Here is a branch where I show the changes needed in CL, which turned out to be rather small. This is still a proof of concept and would have to be tested and improved.

https://github.com/freelawproject/courtlistener/compare/main...grossir:courtlistener:support_juriscraper_nested_objects?expand=1

mlissner commented 7 months ago

Gianfranco, it's very OK to change CL as part of this, if it means making the interface better while hitting our design requirements. I'd rather do this now and have something we like instead of being stuck with half measures. Does that change your thinking about approach?

grossir commented 6 months ago

It took quite some time, but I have a working draft of the integration with CourtListener (which will be another, parallel PR). First I will paste some nice screenshots, then I will dive into some problems and opportunities I found while working on this.

Results

I used tex as a working scraper to test the new class. As a useful example, we have this recent Supreme Court case, which has an OpinionCluster of 3 opinions. This is how the cluster looks on my local docker env: [three screenshots]

How it currently looks on CourtListener: [screenshot]

Also, the scraper captures search_originating_court_information [screenshot] and extra columns for our usual objects.

Implementation details

It's better to look at the code, even if there is still pending work. I have written comments extensively.

https://github.com/freelawproject/juriscraper/pull/952

On Courtlistener: https://github.com/freelawproject/courtlistener/pull/3864

Besides the "code" code review, I will need some "data" review, to check whether I am using the nature_of_suit, cause, opinion.type, etc. fields properly.

Of note, I found a way to keep tests for examples of secondary/deferred pages. For tex it was as simple as tweaking the href leading to the secondary page, so that it points to the precise example file.

Pending work

I still have a bunch of bugs to solve and tests to write before this is mergeable.

Further work

There is a clear opportunity to scrape people_db objects, like Person, Party, Attorney, and to support them in cl_scrape_opinions. However, this would take more work and testing, since lookups for these objects have to be used.

Some bugs found on the way

Bugs in the OpinionSite[Linear] integration with CL: attributes that we can return (they are defined on the OpinionSite class) but that are never picked up in CL:

            "dispositions",
            "causes",
            "divisions",
            "docket_attachment_numbers",
            "docket_document_numbers",
            "lower_courts",
            "lower_court_judges",
            "lower_court_numbers",

These are actually used by some sources, so we are not inserting data that we do collect. For example, lower_courts is used in tenn, nev, ind, bap1, etc.
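One way to spot this kind of silent data loss is to check, per scraper, which of these attributes actually hold values after parsing. The sketch below is only illustrative: find_dropped_data is a hypothetical helper, and it assumes a parsed scraper object exposes each attribute as a list, or as None when the scraper does not collect it. A SimpleNamespace stands in for a real scraper here.

```python
from types import SimpleNamespace

# Attribute names quoted in the list above.
UNCONSUMED_ATTRIBUTES = [
    "dispositions",
    "causes",
    "divisions",
    "docket_attachment_numbers",
    "docket_document_numbers",
    "lower_courts",
    "lower_court_judges",
    "lower_court_numbers",
]


def find_dropped_data(site, attribute_names=UNCONSUMED_ATTRIBUTES):
    """Return {attribute: values} for listed attributes that hold data on `site`.

    Assumption: an attribute that is missing, None, or empty means the
    scraper collected nothing for it.
    """
    dropped = {}
    for name in attribute_names:
        values = getattr(site, name, None)
        if values:
            dropped[name] = values
    return dropped


# Hypothetical stand-in for a parsed scraper object (values are made up).
fake_site = SimpleNamespace(
    lower_courts=["Example Circuit Court", "Example Chancery Court"],
    dispositions=None,
)

dropped = find_dropped_data(fake_site)
print(dropped)
```

Any non-empty result means the scraper collected values that the current CL integration would discard.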