Enhance Juriscraper to Support Bundling of Separate Opinions

flooie commented 7 months ago

Issue Description:

A handful of courts publish separate opinions in their opinion lists, which juriscraper and CourtListener (CL) do not currently support. Without support for bundling separate opinions, the case information that gets scraped and processed can be incomplete or fragmented.

Suggested Enhancement:

I propose updating juriscraper to allow for the bundling of separate opinions. This enhancement would ensure that all opinions related to a case are collected and processed together, providing a more comprehensive view of the case proceedings and decisions.

Courts: (in progress list)

[ ] Connecticut
[ ] Conn. Court of Appeals
[ ] West Virginia
[ ] West Virginia Court of Appeals
[ ] Michigan Court of Appeals
[ ] Tennessee Supreme Court
[ ] Texas Supreme
[ ] Texas Court of Appeals

mlissner commented 7 months ago

To be clear here, what you're proposing is upgrading Juriscraper to return multiple opinion objects under one key, like we have with clusters/opinions in CL itself, right? Assuming so, can you provide a link or screenshot or something as an example?

flooie commented 7 months ago

Yes. I was working this through in my head before laying out my vision.

flooie commented 7 months ago

{'date': '2/14/2023', 'docket': 'SC20164', 'name': 'State v. Juan A. G.-P.', 'opinion_type': '010combined'}

{'date': '1/31/2023', 'docket': 'SC20627', 'name': 'CT Freedom Alliance, LLC v. Dept. of Education', 'opinion_type': '010combined'}

{'date': '1/31/2023', 'docket': 'SC20633', 'name': 'Devine v. Fusaro', 'opinion_type': '010combined'}

{'date': '1/24/2023', 'docket': 'SC20679', 'name': 'Grant v. Commissioner of Correction', 'opinion_type': '010combined'}

{'date': '1/24/2023', 'docket': 'SC20371', 'name': 'State v. Brandon', 'opinion_type': '010combined'}
{'date': '1/24/2023', 'docket': 'SC20371', 'name': 'State v. Brandon', 'opinion_type': '030concurrence'}
{'date': '1/24/2023', 'docket': 'SC20371', 'name': 'State v. Brandon', 'opinion_type': '040dissent'}

{'date': '1/24/2023', 'docket': 'SC20597', 'name': 'Solon v. Slater', 'opinion_type': '010combined'}

{'date': '1/17/2023', 'docket': 'SC20453', 'name': 'State v. James A.', 'opinion_type': '010combined'}
{'date': '1/17/2023', 'docket': 'SC20453', 'name': 'State v. James A.', 'opinion_type': '030concurrence'}

I fixed and rewrote part of Connecticut to take advantage of the opinion_type changes. Here are some excerpts from self.cases.

We can take these results and call a method that combines the multiple opinions into clusters, then only slightly modify CL to save each opinion together with its cluster.
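As a rough illustration of that combining step, here is a minimal sketch that groups the flat rows by docket number and date. The function name bundle_opinions and the output keys ("opinions", "type") are hypothetical, not part of juriscraper; the input rows are copied from the self.cases excerpt.

```python
def bundle_opinions(cases):
    """Group flat rows that share a docket number and date into one cluster.

    Hypothetical helper: the name and the output shape are assumptions.
    """
    clusters = {}  # dicts keep insertion order, so scrape order is preserved
    for case in cases:
        key = (case["docket"], case["date"])
        cluster = clusters.setdefault(
            key,
            {
                "docket": case["docket"],
                "date": case["date"],
                "name": case["name"],
                "opinions": [],
            },
        )
        cluster["opinions"].append({"type": case["opinion_type"]})
    return list(clusters.values())


# Rows copied from the self.cases excerpt in this thread.
cases = [
    {"date": "1/24/2023", "docket": "SC20371", "name": "State v. Brandon", "opinion_type": "010combined"},
    {"date": "1/24/2023", "docket": "SC20371", "name": "State v. Brandon", "opinion_type": "030concurrence"},
    {"date": "1/24/2023", "docket": "SC20371", "name": "State v. Brandon", "opinion_type": "040dissent"},
    {"date": "1/24/2023", "docket": "SC20597", "name": "Solon v. Slater", "opinion_type": "010combined"},
    {"date": "1/17/2023", "docket": "SC20453", "name": "State v. James A.", "opinion_type": "010combined"},
    {"date": "1/17/2023", "docket": "SC20453", "name": "State v. James A.", "opinion_type": "030concurrence"},
]

bundled = bundle_opinions(cases)
for cluster in bundled:
    print(cluster["docket"], [o["type"] for o in cluster["opinions"]])
```

Grouping on docket plus date is only one possible key; a court that reuses docket numbers or files separate opinions on different days would need a different rule.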

mlissner commented 7 months ago

I'd expect this to mirror the fields in CL pretty closely. Why not do the joining in JS so that CL has a nice JSON object of clusters with nested opinions?

grossir commented 7 months ago

I checked the changes required on Courtlistener to support this new paradigm, while still supporting the legacy scrapers. I found the following:

We can return a nested object, but we must keep a minimal interface (dict keys) for compatibility with the dup-checking tasks in cl_scrape_opinions

{
"Docket": {...},
"OpinionCluster": {...},
"Opinion": {...},
"case_names": "",  # for site.hash dup checking
"download_urls": "",  # for sha1 checking of url content
"precedential_statuses": "",  # for sha1 checking in case of nev
"case_dates": "",  # for sorting and dup checking
}

Even if we return objects of the following shape we would have to return an item for each opinion (because of dup checking), causing a somewhat ugly duplication

{
"OpinionCluster": {
    "Opinion": [
        {...},
        {...}
    ]
}
}
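To make the duplication concrete, here is a minimal sketch that emits one item per opinion, each repeating the shared Docket and OpinionCluster next to the flat keys used for dup checking. The helper name to_legacy_items, the inner field names (case_name, date_filed, precedential_status, download_url, docket_number) and the example.com URLs are all assumptions for illustration, not the actual juriscraper or CL interface.

```python
def to_legacy_items(docket, cluster, opinions):
    """Return one item per opinion, repeating the shared docket and cluster.

    Hypothetical helper: only the four flat keys come from the thread;
    every inner field name is an assumption.
    """
    items = []
    for opinion in opinions:
        items.append(
            {
                "Docket": docket,
                "OpinionCluster": cluster,
                "Opinion": opinion,
                # Flat keys kept for the existing dup-checking code paths.
                "case_names": cluster["case_name"],
                "download_urls": opinion["download_url"],
                "precedential_statuses": cluster["precedential_status"],
                "case_dates": cluster["date_filed"],
            }
        )
    return items


docket = {"docket_number": "SC20371"}
cluster = {
    "case_name": "State v. Brandon",
    "date_filed": "1/24/2023",
    "precedential_status": "Published",  # placeholder value
}
opinions = [
    {"type": "010combined", "download_url": "https://example.com/SC20371.pdf"},
    {"type": "040dissent", "download_url": "https://example.com/SC20371E.pdf"},
]

items = to_legacy_items(docket, cluster, opinions)
print(len(items), [item["download_urls"] for item in items])
```

Both items point at the same cluster dict, which is the "somewhat ugly duplication": the consumer has to recognise that the two items belong to one cluster.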

Returning objects that are ready to create means we must pass all required values.
This object approach gives us more flexibility to add fields as we find them, without having to modify CL each time.
The returned objects can be validated with a JSON Schema, as discussed in #838.
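For a feel of what checking "all required values" on a nested object involves, here is a small standard-library sketch. It is not JSON Schema and not the validator discussed in #838; it is a stand-in that only reports missing required keys, and the spec format, the function name missing_keys, and the field names are assumptions.

```python
def missing_keys(obj, spec, path=""):
    """Return dotted paths of required keys that are absent from obj.

    spec maps each required key to None (leaf) or to a nested spec dict.
    A nested spec is applied to a dict value, or to every item of a list.
    """
    if not isinstance(obj, dict):
        return [path or "<root>"]  # expected an object at this position
    missing = []
    for key, sub in spec.items():
        here = f"{path}.{key}" if path else key
        if key not in obj:
            missing.append(here)
        elif isinstance(sub, dict):
            value = obj[key]
            if isinstance(value, list):
                for i, child in enumerate(value):
                    missing.extend(missing_keys(child, sub, f"{here}[{i}]"))
            else:
                missing.extend(missing_keys(value, sub, here))
    return missing


# Hypothetical requirements, loosely following the nested shape in this thread.
spec = {
    "OpinionCluster": {
        "case_name": None,
        "Opinion": {"type": None, "download_url": None},
    }
}

record = {
    "OpinionCluster": {
        "case_name": "State v. Brandon",
        "Opinion": [
            {"type": "010combined", "download_url": "https://example.com/a.pdf"},
            {"type": "040dissent"},  # download_url deliberately left out
        ],
    }
}

problems = missing_keys(record, spec)
print(problems)
```

A real JSON Schema would also cover types, formats and allowed values; this sketch only shows why the check has to recurse into each nested opinion.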

Here is a branch showing the changes needed in CL, which turned out to be rather small. This is still a concept and would have to be tested and improved.

https://github.com/freelawproject/courtlistener/compare/main...grossir:courtlistener:support_juriscraper_nested_objects?expand=1

mlissner commented 7 months ago

Gianfranco, it's very OK to change CL as part of this, if it means making the interface better while hitting our design requirements. I'd rather do this now and have something we like instead of being stuck with half measures. Does that change your thinking about approach?

grossir commented 6 months ago

It took quite some time, but I have a working draft of the integration with Courtlistener (which will be a separate, parallel PR). First I will paste some nice screenshots, then I will dive into some problems and opportunities I found while working on this.

Results

I used tex as a working scraper to test the new class. As a useful example, we have this recent Supreme Court case, which has an OpinionCluster of 3 opinions. This is how the cluster looks on my local docker env:

How it currently looks on Courtlistener

Also, the scraper captures search_originating_court_information and extra columns for our usual objects

Implementation details

It's better to look at the code, even if there is still pending work. I have written comments extensively.

https://github.com/freelawproject/juriscraper/pull/952

On Courtlistener: https://github.com/freelawproject/courtlistener/pull/3864

Besides the "code" code review, I will need some "data" code review, to see whether I am using the nature_of_suit, cause, opinion.type, etc. fields properly.

Of note, I found a way to keep tests of secondary/deferred page's examples. For tex it was as simple as tweaking the href leading to the secondary page, so that it points to the precise example file.

Pending work

I still have a bunch of bugs to solve and tests to write before this is mergeable:

writing tests for the JSON Validator (I know it is currently not validating nested objects)
writing a custom type checker for python dates
support deferring attributes
adapting texcrimapp and texapp_* to the new tex class

Further work

There is a clear opportunity to scrape people_db objects, like Person, Party, Attorney, and to support them in cl_scrape_opinions. However, this would take more work and testing, since lookups for these objects have to be used.

Some bugs found on the way

Bugs on OpinionSite[Linear] integration with CL: Attributes that we can return but are never picked up in CL (defined on OpinionSite class)

            "dispositions",
            "causes",
            "divisions",
            "docket_attachment_numbers",
            "docket_document_numbers",
            "lower_courts",
            "lower_court_judges",
            "lower_court_numbers",

These are actually used by some sources, so we are not inserting data that we do collect. For example, lower_courts is used in tenn, nev, ind, bap1, etc.
