Document changes and enhancements made to Harvard data during import

mlissner commented 3 months ago

@quevon24 and @flooie, we're working with folks from Harvard (and others) to bring our system up to parity with theirs.

A big question that has come up several times is: what did we do to enhance, change, or otherwise modify the Harvard data while importing it?

Is it possible to document that here so that we can merge our changes in with those Harvard has recently made?

flooie commented 3 months ago

@mlissner

All of the changes to the source data are in this repository. They can generally be grouped into a few categories:

Small OCR mistakes - wrong dates / typos
Citation Fixes etc.

Structural Fixes

These take the form of wrapping the small opinion in an opinion tag. We found lots of empty opinions.

Updating the tags. We found that lots of opinions were not correctly wrapped as opinions: often the opinion would start in the headmatter, or opinion content like concurrences would not be identified as opinions and would just be tags in between the majority and dissents. This mostly uses an ML model to make a good guess at what something should be so we could import it properly.

In PR 54 we addressed some footnote issues where footnote text was disconnected from the opinion and did not follow standard practice. Where we could identify them, we reconnected them.

I believe we also linked directly to case.law for cases where CaseLaw indicates that no opinion was found: we added a link to the PDF so users could see it for themselves.
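
For anyone trying to picture the wrapping fix described under Structural Fixes, here is a minimal sketch. It is not the code from this repository: the `casebody`, `opinion`, and `p` element names, the `type` attribute, and the `wrap_loose_paragraphs` helper are all assumptions made up for illustration, and the real data carries more structure than this.

```python
import xml.etree.ElementTree as ET


def wrap_loose_paragraphs(casebody_xml, opinion_type="majority"):
    """Wrap paragraphs that sit directly under the root in one opinion tag.

    Hypothetical helper: element and attribute names are assumed, not taken
    from the real import code.
    """
    root = ET.fromstring(casebody_xml)
    if root.find("opinion") is not None:
        # An opinion wrapper already exists, so leave the document alone.
        return casebody_xml
    loose = [child for child in list(root) if child.tag == "p"]
    if not loose:
        return casebody_xml
    opinion = ET.Element("opinion", {"type": opinion_type})
    for child in loose:
        root.remove(child)
        opinion.append(child)
    root.append(opinion)
    return ET.tostring(root, encoding="unicode")


before = "<casebody><p>Judgment affirmed.</p></casebody>"
after = wrap_loose_paragraphs(before)
print(after)
```

Running it a second time on its own output changes nothing, which is the property you would want if the fix were re-applied on a later import.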

mlissner commented 3 months ago

Very helpful, thanks Bill. So I think the changes that the folks at LIL made are all to the body of the case. Can you comment on which of our fixes could intersect with that, if any? And if some do, can you say whether you're confident that we would have our fixes in this repo?

Looking at the things you posted above, I'm thinking maybe, just maybe, this is easier than we think, if LIL made fixes to one part of the JSON, and we made them to another.
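
To make that concrete, here is a rough sketch of a field-by-field three-way merge, assuming we have the original Harvard record, our fixed copy, and LIL's updated copy as plain dicts. The field names (`decision_date`, `casebody`, `citations`) and the `three_way_merge` helper are hypothetical, and whether we can actually recover a clean "base" copy is an open question.

```python
import copy

_MISSING = object()  # distinguishes "key absent" from "value is None"


def three_way_merge(base, ours, theirs):
    """Merge two independently edited copies of a record, key by key.

    Returns (merged, conflicts). A key is a conflict when both sides
    changed it to different values; the base value is kept in that case.
    """
    merged = copy.deepcopy(base)
    conflicts = []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b = base.get(key, _MISSING)
        o = ours.get(key, _MISSING)
        t = theirs.get(key, _MISSING)
        if o == t:
            value = o  # both sides agree (or neither changed it)
        elif o == b:
            value = t  # only their side changed this key
        elif t == b:
            value = o  # only our side changed this key
        else:
            conflicts.append(key)  # both changed it differently
            continue
        if value is _MISSING:
            merged.pop(key, None)
        else:
            merged[key] = copy.deepcopy(value)
    return merged, conflicts


base = {"decision_date": "1805-03-01", "casebody": "<p>teh text</p>", "citations": ["1 Mass. 1"]}
ours = dict(base, decision_date="1804-03-01")      # our fix: metadata only
theirs = dict(base, casebody="<p>the text</p>")    # their fix: body only

merged, conflicts = three_way_merge(base, ours, theirs)
print(merged, conflicts)
```

If the two sets of fixes really do touch different parts of the JSON, `conflicts` comes back empty and the merge is mechanical. Any key that shows up in `conflicts` is one we would need to look at by hand.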

freelawproject / opinionated

Document changes and enhancements made to Harvard data during import #61