freelawproject / opinionated


~47,000 broken opinions #34

Closed · flooie closed this 2 years ago

flooie commented 2 years ago

We have an issue with the case.law data being poorly parsed, overly redacted, or otherwise broken for around 46,579 opinions.

Unfortunately, roughly 1,226 of these we have identified as not actually containing an opinion at all. They appear to form a larger group of missing opinions: overly redacted or otherwise oddball documents.

Normally these are one-line opinions, bunched up on a page.

This leaves the vast majority, around 44,125 opinions, that were malformed. In many of these the full opinion is simply "Vacated", "Remanded", or "Case Dismissed". But, as is common, there was no clear pattern or consistent "bad XML" that we could simply reverse.

To handle the task of identifying the missing or hidden opinions, I trained a Maximum Entropy text classifier using CreateML, keyed to 12 categories based on good data we had from the Harvard data set. See the distribution of data below.

The training and testing dataset was generated from a random sample of 650 opinions, extracting all of the fields (excluding sub-tags like br, strong, em, and extracted-citation). 650 is the sample size required, for a population of this size, to achieve a 99% confidence level with a 5% margin of error.
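For reference, the 650 figure matches the standard Cochran sample-size formula with a finite-population correction. The z-score for 99% confidence and the worst-case proportion p = 0.5 below are the usual assumptions for this calculation, not values stated in this thread:

```latex
% Cochran's formula: z = 2.576 (99% confidence), e = 0.05 (margin), p = 0.5
n_0 = \frac{z^2\,p(1-p)}{e^2}
    = \frac{2.576^2 \times 0.25}{0.05^2}
    \approx 664

% Finite-population correction for N = 46{,}579 candidate opinions
n = \frac{n_0}{1 + \frac{n_0 - 1}{N}}
  = \frac{664}{1 + 663/46579}
  \approx 654
```

That lands within a few documents of the 650 actually sampled.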

[Image: mlmodeltrainingset, showing the label distribution of the training data]
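For anyone reproducing this, the CreateML workflow looks roughly like the sketch below. The file name, column names, and split ratio are placeholders of mine, and the `ModelParameters` spelling follows Apple's documented API; treat it as a sketch rather than the exact script used here:

```swift
import CreateML
import Foundation

// Load the labeled sample as an MLDataTable (hypothetical path and columns).
// Each row holds one extracted tag's text and one of the 12 category labels.
let data = try MLDataTable(contentsOf: URL(fileURLWithPath: "training_set.json"))

// Hold out part of the sample for testing.
let (trainingData, testingData) = data.randomSplit(by: 0.8, seed: 42)

// Train a Maximum Entropy text classifier.
let parameters = MLTextClassifier.ModelParameters(algorithm: .maxEnt(revision: 1))
let classifier = try MLTextClassifier(
    trainingData: trainingData,
    textColumn: "text",
    labelColumn: "label",
    parameters: parameters
)

// Check accuracy on the held-out rows.
let metrics = classifier.evaluation(on: testingData, textColumn: "text", labelColumn: "label")
print("Classification error: \(metrics.classificationError)")

// Persist the model for later inference.
try classifier.write(to: URL(fileURLWithPath: "OpinionClassifier.mlmodel"))
```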

This generated a training set that, after it was parsed over, identified roughly 1,000 tags that I deemed were opinions. That number is larger than 650 because opinions were sometimes spread across multiple tags (or multiple attorney tags, etc.). I also gradually added false negatives back in as I reviewed and improved the training set.

This method was effective but not quite as accurate as I would have liked: roughly 93% validation accuracy.

After feeding bad results back into the training set, I switched to a transfer-learning text classifier with dynamic embeddings. This eventually increased the validation accuracy to >99.1%.
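Switching algorithms in CreateML is a one-line change to the model parameters. A sketch, with the enum spelling taken from Apple's CreateML documentation for transfer learning with dynamic word embeddings:

```swift
import CreateML

// Same training call as before; only the algorithm choice changes.
let parameters = MLTextClassifier.ModelParameters(
    algorithm: .transferLearning(.dynamicEmbedding, revision: 1)
)
```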

[Screenshot, 2022-10-10: validation accuracy results for the transfer-learning model]

In actuality, this was closer to 99.9% accurate when identifying just opinions; I have yet to see a misclassified opinion while reviewing roughly 1,000 generated HTML files.

With our ML model in hand, it was relatively easy to move all opinion data into the opinion tags and to identify the truly empty opinions mentioned at the start: we simply moved the content of tags classified as opinion data into the empty opinion tags in the bad case.law data.
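Applying the trained model to each stray tag is straightforward with the NaturalLanguage framework. A minimal sketch; the model file name and sample strings are placeholders:

```swift
import Foundation
import NaturalLanguage

// Load the compiled Core ML text classifier (hypothetical file name).
let model = try NLModel(contentsOf: URL(fileURLWithPath: "OpinionClassifier.mlmodelc"))

// Classify each extracted tag's text; anything labeled "opinion" gets moved
// into the empty opinion element of the corresponding case.law record.
let tagTexts = ["Vacated.", "John Smith, for appellant."]
for text in tagTexts {
    let label = model.predictedLabel(for: text) ?? "unknown"
    print("\(label): \(text)")
}
```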

@mlissner

We still have a good portion, roughly 2% of this final push, that contains no opinions. I added a [NO OPINION] text to these opinions and included them in this push, but I would like your thoughts on this decision.

Of note: this was trained via CreateML, an Apple-only ecosystem, both for training the model and for generating results.

flooie commented 2 years ago

I reworked the training model into a binary opinion/not-opinion classifier, and these were my results:

[Screenshot, 2022-10-11: validation results for the binary classifier]

Not the 99.9% we want, but 99.6% on the validation data set.
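Collapsing the 12 categories into a binary label is a small preprocessing step before retraining. A sketch, with the file shape and label names being placeholders consistent with the earlier snippets:

```swift
import CreateML
import Foundation

// Decode the labeled sample (hypothetical file and shape) with plain Codable.
struct Row: Codable {
    let text: String
    var label: String
}
let url = URL(fileURLWithPath: "training_set.json")
var rows = try JSONDecoder().decode([Row].self, from: Data(contentsOf: url))

// Collapse the 12 categories down to opinion / not_opinion.
for i in rows.indices {
    rows[i].label = rows[i].label == "opinion" ? "opinion" : "not_opinion"
}

// Rebuild an MLDataTable from the relabeled rows and retrain as before.
let table = try MLDataTable(dictionary: [
    "text": rows.map { $0.text },
    "label": rows.map { $0.label }
])
```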