freelawproject / eyecite

Find legal citations in any block of text
https://freelawproject.github.io/eyecite/
BSD 2-Clause "Simplified" License
123 stars 32 forks source link

Court name issues #129

Open bbernicker opened 2 years ago

bbernicker commented 2 years ago

This PR addresses #128. The previous code returned the court ID of the first court for which the courts-db cite string started with the court abbreviation eyecite detected in the parenthetical. There was a comment on that code explaining that it used startswith instead of requiring a match because the court abbreviations from eyecite were often missing final punctuation.

This PR looks for courts in courts-db for which the cite string, with or without its final character, matches. This solution is not perfect because the citation_string field in courts-db has many duplicates: for 1981 courts, there are only 1174 unique citation strings. On the other hand, allowing matches without regard to the last character does not seem to make the problem worse, as there are only two citation strings which are the same but for their last character.

Ultimately, ensuring that Eyecite can correctly determine the court which rendered an opinion from a citation will require cleaning up courts-db so that duplicate citation_strings can be disambiguated using other info (such as reporter) and so that alternate citation_strings are permitted for courts (see freelawproject/courts-db#53). We should also add a more robust system for determining the rendering court from the reporter (e.g. with vendor neutral citations) or from neighboring parallel cites (esp. any in the metadata.extras field).

Finally, I added some tests to check that Eyecite is handling court names reasonably well.

bbernicker commented 2 years ago

I ran the tests natively and they all passed. Most of the commit checks that are failing seem to be due to my linter allowing slightly wider lines than your linter (a problem I have noticed while commiting to other FLP projects, and that I should probably address), but I don't know what about this commit could be affecting fast_diff_match_patch and causing build 3.10 to fail.

flooie commented 2 years ago

I'm actively working on the tests here in courts-db as im putting in some medium sized changes and thinking about how it should work and could work better.

mlissner commented 2 years ago

@flooie can you also opine about how this bug affects CourtListener itself?

bbernicker commented 2 years ago

I'm actively working on the tests here in courts-db as im putting in some medium sized changes and thinking about how it should work and could work better.

If there is anything I can do to help, I'd be happy to jump on a call to talk though it.

flooie commented 2 years ago

@bbernicker are you in our slack channel yet?

bbernicker commented 2 years ago

@bbernicker are you in our slack channel yet?

No not yet.

mlissner commented 2 years ago

Just sent an invite, @bbernicker.

bbernicker commented 1 year ago

Any updates on this PR @flooie?

mattdahl commented 1 year ago

@bbernicker I was just reviewing this PR in light of #144 and the changes made to courts-db in #135 (comment). I think your proposed change is still appropriate and necessary, but it's a little complicated.

Right now, we use startswith() to check for court string equality. The idea is that we want to match edge cases with missing final periods, e.g., 2d. Cir in an parsed citation should match 2d. Cir. from courts-db.

This is problematic, as you point out, because something like Tex. will match any number of longer strings from courts-db, e.g., Tex. Rev., Tex. Rev. Trib., etc., in addition to the simple Tex. court that it should match solely. So you propose replacing startswith() with a check for exact equality first, and, in the alternative, equality minus the final character.

This solves the Tex.-style collision problem, but introduces a new one: What if there are legitimate substrings that should be matched, but that differ in more ways than just the last character? For example, we want Commonwealth v. Bauer, 604 A.2d 1098 (Pa. Super. 1992) to be linked to the pasuperct slug, but the citation string for that slug is Pa. Super. Ct.! Since the two strings differ by more than just a missing period (one is missing the full Ct. suffix), they will not be matched under your code change, though they would be under the existing greedy startswith() approach.

I still think getting rid of startswith() is the better option, but we may want to consider this new kind of error that doing so will introduce. One solution, I think, would be to enforce strict equality first, and then use startswith() only as a backup, but I'd have to think about how to implement that efficiently. In any case, the issue your PR is addressing is not at all moot (as I had erroneously thought before)!

bbernicker commented 1 year ago

@bbernicker I was just reviewing this PR in light of #144 and the changes made to courts-db in #135 (comment). I think your proposed change is still appropriate and necessary, but it's a little complicated.

Right now, we use startswith() to check for court string equality. The idea is that we want to match edge cases with missing final periods, e.g., 2d. Cir in an parsed citation should match 2d. Cir. from courts-db.

This is problematic, as you point out, because something like Tex. will match any number of longer strings from courts-db, e.g., Tex. Rev., Tex. Rev. Trib., etc., in addition to the simple Tex. court that it should match solely. So you propose replacing startswith() with a check for exact equality first, and, in the alternative, equality minus the final character.

This solves the Tex.-style collision problem, but introduces a new one: What if there are legitimate substrings that should be matched, but that differ in more ways than just the last character? For example, we want Commonwealth v. Bauer, 604 A.2d 1098 (Pa. Super. 1992) to be linked to the pasuperct slug, but the citation string for that slug is Pa. Super. Ct.! Since the two strings differ by more than just a missing period (one is missing the full Ct. suffix), they will not be matched under your code change, though they would be under the existing greedy startswith() approach.

I still think getting rid of startswith() is the better option, but we may want to consider this new kind of error that doing so will introduce. One solution, I think, would be to enforce strict equality first, and then use startswith() only as a backup, but I'd have to think about how to implement that efficiently. In any case, the issue your PR is addressing is not at all moot (as I had erroneously thought before)!

Thanks Matt. This is really helpful. Let me give this problem some thought.