Open bbernicker opened 2 years ago
I ran the tests natively and they all passed. Most of the commit checks that are failing seem to be due to my linter allowing slightly wider lines than your linter (a problem I have noticed while commiting to other FLP projects, and that I should probably address), but I don't know what about this commit could be affecting fast_diff_match_patch and causing build 3.10 to fail.
I'm actively working on the tests here in courts-db as im putting in some medium sized changes and thinking about how it should work and could work better.
@flooie can you also opine about how this bug affects CourtListener itself?
I'm actively working on the tests here in courts-db as im putting in some medium sized changes and thinking about how it should work and could work better.
If there is anything I can do to help, I'd be happy to jump on a call to talk though it.
@bbernicker are you in our slack channel yet?
@bbernicker are you in our slack channel yet?
No not yet.
Just sent an invite, @bbernicker.
Any updates on this PR @flooie?
@bbernicker I was just reviewing this PR in light of #144 and the changes made to courts-db
in #135 (comment). I think your proposed change is still appropriate and necessary, but it's a little complicated.
Right now, we use startswith()
to check for court string equality. The idea is that we want to match edge cases with missing final periods, e.g., 2d. Cir
in an parsed citation should match 2d. Cir.
from courts-db
.
This is problematic, as you point out, because something like Tex.
will match any number of longer strings from courts-db
, e.g., Tex. Rev.
, Tex. Rev. Trib.
, etc., in addition to the simple Tex.
court that it should match solely. So you propose replacing startswith()
with a check for exact equality first, and, in the alternative, equality minus the final character.
This solves the Tex.
-style collision problem, but introduces a new one: What if there are legitimate substrings that should be matched, but that differ in more ways than just the last character? For example, we want Commonwealth v. Bauer, 604 A.2d 1098 (Pa. Super. 1992)
to be linked to the pasuperct
slug, but the citation string for that slug is Pa. Super. Ct.
! Since the two strings differ by more than just a missing period (one is missing the full Ct.
suffix), they will not be matched under your code change, though they would be under the existing greedy startswith()
approach.
I still think getting rid of startswith()
is the better option, but we may want to consider this new kind of error that doing so will introduce. One solution, I think, would be to enforce strict equality first, and then use startswith()
only as a backup, but I'd have to think about how to implement that efficiently. In any case, the issue your PR is addressing is not at all moot (as I had erroneously thought before)!
@bbernicker I was just reviewing this PR in light of #144 and the changes made to
courts-db
in #135 (comment). I think your proposed change is still appropriate and necessary, but it's a little complicated.Right now, we use
startswith()
to check for court string equality. The idea is that we want to match edge cases with missing final periods, e.g.,2d. Cir
in an parsed citation should match2d. Cir.
fromcourts-db
.This is problematic, as you point out, because something like
Tex.
will match any number of longer strings fromcourts-db
, e.g.,Tex. Rev.
,Tex. Rev. Trib.
, etc., in addition to the simpleTex.
court that it should match solely. So you propose replacingstartswith()
with a check for exact equality first, and, in the alternative, equality minus the final character.This solves the
Tex.
-style collision problem, but introduces a new one: What if there are legitimate substrings that should be matched, but that differ in more ways than just the last character? For example, we wantCommonwealth v. Bauer, 604 A.2d 1098 (Pa. Super. 1992)
to be linked to thepasuperct
slug, but the citation string for that slug isPa. Super. Ct.
! Since the two strings differ by more than just a missing period (one is missing the fullCt.
suffix), they will not be matched under your code change, though they would be under the existing greedystartswith()
approach.I still think getting rid of
startswith()
is the better option, but we may want to consider this new kind of error that doing so will introduce. One solution, I think, would be to enforce strict equality first, and then usestartswith()
only as a backup, but I'd have to think about how to implement that efficiently. In any case, the issue your PR is addressing is not at all moot (as I had erroneously thought before)!
Thanks Matt. This is really helpful. Let me give this problem some thought.
This PR addresses #128. The previous code returned the court ID of the first court for which the courts-db cite string started with the court abbreviation eyecite detected in the parenthetical. There was a comment on that code explaining that it used startswith instead of requiring a match because the court abbreviations from eyecite were often missing final punctuation.
This PR looks for courts in courts-db for which the cite string, with or without its final character, matches. This solution is not perfect because the citation_string field in courts-db has many duplicates: for 1981 courts, there are only 1174 unique citation strings. On the other hand, allowing matches without regard to the last character does not seem to make the problem worse, as there are only two citation strings which are the same but for their last character.
Ultimately, ensuring that Eyecite can correctly determine the court which rendered an opinion from a citation will require cleaning up courts-db so that duplicate citation_strings can be disambiguated using other info (such as reporter) and so that alternate citation_strings are permitted for courts (see freelawproject/courts-db#53). We should also add a more robust system for determining the rendering court from the reporter (e.g. with vendor neutral citations) or from neighboring parallel cites (esp. any in the metadata.extras field).
Finally, I added some tests to check that Eyecite is handling court names reasonably well.