freelawproject / eyecite

Find legal citations in any block of text
https://freelawproject.github.io/eyecite/
BSD 2-Clause "Simplified" License
114 stars 27 forks source link

FullCaseCitation.group have badly formatted reporter #147

Open overmode opened 1 year ago

overmode commented 1 year ago

Hey, Thanks again for the library, we're using it everyday :)

We're currently trying to come up with a unique string representation for citations, one that would be unique for all citations being semantically the same, but would still be human-readable.

We've tried to use the corrected_citation(), but it fails in some cases, here are a few examples where the output is the same as matched_text() : 5 U.S.C. §702, 5 U.S.C. § 702, 5 U.S.C. §§702, 5 U.S.C. §702 Note that the corrected_citation_full function also fails in mapping these examples to the same string.

Because these limitations seem to be hard to fix, we gave up on the human-readable requirement and just want to have a unique representation, so exactly what comparison_hash does.

This issue is about fixing a corner case that we spotted :

Cit1 : FullCaseCitation('506 U.S. 534', groups={'volume': '506', 'reporter': 'U.S.', 'page': '534'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite='539', year='1993', court='scotus', plaintiff='STATES', defendant='STINSONStates', extra=None))
Cit2 : FullCaseCitation('506 U. S. 534', groups={'volume': '506', 'reporter': 'U. S.', 'page': '534'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite=None, year=None, court='scotus', plaintiff='STATES', defendant='STINSONStates', extra=None))]
Cit1.corrected_citation() : 506 U.S. 534
Cit2.corrected_citation() : 506 U.S. 534

Here the corrected citation for the second string has the corrected reporter, but this was not propagated to the groups attribute, thus the comparison_hash is different. Ideally, I guess we would expect the reporter to be corrected before writing it in the groups.

I'm happy to open a PR if you think it makes sense

mlissner commented 1 year ago

Ideally, I guess we would expect the reporter to be corrected before writing it in the groups.

I'm actually not sure how groups is used these days, so I'm not sure how to respond to this without looking at the code. If groups is used to re-create the linkified text, then you don't want corrections in there. We did that once years ago, and it was a mistake because sometimes the corrections are imperfect, and that can be a real problem if you take a good citation with a bad reporter and convert it to something else (with a valid, but wrong reporter).

Can you remind me how groups is used? Depending on that, it's either working correctly, or broken, like you say.

overmode commented 1 year ago
mattdahl commented 1 year ago

Ideally, I guess we would expect the reporter to be corrected before writing it in the groups.

The Citation.groups dict is set on init to the content of the Token.groups dict, which is itself generated from calling groupdict() on the matched regex object during parsing. (Hence the variable name "groups" bubbles up, though it is a bit of a misnomer at the Citation object level.) So there is no way to "correct" the reporter prior to writing it into the groups dict; indeed, the correction process presupposes that the matched reporter key is already in the groups dict.

Would it work instead for you to change how CitationBase objects are hashed here? https://github.com/freelawproject/eyecite/blob/main/eyecite/models.py#L117

I.e., instead of making a tuple of self.groups.items(), make a tuple of self.groups['volume'], self.corrected_reporter(), self.groups['page']? N.b. this might have to be overridden at the ResourceCitation level instead because not all citation object types have a corrected_reporter() method.

overmode commented 1 year ago

@mattdahl thanks for the explanation. I don't see why it would not be possible to "correct" the reporter in the __post_init__ function just after self.groups is set, but modifying the hash should indeed be sufficient to fix this comparison_hash, I can do the change if you think it's the way to go

mlissner commented 1 year ago

I'm over my head and explicitly bowing out, if you guys have a sense of the direction to go, but if the conversation goes stale or you need a BDFL, please feel free to flag it for me and I can try to understand what y'all are talking about. :)

mattdahl commented 1 year ago

@mattdahl thanks for the explanation. I don't see why it would not be possible to "correct" the reporter in the post_init function just after self.groups is set, but modifying the hash should indeed be sufficient to fix this comparison_hash, I can do the change if you think it's the way to go

Yeah, I agree it would be technically possible there, but the idea of the groups dict is to store the actual results from calling m.groupdict(). I can't give a specific example right now of why we need to retain this, but conceptually it seems dangerous to modify the dict contents post-hoc to be something that wasn't actually matched from the regex. So I would suggest changing the hash function instead, if that's all right with you!

overmode commented 1 year ago

Noted 👍 By the way, what do you think of the corner cases of the corrected_citation ? Any chance that we get them fixed in the near future ?

overmode commented 1 year ago

Created the PR

overmode commented 1 year ago

Whoops, it seems like we have another problem with the comparison hash. My colleague @alessandro-ciffo found that the comparison hash changes every time you run a new script.

The following code :

frm eyecite import get_citations
cit_str = "482 S.E.2d 805"
cit = get_citations(cit_str)[0]
print(cit.comparison_hash())

Will print a new value every time you run it, because python adds randomness to the hash function (as stated here).

What do you think of using the hashlib library instead ? We could use something like

from hashlib import sha1
import json
def comparison_hash(self) :
    tup = (str(type(self)), tuple(self.groups["volume"], self.corrected_reporter(), self.groups["page"]))
    return sha1(json.dumps(tup).encode('utf-8'))

(Probably not a valid syntax but you get the idea)

mattdahl commented 1 year ago

Ah, interesting! It does seem to make sense to me that the hashes should be reproducible across runs. The hashlib module (part of the standard library, right?) seems reasonable to me, but I would summon @mlissner back for this

mattdahl commented 1 year ago

By the way, what do you think of the corner cases of the corrected_citation ? Any chance that we get them fixed in the near future ?

You mean the FullLawCitation corner cases with 5 U.S.C. §702, 5 U.S.C. § 702, 5 U.S.C. §§702, 5 U.S.C. §702, etc.? I agree that it does seem like those should all be normalized to a common string (rather than just returning the result of matched_text()). The spacing issue could be fixed easily, but the double § symbol might be more complicated. We would have to make sure that we only normalized §§ to § in the case where there is indeed only one section listed (as in the example). In any case, the trick would be to override the corrected_citation() method on the FullLawCitation class here: https://github.com/freelawproject/eyecite/blob/main/eyecite/models.py#L280

mlissner commented 1 year ago

Sure, using the hashlib seems fine to me. We have some simple wrappers for it in CL:

https://github.com/freelawproject/courtlistener/blob/main/cl/lib/crypto.py

I think I'd suggest SHA256 for this, though md5 would produce shorter hashes and be negligibly faster. It's less secure, but that doesn't matter here.