Open overmode opened 1 year ago
Ideally, I guess we would expect the reporter to be corrected before writing it in the groups.
I'm actually not sure how groups is used these days, so I'm not sure how to respond to this without looking at the code. If groups is used to re-create the linkified text, then you don't want corrections in there. We did that once years ago, and it was a mistake because sometimes the corrections are imperfect, and that can be a real problem if you take a good citation with a bad reporter and convert it to something else (with a valid, but wrong reporter).
Can you remind me how groups is used? Depending on that, it's either working correctly, or broken, like you say.
Ideally, I guess we would expect the reporter to be corrected before writing it in the groups.
The Citation.groups
dict is set on init to the content of the Token.groups
dict, which is itself generated from calling groupdict()
on the matched regex object during parsing. (Hence the variable name "groups" bubbles up, though it is a bit of a misnomer at the Citation
object level.) So there is no way to "correct" the reporter prior to writing it into the groups dict; indeed, the correction process presupposes that the matched reporter
key is already in the groups dict.
Would it work instead for you to change how CitationBase
objects are hashed here? https://github.com/freelawproject/eyecite/blob/main/eyecite/models.py#L117
I.e., instead of making a tuple of self.groups.items()
, make a tuple of self.groups['volume'], self.corrected_reporter(), self.groups['page']
? N.b. this might have to be overridden at the ResourceCitation
level instead because not all citation object types have a corrected_reporter()
method.
@mattdahl thanks for the explanation. I don't see why it would not be possible to "correct" the reporter in the __post_init__ function just after self.groups is set, but modifying the hash should indeed be sufficient to fix this comparison_hash, I can do the change if you think it's the way to go
I'm over my head and explicitly bowing out, if you guys have a sense of the direction to go, but if the conversation goes stale or you need a BDFL, please feel free to flag it for me and I can try to understand what y'all are talking about. :)
@mattdahl thanks for the explanation. I don't see why it would not be possible to "correct" the reporter in the post_init function just after self.groups is set, but modifying the hash should indeed be sufficient to fix this comparison_hash, I can do the change if you think it's the way to go
Yeah, I agree it would be technically possible there, but the idea of the groups
dict is to store the actual results from calling m.groupdict()
. I can't give a specific example right now of why we need to retain this, but conceptually it seems dangerous to modify the dict contents post-hoc to be something that wasn't actually matched from the regex. So I would suggest changing the hash function instead, if that's all right with you!
Noted 👍 By the way, what do you think of the corner cases of the corrected_citation ? Any chance that we get them fixed in the near future ?
Created the PR
Whoops, it seems like we have another problem with the comparison hash. My colleague @alessandro-ciffo found that the comparison hash changes every time you run a new script.
The following code :
frm eyecite import get_citations
cit_str = "482 S.E.2d 805"
cit = get_citations(cit_str)[0]
print(cit.comparison_hash())
Will print a new value every time you run it, because python adds randomness to the hash function (as stated here).
What do you think of using the hashlib library instead ? We could use something like
from hashlib import sha1
import json
def comparison_hash(self) :
tup = (str(type(self)), tuple(self.groups["volume"], self.corrected_reporter(), self.groups["page"]))
return sha1(json.dumps(tup).encode('utf-8'))
(Probably not a valid syntax but you get the idea)
Ah, interesting! It does seem to make sense to me that the hashes should be reproducible across runs. The hashlib
module (part of the standard library, right?) seems reasonable to me, but I would summon @mlissner back for this
By the way, what do you think of the corner cases of the corrected_citation ? Any chance that we get them fixed in the near future ?
You mean the FullLawCitation
corner cases with 5 U.S.C. §702
, 5 U.S.C. § 702
, 5 U.S.C. §§702
, 5 U.S.C. §702
, etc.? I agree that it does seem like those should all be normalized to a common string (rather than just returning the result of matched_text()
). The spacing issue could be fixed easily, but the double § symbol might be more complicated. We would have to make sure that we only normalized §§ to § in the case where there is indeed only one section listed (as in the example). In any case, the trick would be to override the corrected_citation()
method on the FullLawCitation
class here: https://github.com/freelawproject/eyecite/blob/main/eyecite/models.py#L280
Sure, using the hashlib seems fine to me. We have some simple wrappers for it in CL:
https://github.com/freelawproject/courtlistener/blob/main/cl/lib/crypto.py
I think I'd suggest SHA256 for this, though md5 would produce shorter hashes and be negligibly faster. It's less secure, but that doesn't matter here.
Hey, Thanks again for the library, we're using it everyday :)
We're currently trying to come up with a unique string representation for citations, one that would be unique for all citations being semantically the same, but would still be human-readable.
We've tried to use the
corrected_citation()
, but it fails in some cases, here are a few examples where the output is the same asmatched_text()
:5 U.S.C. §702
,5 U.S.C. § 702
,5 U.S.C. §§702
,5 U.S.C. §702
Note that thecorrected_citation_full
function also fails in mapping these examples to the same string.Because these limitations seem to be hard to fix, we gave up on the human-readable requirement and just want to have a unique representation, so exactly what
comparison_hash
does.This issue is about fixing a corner case that we spotted :
Here the corrected citation for the second string has the corrected reporter, but this was not propagated to the groups attribute, thus the comparison_hash is different. Ideally, I guess we would expect the reporter to be corrected before writing it in the groups.
I'm happy to open a PR if you think it makes sense