Closed mattdahl closed 1 year ago
(It looks like the benchmark action is not working again.)
@flooie can you please help get this reviewed? @overmode, I think you might want to take a look too, even if you only glance, since I think it's fixing several bugs you reported.
Awesome, thanks for implementing the changes and notifying me.
It has been a while since last time I took a look at eyecite's source code but the PR looks good to me. You might want to document, for each class, what keys are in the groups, as it would make the comparison more transparent.
My colleague @khakhlyuk will also take a look, he is currently working with eyecite.
great work, thank you for implementing this, very useful and very much needed!
The code looks good to me.
The only issue I see is using 32bits for hashes. I am working with a dataset that contains up to 10M unique citations. With a 32bit hash and a dataset of this size hash collisions are inevitable. Even for 100k citations, the probability of a single hash collision is 69%. I've calculated the probability here: https://kevingal.com/apps/collision.html
Is python's default hash()
output of 32bits a limitation? I have several thoughts on this.
1) hash()
returns 64-bit ints on 64-bit platforms anyway
2) i have tried using larger ints as output of a hash function and it works. I have experimented with c_int64
and even using the 256-bit hash directly - int.from_bytes(hashlib.sha256(json_bytes).digest(), byteorder="big")
. I have used Citation objects with 64-bit and 256-bit hash as keys for dict and set, everything work fine.
3) afaik when creating sets or dicts, python will apply modulo to the hash value. That's why it does not make a very big difference if the hash value is 32bit, 64bit or even 256 bit.
Is there another reason to limit output to a 32-bit int? If not, it would be really nice to change the hash to 64bit ints in my opinion.
3. afaik when creating sets or dicts, python will apply modulo to the hash value. That's why it does not make a very big difference if the hash value is 32bit, 64bit or even 256 bit.
Ah, this is interesting! I did not realize, but for numeric types, you are correct that it just performs a modulo reduction: https://github.com/python/cpython/blob/main/Python/pyhash.c#L34
In that case, I will remove the truncation logic and let hash()
do its own thing. The only problem I foresee with this approach is that the hashes will not be the same across 32 and 64 bit machines (because of the way the modulo is implemented), but I don't think that's a serious concern.
Thanks for pointing this out.
All right, 0a30559 removes the truncation, exploiting the fact that calling hash()
twice on an arbitrarily large integer (e.g., hash(hash(123))
) guarantees reproducibility (because of that modulo trick). Let me know what you think! (This means that the hashes should be 64-bit on 64-bit machines.)
@overmode I agree that adding documentation of the expected group keys would be good, I'll get to that soon.
Thanks all. @flooie final review rests with you, but if the other folks on this thread want to take another look at the latest commits first, I think that'd be great. Thank you all!
i like the trick with applying an additional hash(). Good solution! LGTM
I added some documentation spelling out the expected content of self.groups
for the different citation classes, as suggested by @overmode . As I was doing this, however, I realized that the volume
key is not guaranteed to exist for CaseCitation
objects. (The reporter
and page
keys are, at least they should be, because of a test we have in reporters_db
. This is also reflected in the new documentation.) Thus, my previous code would crash for e.g. certain tax court citations. I've fixed that bug now too.
Comrade @flooie, just wanted to bump this if you have a chance to review and merge.
This PR does three things: 1) Citations with normalized/corrected reporters (e.g.,
1 U.S. 1
versus1 U. S. 1
) are now treated as equal (#147) 2) Citations with nominative reporters (e.g.,5 U.S. 137
versus5 U.S. (1 Cranch) 137
are now treated as equal (#154) 3) Citation hashes are now reproducible/deterministic across runs (https://github.com/freelawproject/eyecite/issues/147#issuecomment-1488381874)In terms of implementation, here's what's changed:
comparison_hash()
method that a few classes had has been removed entirely. This was originally intended to essentially be a private method forResource
objects to call, but the name was confusing and could have been misinterpreted by a causal user as a public method that should be used for general comparisons. Going forward, anyone should simply use the built-inhash()
method for everything, which is the most natural approach anyway.CitationBase
classes and their subclasses now all use the@dataclass(eq=False, unsafe_hash=False)
decorator. Previously botheq
andunsafe_hash
were set toTrue
. As the keyword argument implies, this was somewhat... unsafe. Or at least confusing. See here for a detailed discussion explaining how dataclass objects automatically try to create hash functions when these arguments areTrue
. I have changed this so none of this implicit stuff is happening anymore; instead, we explicitly define__eq__()
and__hash__()
ourselves on theCitationBase
class, which creates a clear record of how the hashing is actually happening. (These methods are then inherited by the citation subclasses, thoughResourceCitation
andCaseCitation
classes override the__hash__()
method in order to maintain conformity with existing behavior for those special cases.hashlib.sha256
. However, the built-inhash()
function will only output hashes that are 32 bits long, so I truncate the sha256 hash usingc_int32()
from thectypes
standard library. By doing this truncation ourselves, we ensure that callinghash(obj)
andobj.__hash__()
both return the same hash (for a given object).IdCitation
andUnknownCitation
objects are sui generis. Since we don't really know anything about what these objects point to, I think it would be dangerous to ever hash two of them the same.If merged, this PR supersedes #148. Sorry again @overmode for the previous confusion, this turned out to be quite complicated. I have used some of your logic and tests in this PR, and it should achieve all the functionality you were looking for.
Feedback welcome!