freelawproject / eyecite

Find legal citations in any block of text
https://freelawproject.github.io/eyecite/
BSD 2-Clause "Simplified" License

Issues 147, 154 - Improve citation object hashing behavior #155

Closed mattdahl closed 1 year ago

mattdahl commented 1 year ago

This PR does three things (each behavior is sketched in the snippet below):

1. Citations with normalized/corrected reporters (e.g., 1 U.S. 1 versus 1 U. S. 1) are now treated as equal (#147)
2. Citations with nominative reporters (e.g., 5 U.S. 137 versus 5 U.S. (1 Cranch) 137) are now treated as equal (#154)
3. Citation hashes are now reproducible/deterministic across runs (https://github.com/freelawproject/eyecite/issues/147#issuecomment-1488381874)
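
A quick sketch of the new behavior, assuming eyecite's top-level get_citations API and the equality/hashing semantics this PR introduces:

```python
from eyecite import get_citations

# 1) Normalized/corrected reporters compare equal (#147)
a = get_citations("1 U.S. 1")[0]
b = get_citations("1 U. S. 1")[0]
assert a == b and hash(a) == hash(b)

# 2) Nominative reporters compare equal to the official form (#154)
c = get_citations("5 U.S. 137")[0]
d = get_citations("5 U.S. (1 Cranch) 137")[0]
assert c == d and hash(c) == hash(d)
```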

In terms of implementation, here's what's changed:

If merged, this PR supersedes #148. Sorry again @overmode for the previous confusion; this turned out to be quite complicated. I have used some of your logic and tests in this PR, and it should achieve all the functionality you were looking for.

Feedback welcome!

mattdahl commented 1 year ago

(It looks like the benchmark action is not working again.)

mlissner commented 1 year ago

@flooie can you please help get this reviewed? @overmode, I think you might want to take a look too, even if you only glance, since I think it's fixing several bugs you reported.

overmode commented 1 year ago

Awesome, thanks for implementing the changes and notifying me.

It has been a while since I last looked at eyecite's source code, but the PR looks good to me. You might want to document, for each class, which keys appear in the groups dict, as that would make the comparison logic more transparent.

My colleague @khakhlyuk will also take a look; he is currently working with eyecite.

khakhlyuk commented 1 year ago

Great work, thank you for implementing this. Very useful and very much needed!

The code looks good to me.

The only issue I see is the use of 32 bits for hashes. I am working with a dataset that contains up to 10M unique citations. With a 32-bit hash and a dataset of this size, hash collisions are inevitable. Even for 100k citations, the probability of at least one hash collision is about 69%. I've calculated the probability here: https://kevingal.com/apps/collision.html
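
For reference, the 69% figure matches the standard birthday-bound approximation; a minimal Python check (the formula is the usual approximation, not anything from the PR):

```python
import math

def collision_probability(n_items: int, hash_bits: int) -> float:
    """Birthday-bound approximation of P(at least one collision)."""
    space = 2.0 ** hash_bits
    return 1.0 - math.exp(-n_items * (n_items - 1) / (2.0 * space))

print(collision_probability(100_000, 32))     # ~0.69
print(collision_probability(10_000_000, 32))  # ~1.0 (collisions inevitable)
print(collision_probability(10_000_000, 64))  # ~2.7e-06
```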

Is Python's default 32-bit hash() output a limitation? I have several thoughts on this:

1. hash() returns 64-bit ints on 64-bit platforms anyway.
2. I have tried using larger ints as the output of a hash function, and it works. I have experimented with c_int64 and even with using the 256-bit hash directly: int.from_bytes(hashlib.sha256(json_bytes).digest(), byteorder="big"). I have used Citation objects with 64-bit and 256-bit hashes as keys for dicts and sets, and everything works fine (see the sketch below).
3. AFAIK, when creating sets or dicts, Python will apply a modulo to the hash value. That's why it does not make a very big difference whether the hash value is 32-bit, 64-bit, or even 256-bit.

Is there another reason to limit the output to a 32-bit int? If not, it would be really nice to change the hash to 64-bit ints, in my opinion.
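
A minimal sketch of the wide-hash experiment described in point 2, using a hypothetical FakeCitation class (illustrative only, not eyecite's actual Citation):

```python
import hashlib
import json

class FakeCitation:
    """Hypothetical stand-in for a citation object (not eyecite's class)."""

    def __init__(self, groups: dict):
        self.groups = groups

    def __eq__(self, other):
        return isinstance(other, FakeCitation) and self.groups == other.groups

    def __hash__(self) -> int:
        json_bytes = json.dumps(self.groups, sort_keys=True).encode()
        # 256-bit digest as an int; Python ints are arbitrary precision,
        # so returning a value this large from __hash__ is legal
        return int.from_bytes(hashlib.sha256(json_bytes).digest(), byteorder="big")

a = FakeCitation({"volume": "1", "reporter": "U.S.", "page": "1"})
b = FakeCitation({"volume": "1", "reporter": "U.S.", "page": "1"})
assert a == b
assert len({a, b}) == 1  # works fine as a set member / dict key
```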

mattdahl commented 1 year ago
> 3. AFAIK, when creating sets or dicts, Python will apply a modulo to the hash value. That's why it does not make a very big difference whether the hash value is 32-bit, 64-bit, or even 256-bit.

Ah, this is interesting! I did not realize this, but for numeric types, you are correct that it just performs a modulo reduction: https://github.com/python/cpython/blob/main/Python/pyhash.c#L34

In that case, I will remove the truncation logic and let hash() do its own thing. The only problem I foresee with this approach is that the hashes will not be the same across 32- and 64-bit machines (because of the way the modulo is implemented), but I don't think that's a serious concern.
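
A quick check of that modulo behavior, assuming a 64-bit CPython build (where _PyHASH_MODULUS is 2**61 - 1; on 32-bit builds it is 2**31 - 1, which is why the hashes differ across platforms):

```python
M = 2**61 - 1  # _PyHASH_MODULUS on 64-bit CPython
n = 12345678901234567890123456789
assert hash(n) == n % M  # integer hashes are just a modulo reduction
```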

Thanks for pointing this out.

mattdahl commented 1 year ago

All right, 0a30559 removes the truncation, exploiting the fact that calling hash() twice on an arbitrarily large integer (e.g., hash(hash(123))) guarantees reproducibility (because of that modulo trick). Let me know what you think! (This means that the hashes should be 64-bit on 64-bit machines.)
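
A small check of why the re-hashing trick guarantees reproducibility (again assuming 64-bit CPython):

```python
# hash() of an int is already reduced below 2**61 - 1, so hashing the
# result again returns it unchanged, however large the original was.
digest_int = int.from_bytes(b"any deterministic digest bytes", "big")
h = hash(digest_int)
assert hash(h) == h        # fixed point: stable across repeated hashing
assert 0 <= h < 2**61 - 1  # the stored value fits in a 64-bit int
```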

mattdahl commented 1 year ago

@overmode I agree that adding documentation of the expected group keys would be good, I'll get to that soon.

mlissner commented 1 year ago

Thanks all. @flooie final review rests with you, but if the other folks on this thread want to take another look at the latest commits first, I think that'd be great. Thank you all!

khakhlyuk commented 1 year ago

I like the trick with applying an additional hash(). Good solution! LGTM

mattdahl commented 1 year ago

I added some documentation spelling out the expected content of self.groups for the different citation classes, as suggested by @overmode. As I was doing this, however, I realized that the volume key is not guaranteed to exist for CaseCitation objects. (The reporter and page keys are, or at least they should be, because of a test we have in reporters_db. This is also reflected in the new documentation.) Thus, my previous code would crash for, e.g., certain tax court citations. I've fixed that bug now too.
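
A hypothetical sketch of the fix described above (names are illustrative, not eyecite's internals): treat volume as optional when building a comparable key from the groups dict.

```python
def comparable_groups(groups: dict) -> tuple:
    """Build a hashable key from a citation's groups dict (illustrative)."""
    return (
        groups.get("volume"),  # may be absent, e.g. some tax court citations
        groups["reporter"],    # guaranteed by reporters_db's tests
        groups["page"],        # likewise guaranteed
    )
```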

mattdahl commented 1 year ago

Comrade @flooie, just wanted to bump this if you have a chance to review and merge.