freelawproject / eyecite

Find legal citations in any block of text
https://freelawproject.github.io/eyecite/
BSD 2-Clause "Simplified" License

Issues 147, 154 - Improve citation object hashing behavior #155

Closed mattdahl closed 1 year ago

mattdahl commented 1 year ago

This PR does three things (each behavior is sketched in the snippet below):

1. Citations with normalized/corrected reporters (e.g., 1 U.S. 1 versus 1 U. S. 1) are now treated as equal (#147)
2. Citations with nominative reporters (e.g., 5 U.S. 137 versus 5 U.S. (1 Cranch) 137) are now treated as equal (#154)
3. Citation hashes are now reproducible/deterministic across runs (https://github.com/freelawproject/eyecite/issues/147#issuecomment-1488381874)
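
A quick sketch of the new behavior, assuming eyecite's top-level get_citations API and the equality/hashing semantics this PR introduces:

```python
from eyecite import get_citations

# 1) Normalized/corrected reporters compare equal (#147)
a = get_citations("1 U.S. 1")[0]
b = get_citations("1 U. S. 1")[0]
assert a == b and hash(a) == hash(b)

# 2) Nominative reporters compare equal to the official form (#154)
c = get_citations("5 U.S. 137")[0]
d = get_citations("5 U.S. (1 Cranch) 137")[0]
assert c == d and hash(c) == hash(d)
```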

In terms of implementation, here's what's changed:

If merged, this PR supersedes #148. Sorry again @overmode for the previous confusion; this turned out to be quite complicated. I have used some of your logic and tests in this PR, and it should achieve all the functionality you were looking for.

Feedback welcome!

mattdahl commented 1 year ago

(It looks like the benchmark action is not working again.)

mlissner commented 1 year ago

@flooie can you please help get this reviewed? @overmode, I think you might want to take a look too, even if you only glance, since I think it's fixing several bugs you reported.

overmode commented 1 year ago

Awesome, thanks for implementing the changes and notifying me.

It has been a while since I last looked at eyecite's source code, but the PR looks good to me. You might want to document, for each class, which keys appear in the groups dict, as that would make the comparison logic more transparent.

My colleague @khakhlyuk will also take a look; he is currently working with eyecite.

khakhlyuk commented 1 year ago

Great work, thank you for implementing this. Very useful and very much needed!

The code looks good to me.

The only issue I see is the use of 32 bits for hashes. I am working with a dataset that contains up to 10M unique citations. With a 32-bit hash and a dataset of this size, hash collisions are inevitable. Even for 100k citations, the probability of at least one hash collision is about 69%. I've calculated the probability here: https://kevingal.com/apps/collision.html
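
For reference, the 69% figure matches the standard birthday-bound approximation; a minimal Python check (the formula is the usual approximation, not anything from the PR):

```python
import math

def collision_probability(n_items: int, hash_bits: int) -> float:
    """Birthday-bound approximation of P(at least one collision)."""
    space = 2.0 ** hash_bits
    return 1.0 - math.exp(-n_items * (n_items - 1) / (2.0 * space))

print(collision_probability(100_000, 32))     # ~0.69
print(collision_probability(10_000_000, 32))  # ~1.0 (collisions inevitable)
print(collision_probability(10_000_000, 64))  # ~2.7e-06
```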

Is Python's default 32-bit hash() output a limitation? I have several thoughts on this:

1. hash() returns 64-bit ints on 64-bit platforms anyway.
2. I have tried using larger ints as the output of a hash function, and it works. I have experimented with c_int64 and even with using the 256-bit hash directly: int.from_bytes(hashlib.sha256(json_bytes).digest(), byteorder="big"). I have used Citation objects with 64-bit and 256-bit hashes as keys for dicts and sets, and everything works fine (see the sketch below).
3. AFAIK, when creating sets or dicts, Python will apply a modulo to the hash value. That's why it does not make a very big difference whether the hash value is 32-bit, 64-bit, or even 256-bit.

Is there another reason to limit the output to a 32-bit int? If not, it would be really nice to change the hash to 64-bit ints, in my opinion.
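
A minimal sketch of the wide-hash experiment described in point 2, using a hypothetical FakeCitation class (illustrative only, not eyecite's actual Citation):

```python
import hashlib
import json

class FakeCitation:
    """Hypothetical stand-in for a citation object (not eyecite's class)."""

    def __init__(self, groups: dict):
        self.groups = groups

    def __eq__(self, other):
        return isinstance(other, FakeCitation) and self.groups == other.groups

    def __hash__(self) -> int:
        json_bytes = json.dumps(self.groups, sort_keys=True).encode()
        # 256-bit digest as an int; Python ints are arbitrary precision,
        # so returning a value this large from __hash__ is legal
        return int.from_bytes(hashlib.sha256(json_bytes).digest(), byteorder="big")

a = FakeCitation({"volume": "1", "reporter": "U.S.", "page": "1"})
b = FakeCitation({"volume": "1", "reporter": "U.S.", "page": "1"})
assert a == b
assert len({a, b}) == 1  # works fine as a set member / dict key
```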

mattdahl commented 1 year ago
> 3. AFAIK, when creating sets or dicts, Python will apply a modulo to the hash value. That's why it does not make a very big difference whether the hash value is 32-bit, 64-bit, or even 256-bit.

Ah, this is interesting! I did not realize this, but for numeric types, you are correct that it just performs a modulo reduction: https://github.com/python/cpython/blob/main/Python/pyhash.c#L34

In that case, I will remove the truncation logic and let hash() do its own thing. The only problem I foresee with this approach is that the hashes will not be the same across 32- and 64-bit machines (because of the way the modulo is implemented), but I don't think that's a serious concern.
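
A quick check of that modulo behavior, assuming a 64-bit CPython build (where _PyHASH_MODULUS is 2**61 - 1; on 32-bit builds it is 2**31 - 1, which is why the hashes differ across platforms):

```python
M = 2**61 - 1  # _PyHASH_MODULUS on 64-bit CPython
n = 12345678901234567890123456789
assert hash(n) == n % M  # integer hashes are just a modulo reduction
```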

Thanks for pointing this out.

mattdahl commented 1 year ago

All right, 0a30559 removes the truncation, exploiting the fact that calling hash() twice on an arbitrarily large integer (e.g., hash(hash(123))) guarantees reproducibility (because of that modulo trick). Let me know what you think! (This means that the hashes should be 64-bit on 64-bit machines.)
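
A small check of why the re-hashing trick guarantees reproducibility (again assuming 64-bit CPython):

```python
# hash() of an int is already reduced below 2**61 - 1, so hashing the
# result again returns it unchanged, however large the original was.
digest_int = int.from_bytes(b"any deterministic digest bytes", "big")
h = hash(digest_int)
assert hash(h) == h        # fixed point: stable across repeated hashing
assert 0 <= h < 2**61 - 1  # the stored value fits in a 64-bit int
```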

mattdahl commented 1 year ago

@overmode I agree that adding documentation of the expected group keys would be good, I'll get to that soon.

mlissner commented 1 year ago

Thanks all. @flooie final review rests with you, but if the other folks on this thread want to take another look at the latest commits first, I think that'd be great. Thank you all!

khakhlyuk commented 1 year ago

I like the trick with applying an additional hash(). Good solution! LGTM

mattdahl commented 1 year ago

I added some documentation spelling out the expected content of self.groups for the different citation classes, as suggested by @overmode. As I was doing this, however, I realized that the volume key is not guaranteed to exist for CaseCitation objects. (The reporter and page keys are, or at least they should be, because of a test we have in reporters_db. This is also reflected in the new documentation.) Thus, my previous code would crash for, e.g., certain tax court citations. I've fixed that bug now too.
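
A hypothetical sketch of the fix described above (names are illustrative, not eyecite's internals): treat volume as optional when building a comparable key from the groups dict.

```python
def comparable_groups(groups: dict) -> tuple:
    """Build a hashable key from a citation's groups dict (illustrative)."""
    return (
        groups.get("volume"),  # may be absent, e.g. some tax court citations
        groups["reporter"],    # guaranteed by reporters_db's tests
        groups["page"],        # likewise guaranteed
    )
```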

mattdahl commented 1 year ago

Comrade @flooie, just wanted to bump this if you have a chance to review and merge.