apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.64k stars 1.03k forks source link

Improve hash mixing in FST's double-barrel LRU hash #12704

Open mikemccand opened 1 year ago

mikemccand commented 1 year ago

Description

Spinoff from this cool comment, thanks to hashing guru @bruno-roustant:

Instead, we should multiply with the gold constant BitMixer#PHI_C64 (make it public).
This really makes a difference in the evenness of the value distribution. This is one of the secrets of the HPPC hashing. By applying this, we get multiple advantages:

  * lookup should be improved (less hash collision)
  * we can try to rehash at 3/4 occupancy because the performance should not be impacted until this point.
  * in case of hash collision, we can lookup linearly with a pos = pos + 1 instead of quadratic probe (lines 95 and 327); this may avoid some mem cache miss.
  * 
(same for the other hash method)

This is a simple change, we just need to test on some real FST building cases to confirm good mixing "in practice". The new IndexToFST tool in luceneutil is helpful for this.

bruno-roustant commented 1 year ago

@dweiss will probably say more than me about the awesome BitMixer#PHI_C64 constant!

dweiss commented 1 year ago

I borrowed that constant in BitMixer from Sebastiano Vigna, I believe. Here is a nice overview of its origin/ rationale:

https://softwareengineering.stackexchange.com/questions/402542/where-do-magic-hashing-constants-like-0x9e3779b9-and-0x9e3779b1-come-from

I can only confirm that a good hash redistribution function, along with linear probing, give very good results in most hash/index redistribution problems I've seen.

shubhamvishu commented 12 months ago

I opened a PR to make use of this constant : #12716

Also, I was thinking if this constant could be utilised in other hash function implementations as well in the codebase?

dweiss commented 12 months ago

If you'd like to do so, I'd suggest moving such a "scattering remix" utility to a separate class and reusing it elsewhere, much like here: https://github.com/carrotsearch/hppc/blob/master/hppc/src/main/java/com/carrotsearch/hppc/BitMixer.java#L96-L99

Makes it easier to change the remixing strategy everywhere at once.