dnrajugade / guava-libraries

Automatically exported from code.google.com/p/guava-libraries
Apache License 2.0
0 stars 0 forks source link

Request: CityHash as a hasher option #889

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Could we get CityHash as an option for the hasher used in Bloomfilters and 
other places: 

http://google-opensource.blogspot.com/2011/04/introducing-cityhash.html
http://code.google.com/p/cityhash/

Original issue reported on code.google.com by emily@soldal.org on 30 Jan 2012 at 11:46

GoogleCodeExporter commented 9 years ago
I like this idea, but I took a look at CityHash a while back and the 
implementation is pretty hairy (they optimized for speed over simplicity).

Surprisingly we haven't had anyone clamoring for this internally yet (I guess 
most people just use the C++ impl), so this will be pretty low priority for us.

Original comment by kak@google.com on 30 Jan 2012 at 5:13

GoogleCodeExporter commented 9 years ago
This is a clear case where we would happily accept a patch if someone's willing 
to go through the (apparently immense) pain of writing and testing it.

Original comment by kevinb@google.com on 30 Jan 2012 at 6:41

GoogleCodeExporter commented 9 years ago
I'd do it if I had some test vectors, but the C++ test provided is somewhat 
undecipherable.

Original comment by emily@soldal.org on 30 Jan 2012 at 6:43

GoogleCodeExporter commented 9 years ago
I suppose we could get some ourselves by just running the C++ version?

Original comment by wasserman.louis on 31 Jan 2012 at 11:55

GoogleCodeExporter commented 9 years ago
I have a patch that's passing the same tests the C++ implementation provides.  
It doesn't include the CRC hashes, but it includes seeded and unseeded 64 and 
128-bit hashes.

It's a very direct translation, I'm afraid.

Original comment by wasserman.louis on 9 Apr 2012 at 11:08

GoogleCodeExporter commented 9 years ago
Fancy! Maybe we can get Geoff Pike or Jyrki Alakuijala to review it for us 
internally...once you've got it cleaned up, mind sending it our way?

Original comment by kak@google.com on 10 Apr 2012 at 1:13

GoogleCodeExporter commented 9 years ago
I don't have much hope that I can actually significantly simplify the 
implementation -- being a hash function, it's pretty deliberately chaotic in 
many ways -- but I am making at least some attempt to optimize the 
implementation, given that Java doesn't exactly have the direct memory access 
advantages of C++, and part of the whole point of CityHash was speed.

In particular, I'm borrowing the approach from 
UnsignedBytes.lexicographicalComparator() to get unsafe memory access with 
sun.misc.Unsafe if possible, and falling back to a pure-Java implementation 
otherwise.  On my machine, that approximately triples the speed of the 
implementation.

I think I'm about ready to send in a CL.  Who should I be sending it to?

Original comment by wasserman.louis on 10 Apr 2012 at 2:23

GoogleCodeExporter commented 9 years ago
You can send it to me :-)  I'll migrate it into google3 and get the right eyes 
on it.

Regarding sun.misc.Unsafe, that's exactly the approach we use internally for 
one of our uber-fast fingerprinting functions...so that sounds great :-)

Original comment by kak@google.com on 10 Apr 2012 at 2:41

GoogleCodeExporter commented 9 years ago
Update: The change is complete and is being migrated internally.

Preliminary indications are that the CityHash64 implementation is ~2x as fast 
as the next-fastest hash function provided in Guava -- murmur3_32 -- although 
since it isn't a "streaming" hash function, it can't be used for goodFastHash.  
Additionally, CityHash128 is ~3-4x faster than murmur3_128.

I believe Kurt is trying to get the original authors of CityHash to take a look 
at the implementation before we export it, but these results are pretty cool!

Original comment by wasserman.louis on 29 Apr 2012 at 5:42

GoogleCodeExporter commented 9 years ago
OK, so there's a small issue that's stalling the review internally: versioning

The CityHash algorithm isn't frozen...meaning it might change in the future.  
Currently they're on v1.0.3 (externally) and have been working towards v1.0.4 
internally.

Right now the proposal is to add these methods to Hashing:
public static HashFunction cityHash103_64()
public static HashFunction cityHash103_64(long seed)
public static HashFunction cityHash103_64(long seed0, long seed1)
public static HashFunction cityHash103_128()
public static HashFunction cityHash103_128(long seed0, long seed1)

However, when CityHash v1.0.4 comes out, we'll need to add *another* *FIVE* 
methods to the Hashing API (and we probably can't even deprecate the existing 
ones since people might have persisted them somewhere!). The algorithm itself 
is unlikely to change much, but there might be small bug fixes (fix some corner 
cases or something) that will mean a whole new minor version number. This whole 
situation kinda stinks :-/

Any thoughts Louis?

Original comment by kurt.kluever on 8 May 2012 at 8:21

GoogleCodeExporter commented 9 years ago
BTW, Geoff Pike said: "I have changes in hand I want to get out there, and that 
should happen this month on the opensource side. After that, changes will be 
driven by bug reports, pretty much."

So I think our plan is to hold off with the Java impl until 1.0.4 is open 
sourced, then get this submitted, and then deal with adding 1.0.5 if and when 
that gets released.

Original comment by kurt.kluever on 8 May 2012 at 8:48

GoogleCodeExporter commented 9 years ago
I almost want to suggest an enum or something?  Blegh.

Alternately, we could just state in the documentation that it might change 
incompatibly in the future, and that currently it's consistent with CityHash 
1.0.3.  Something along the lines of the documentation in goodFastHash.

Original comment by wasserman.louis on 8 May 2012 at 9:34

GoogleCodeExporter commented 9 years ago
Nobody would be able to persist the hash codes then...which isn't good. Let's 
table this for a while until 1.0.4 is out and see if anybody comes up with any 
better ideas in the meantime.

Original comment by kurt.kluever on 8 May 2012 at 10:05

GoogleCodeExporter commented 9 years ago

Original comment by kevinb@google.com on 30 May 2012 at 7:43

GoogleCodeExporter commented 9 years ago
CityHash 1.1 was released on Oct 22, presumably should be pretty stable by 
now...

Original comment by phraktle on 10 Dec 2012 at 11:45

GoogleCodeExporter commented 9 years ago
Does SIPHash also have a chance of being included?

The top two google search results for "siphash java" both contain the same bug 
(they sign-extend input bytes when they shouldn't.)  Having it in a more 
popular library such as Guava might save some developers from being caught by 
this.

Original comment by fin...@gmail.com on 15 Dec 2012 at 3:35

GoogleCodeExporter commented 9 years ago
@finnw1: can you please file a separate issue for SIPHash with some relevant 
links? Thanks :-)

Original comment by kak@google.com on 15 Dec 2012 at 11:50

GoogleCodeExporter commented 9 years ago
This issue has been migrated to GitHub.

It can be found at https://github.com/google/guava/issues/<id>

Original comment by cgdecker@google.com on 1 Nov 2014 at 4:14

GoogleCodeExporter commented 9 years ago

Original comment by cgdecker@google.com on 1 Nov 2014 at 4:18

GoogleCodeExporter commented 9 years ago

Original comment by cgdecker@google.com on 3 Nov 2014 at 9:08