Closed GoogleCodeExporter closed 9 years ago
I like this idea, but I took a look at CityHash a while back and the
implementation is pretty hairy (they optimized for speed over simplicity).
Surprisingly we haven't had anyone clamoring for this internally yet (I guess
most people just use the C++ impl), so this will be pretty low priority for us.
Original comment by kak@google.com
on 30 Jan 2012 at 5:13
This is a clear case where we would happily accept a patch if someone's willing
to go through the (apparently immense) pain of writing and testing it.
Original comment by kevinb@google.com
on 30 Jan 2012 at 6:41
I'd do it if I had some test vectors, but the C++ test provided is somewhat
undecipherable.
Original comment by emily@soldal.org
on 30 Jan 2012 at 6:43
I suppose we could get some ourselves by just running the C++ version?
Original comment by wasserman.louis
on 31 Jan 2012 at 11:55
I have a patch that's passing the same tests the C++ implementation provides.
It doesn't include the CRC hashes, but it includes seeded and unseeded 64 and
128-bit hashes.
It's a very direct translation, I'm afraid.
Original comment by wasserman.louis
on 9 Apr 2012 at 11:08
Fancy! Maybe we can get Geoff Pike or Jyrki Alakuijala to review it for us
internally...once you've got it cleaned up, mind sending it our way?
Original comment by kak@google.com
on 10 Apr 2012 at 1:13
I don't have much hope that I can actually significantly simplify the
implementation -- being a hash function, it's pretty deliberately chaotic in
many ways -- but I am making at least some attempt to optimize the
implementation, given that Java doesn't exactly have the direct memory access
advantages of C++, and part of the whole point of CityHash was speed.
In particular, I'm borrowing the approach from
UnsignedBytes.lexicographicalComparator() to get unsafe memory access with
sun.misc.Unsafe if possible, and falling back to a pure-Java implementation
otherwise. On my machine, that approximately triples the speed of the
implementation.
I think I'm about ready to send in a CL. Who should I be sending it to?
Original comment by wasserman.louis
on 10 Apr 2012 at 2:23
You can send it to me :-) I'll migrate it into google3 and get the right eyes
on it.
Regarding sun.misc.Unsafe, that's exactly the approach we use internally for
one of our uber-fast fingerprinting functions...so that sounds great :-)
Original comment by kak@google.com
on 10 Apr 2012 at 2:41
Update: The change is complete and is being migrated internally.
Preliminary indications are that the CityHash64 implementation is ~2x as fast
as the next-fastest hash function provided in Guava -- murmur3_32 -- although
since it isn't a "streaming" hash function, it can't be used for goodFastHash.
Additionally, CityHash128 is ~3-4x faster than murmur3_128.
I believe Kurt is trying to get the original authors of CityHash to take a look
at the implementation before we export it, but these results are pretty cool!
Original comment by wasserman.louis
on 29 Apr 2012 at 5:42
OK, so there's a small issue that's stalling the review internally: versioning
The CityHash algorithm isn't frozen...meaning it might change in the future.
Currently they're on v1.0.3 (externally) and have been working towards v1.0.4
internally.
Right now the proposal is to add these methods to Hashing:
public static HashFunction cityHash103_64()
public static HashFunction cityHash103_64(long seed)
public static HashFunction cityHash103_64(long seed0, long seed1)
public static HashFunction cityHash103_128()
public static HashFunction cityHash103_128(long seed0, long seed1)
However, when CityHash v1.0.4 comes out, we'll need to add *another* *FIVE*
methods to the Hashing API (and we probably can't even deprecate the existing
ones since people might have persisted them somewhere!). The algorithm itself
is unlikely to change much, but there might be small bug fixes (fix some corner
cases or something) that will mean a whole new minor version number. This whole
situation kinda stinks :-/
Any thoughts Louis?
Original comment by kurt.kluever
on 8 May 2012 at 8:21
BTW, Geoff Pike said: "I have changes in hand I want to get out there, and that
should happen this month on the opensource side. After that, changes will be
driven by bug reports, pretty much."
So I think our plan is to hold off with the Java impl until 1.0.4 is open
sourced, then get this submitted, and then deal with adding 1.0.5 if and when
that gets released.
Original comment by kurt.kluever
on 8 May 2012 at 8:48
I almost want to suggest an enum or something? Blegh.
Alternately, we could just state in the documentation that it might change
incompatibly in the future, and that currently it's consistent with CityHash
1.0.3. Something along the lines of the documentation in goodFastHash.
Original comment by wasserman.louis
on 8 May 2012 at 9:34
Nobody would be able to persist the hash codes then...which isn't good. Let's
table this for a while until 1.0.4 is out and see if anybody comes up with any
better ideas in the meantime.
Original comment by kurt.kluever
on 8 May 2012 at 10:05
Original comment by kevinb@google.com
on 30 May 2012 at 7:43
CityHash 1.1 was released on Oct 22, presumably should be pretty stable by
now...
Original comment by phraktle
on 10 Dec 2012 at 11:45
Does SIPHash also have a chance of being included?
The top two google search results for "siphash java" both contain the same bug
(they sign-extend input bytes when they shouldn't.) Having it in a more
popular library such as Guava might save some developers from being caught by
this.
Original comment by fin...@gmail.com
on 15 Dec 2012 at 3:35
@finnw1: can you please file a separate issue for SIPHash with some relevant
links? Thanks :-)
Original comment by kak@google.com
on 15 Dec 2012 at 11:50
This issue has been migrated to GitHub.
It can be found at https://github.com/google/guava/issues/<id>
Original comment by cgdecker@google.com
on 1 Nov 2014 at 4:14
Original comment by cgdecker@google.com
on 1 Nov 2014 at 4:18
Original comment by cgdecker@google.com
on 3 Nov 2014 at 9:08
Original issue reported on code.google.com by
emily@soldal.org
on 30 Jan 2012 at 11:46