Hashing a CharSequence directly

OpenHFT / Zero-Allocation-Hashing

Zero-allocation hashing for Java

Apache License 2.0


Hashing a CharSequence directly #62

Status: Closed (talwgx closed 3 years ago)

talwgx commented 3 years ago

Hello! I'd like to be able to hash parts of a long string without having to copy them into a separate StringBuilder / String etc. To do this, I'd like to implement CharSequence to present the chars I want to hash.

The public methods of LongHashFunction only accept String and StringBuilder, which are final. However, there is a package-private method, long hashNativeChars(CharSequence input), in LongHashFunction.java that would provide this. Is there a reason it cannot be made public so that other packages can access it (without, of course, resorting to reaching it through the package-private scope)? That would be really great. Thanks!

gzm55 commented 3 years ago

https://github.com/OpenHFT/Zero-Allocation-Hashing/blob/master/src/main/java/net/openhft/hashing/LongHashFunction.java#L606

hashChars(String,int,int) should work for your case.

talwgx commented 3 years ago

Thanks for the feedback. Sadly, it's not one section that I need to hash, but a number of sections from a long string. I'd really like to avoid having to copy them to a separate StringBuilder etc., as that would defeat the zero-allocation goal. I could implement a ByteBuffer that produces the right set of longs from the string, but that seems like a crude / convoluted solution compared to a CharSequence impl, which I use frequently for this case. Given the lib already supports CharSequences, I was wondering if there is a drawback to allowing them to be passed? Guava hashing accepts CharSequences, but as we know it basically copies them to byte[], which I definitely don't want to do! :) Cheers.
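For readers following along, the kind of CharSequence implementation described above might look like the sketch below. It assumes the sections are given as start/end offset pairs into one backing string. The class name SliceView and its constructor shape are hypothetical, and it uses only the JDK, not anything from the library. No chars are copied, although the view object and its two small int arrays are still allocated once per view.

```java
import java.util.Objects;

/**
 * Hypothetical read-only view over several [start, end) ranges of one
 * backing CharSequence. The ranges appear concatenated; no chars are copied.
 */
final class SliceView implements CharSequence {
    private final CharSequence source;
    private final int[] starts;   // start offset of each slice in source
    private final int[] offsets;  // offsets[i] = view index where slice i begins; last entry = total length

    /** bounds is a flat list of pairs: start0, end0, start1, end1, ... */
    SliceView(CharSequence source, int... bounds) {
        if (bounds.length % 2 != 0) {
            throw new IllegalArgumentException("bounds must be start/end pairs");
        }
        this.source = Objects.requireNonNull(source);
        int n = bounds.length / 2;
        starts = new int[n];
        offsets = new int[n + 1];
        for (int i = 0; i < n; i++) {
            int start = bounds[2 * i];
            int end = bounds[2 * i + 1];
            if (start < 0 || end < start || end > source.length()) {
                throw new IndexOutOfBoundsException("bad slice [" + start + ", " + end + ")");
            }
            starts[i] = start;
            offsets[i + 1] = offsets[i] + (end - start);
        }
    }

    @Override
    public int length() {
        return offsets[offsets.length - 1];
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        // Linear scan over slices; adequate for a handful of sections.
        int i = 0;
        while (index >= offsets[i + 1]) {
            i++;
        }
        return source.charAt(starts[i] + (index - offsets[i]));
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        // This method does copy, but only when a caller explicitly asks for it.
        return new StringBuilder().append(this, start, end);
    }

    @Override
    public String toString() {
        return new StringBuilder(length()).append(this, 0, length()).toString();
    }
}
```

For the string "__AAA__BBB__", new SliceView(s, 2, 5, 7, 10) presents the six chars "AAABBB" through length() and charAt(int). Whether a given hash function reads such a view efficiently depends on how it accesses the chars, which is the point the rest of the thread discusses.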

gzm55 commented 3 years ago

The library selects a faster method to read ONE slice of a string: it tries its best to use the string's internal array, and falls back to the CharSequence API.

If you need to hash multiple slices in a single hash calculation, there seems to be no API that directly supports your case. Besides a ByteBuffer, you could also implement a special CharSequenceAccess.

If you need to hash the slices one by one, the former API should work as you wish.

talwgx commented 3 years ago

It seems I could use CharSequenceAccess as is, but it too is package-private. If either it or hashNativeChars(CharSequence input) in LongHashFunction were made public, I could easily leverage either for a clean solution. If making them public isn't an option, then it sounds like my best bet is to consume them by declaring an accessor class within the net.openhft.hashing package, which feels a bit "hacky", but if it is what it is, then that's that :)

gzm55 commented 3 years ago

It could be OK to make CharSequenceAccess public.

@talwgx for a string "__AAA__BBB__", which is your case?

a) h = hash("AAABBB");
b) h1 = hash("AAA"); h2 = hash("BBB");
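To make the two options concrete, here is a minimal sketch that uses a simple char-by-char FNV-1a hash as a stand-in. This is not one of the library's algorithms, and the class and method names (SliceHashDemo, update, hash) are hypothetical. It only shows that option a) is a single calculation fed by several slices, whereas option b) is an independent calculation per slice.

```java
/** Stand-in FNV-1a (64-bit) over chars; NOT one of the library's hash functions. */
final class SliceHashDemo {
    static final long FNV_OFFSET = 0xcbf29ce484222325L; // FNV-1a 64-bit offset basis
    static final long FNV_PRIME = 0x100000001b3L;       // FNV-1a 64-bit prime

    /** Feeds chars [start, end) of s into a running state and returns the new state. */
    static long update(long state, CharSequence s, int start, int end) {
        for (int i = start; i < end; i++) {
            char c = s.charAt(i);
            state = (state ^ (c & 0xff)) * FNV_PRIME; // low byte of the char
            state = (state ^ (c >>> 8)) * FNV_PRIME;  // high byte of the char
        }
        return state;
    }

    /** Hashes chars [start, end) of s as one independent calculation. */
    static long hash(CharSequence s, int start, int end) {
        return update(FNV_OFFSET, s, start, end);
    }

    public static void main(String[] args) {
        String s = "__AAA__BBB__";

        // Option a): one calculation over both slices, as if hashing "AAABBB".
        long h = update(update(FNV_OFFSET, s, 2, 5), s, 7, 10);

        // Option b): two separate calculations, one per slice.
        long h1 = hash(s, 2, 5);
        long h2 = hash(s, 7, 10);

        System.out.println(h == hash("AAABBB", 0, 6)); // true: same chars, same order
        System.out.println(Long.toHexString(h1) + " " + Long.toHexString(h2));
    }
}
```

A byte-at-a-time hash like this can be resumed across slices trivially. Functions that consume input in larger blocks generally cannot be resumed this way, which is one reason option a) would need a custom access layer or a combined view of the slices rather than repeated calls.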

talwgx commented 3 years ago

That'll be great!