Open avaneev opened 2 years ago
I've noticed that your Java implementation of komihash uses 4-byte values when hashing a string (hashCharsToLong). This looks like a slow approach. Why is that? You can hash using 8-byte values.

But anyway, please take note that I've found and tested an "incremental" hashing approach with komihash; it's described on the project page. You may consider switching your hash streamer to this incremental approach. It does not require pre-buffering and is thus quite a lot more efficient for strings, and even for small values like char, int, long.

Also note that for register usage efficiency it's preferable to save the r1l, r2l, r3l, r4l multiplication results into the corresponding Seed1..4 values, and then XOR them with the Seed5-8 values.

To adopt the "incremental" hashing you'll need to implement a single "static" hashing function, without loading and storing the internal state. You'll just need to store a single "lastHash" value. It will likely be much more efficient than your current implementation.
This is due to Java's UTF-16-like in-memory representation of strings, which uses 2 bytes per character.
Also note that for register usage efficiency it's preferable to save r1l, r2l, r3l, r4l multiplication results into corresponding Seed1..4 values, and then XOR them with Seed5-8 values.
I will do some benchmarks to check if this makes a difference. This is an optimization, which the Java compiler could probably do automatically.
Thanks. Didn't notice the char values are UTF-16.
The incremental approach might be an interesting alternative. If I find the time I will do an alternative implementation and look at the benchmarks.
I am not an expert in designing hash functions, but I am wondering if this incremental approach might lead to worse hash quality. My gut feeling is that it is better to serialize an object and then apply a hash function once than to apply the hash function recursively on chunks. After all, there must be a reason why most hash functions use an internal state that is greater than 8 bytes.
The quality is fine. Komihash has 128-bit "state space", so collisions do not happen. Anyway, I've tested the approach in SMHasher without issues.
You can optimize incremental hashing for char, byte, short, long value inputs quite a bit - instead of calling a single static hash function you can add "truncated" implementations for specific input lengths.
What I mean is that only 64 bits of information is transferred from one chunk to another with the incremental approach. If everything is hashed at once, there is much more information that can somehow interact due to the larger internal state of the hash function.
Well, look at the hash value as a "starting point". The next hashing round moves this "starting point" to another position. This may not work with other hashes, but it works with komihash due to its qualities. Whether you hash in a single pass or split the input into separate incremental byte-wide inputs, the hash quality remains the same; I've tested this.
@avaneev, I have implemented your suggestions (see #50), but could not see any performance improvement in my benchmarks. I assume that the Java compiler does this kind of optimization automatically.
Understood. Anyway, it looks cleaner that way I think.
reopening this issue as it was unintentionally closed
@avaneev, after implementing the streaming variant (see https://github.com/avaneev/komihash/issues/8), do you think the incremental approach is still worth considering? I tend to keep the current approach, which is also safer for other hash algorithms that do not have the required properties for the incremental approach.
The discrete-incremental approach is very useful for database hashing, not only because it is faster, but also because it implicitly separates the data items. For example, if you hash two string-pairs - "aaa"+"bbbb" and "aaab"+"bbb" - in a usual streamed manner, you'll get two identical hashes. Discrete-incremental hashing will produce two different hashes in this case.
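As a minimal C sketch of this difference (assuming the single-header komihash.h from the reference implementation is available, and using the komihash( msg, len, seed ) signature that appears later in this thread):

#include <stdio.h>
#include <stdint.h>
#include "komihash.h" // single-header reference C implementation, assumed available

int main( void )
{
    // Usual streamed hashing: "aaa"+"bbbb" and "aaab"+"bbb" both concatenate
    // to the same byte sequence "aaabbbb", so the hashes are identical.
    const uint64_t s1 = komihash( "aaabbbb", 7, 0 );
    const uint64_t s2 = komihash( "aaabbbb", 7, 0 );

    // Discrete-incremental hashing: the hash of the first field seeds the
    // hash of the second field, so the field boundary affects the result.
    const uint64_t d1 = komihash( "bbbb", 4, komihash( "aaa", 3, 0 ));
    const uint64_t d2 = komihash( "bbb", 3, komihash( "aaab", 4, 0 ));

    printf( "streamed: %016llx %016llx\n", (unsigned long long) s1, (unsigned long long) s2 );
    printf( "discrete: %016llx %016llx\n", (unsigned long long) d1, (unsigned long long) d2 );
    return( 0 );
}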
For example, if you hash two string-pairs - "aaa"+"bbbb" and "aaab"+"bbb" in a usual streamed manner, you'll get two identical hashes.
This is why hash4j provides methods that append the lengths of variable-length fields to make hashes unique. See for example: https://github.com/dynatrace-oss/hash4j/blob/17bfe55a574e48d3f7697fa2fea6a9a8cfe07cbc/src/main/java/com/dynatrace/hash4j/hashing/HashSink.java#L287-L306
Yes, I understand. But the approach is not ideal, because the length is a binary value, and in some cases, if you append another independent binary value after the string, a collision may happen anyway.
Using the binary representation after serializing the object avoids hash collisions by definition. Appending the lengths of all variable length fields of an object is equivalent to serialization, since the object can be reconstructed from this information. Therefore, the hash4j approach should be fine.
I think this leaves room for programming error. Besides that, appending the string's length is unnecessary overhead compared to the discrete-incremental approach.
For nested data structures like a list of lists (e.g. List<List<Integer>>) the incremental approach is not straightforward either. Either one needs another hasher instance or one has to add the lengths of the inner lists.
For nested structures, it will be enough to add a zero-length hashing round Hash = komihash( &tmp, 0, Hash ); for each nesting level. It's a data-less round, and it's very fast, at least in C.
Do you mean that {{"ab", "cd"}, {"ef", "gh"}} would be hashed as komihash << "ab" << "cd" << "" << "ef" << "gh", where the << operator denotes a hashing round and "" indicates a zero-length hashing round?
But then {{"ab", "cd", "", "ef", "gh"}} with an empty string in the middle would give a hash collision.
It should be more like komihash << "" << "ab" << "cd" << "" << "ef" << "gh".
But you are right, this is not perfect. Unfortunately, there seems to be no good solution, since if you put array lengths instead of empty strings, the same counter-example can be given.
So, a sub-hasher is probably needed that will look like: komihash << (komihash << "ab" << "cd") << (komihash << "ef" << "gh"). From a performance point of view, it's as fast as adding zero-lengths. But anyway, a counter-example can still be given, except this will involve random numbers, so it won't look obvious.
In my proposal I've assumed zero-terminated strings, so their lengths are non-zero even when they are empty. But that's also not perfect, because individual zero binary bytes can also happen in a hash stream.
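A rough C sketch of the sub-hasher idea above, under the same komihash( msg, len, seed ) assumption: each inner list is hashed discrete-incrementally on its own, and the resulting 64-bit values are fed as data into the outer chain.

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include "komihash.h" // single-header reference C implementation, assumed available

// Sub-hasher: hash one inner list of strings discrete-incrementally.
static uint64_t hash_inner( const char** s, const size_t n, const uint64_t seed )
{
    uint64_t h = seed;

    for( size_t i = 0; i < n; i++ )
    {
        h = komihash( s[ i ], strlen( s[ i ]), h );
    }

    return( h );
}

// Outer hash of {{"ab","cd"},{"ef","gh"}}: each inner hash value is fed
// as an 8-byte data item into the outer discrete-incremental chain.
static uint64_t hash_nested_example( const uint64_t seed )
{
    const char* inner1[] = { "ab", "cd" };
    const char* inner2[] = { "ef", "gh" };
    uint64_t h = seed;
    uint64_t t;

    t = hash_inner( inner1, 2, seed );
    h = komihash( &t, sizeof( t ), h );
    t = hash_inner( inner2, 2, seed );
    h = komihash( &t, sizeof( t ), h );

    return( h );
}

int main( void )
{
    printf( "%016llx\n", (unsigned long long) hash_nested_example( 0 ));
    return( 0 );
}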
But you are right, this is not perfect. Unfortunately, there seems to be no good solution, since if you put array lengths instead of empty strings, the same counter-example can be given.
When lengths (also those of the strings) are always appended, it should be fine: {{"ab", "cd"}, {"ef", "gh"}} would lead to a binary representation corresponding to ['a', 'b', 2, 'c', 'd', 2, 2, 'e', 'f', 2, 'g', 'h', 2, 2, 2], whereas {{"ab", "cd", "", "ef", "gh"}} would give ['a', 'b', 2, 'c', 'd', 2, 0, 'e', 'f', 2, 'g', 'h', 2, 5, 1].
Of course, for tiny strings there is a lot of overhead, but in general it is a safe strategy which works with just a single hasher instance.
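A hedged C sketch of this length-appending encoding (single-byte lengths and list sizes purely for illustration; hash4j's actual encoding lives in HashSink), producing the byte sequence above and hashing it in one pass:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include "komihash.h" // single-header reference C implementation, assumed available

// Append a string's characters followed by its length (one byte here; sketch only).
static size_t put_string( uint8_t* const buf, size_t pos, const char* const s )
{
    const size_t len = strlen( s );

    memcpy( buf + pos, s, len );
    buf[ pos + len ] = (uint8_t) len;

    return( pos + len + 1 );
}

int main( void )
{
    uint8_t buf[ 64 ];
    size_t pos = 0;

    // {{"ab","cd"},{"ef","gh"}} -> 'a','b',2,'c','d',2,2,'e','f',2,'g','h',2,2,2
    pos = put_string( buf, pos, "ab" );
    pos = put_string( buf, pos, "cd" );
    buf[ pos++ ] = 2; // size of the first inner list
    pos = put_string( buf, pos, "ef" );
    pos = put_string( buf, pos, "gh" );
    buf[ pos++ ] = 2; // size of the second inner list
    buf[ pos++ ] = 2; // size of the outer list

    // The whole serialized object is then hashed in a single pass.
    printf( "%016llx\n", (unsigned long long) komihash( buf, pos, 0 ));
    return( 0 );
}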
So, a sub-hasher is probably needed that will look like: komihash << (komihash << "ab" << "cd") << (komihash << "ef" << "gh")
This is what I meant by "another hasher instance". This strategy requires creating further instances, which also comes with some overhead (especially in Java). Hashing a nested data structure requires as many hasher instances as there are nesting levels.
I do not think you need "another instance". A simple hash value stack will suffice, because the hashing function uses local variables only.
Your example above collides with {"ab", "cd", 2, "ef", "gh", 2, 2}, so comparably it's also problematic, even if just theoretical.
I do not think you need "another instance". A simple hash value stack will suffice, because the hashing function uses local variables only
Agree, but this could be difficult to realize in Java without additional overhead or object allocations.
Your example above collides with {"ab", "cd", 2, "ef", "gh", 2, 2}, so comparably it's also problematic, even if just theoretical.
This is different, as this is a list of some polymorphic type (able to represent integers and strings). If you hash a polymorphic object you always have to include the type information. Then there will be no collision.
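For illustration, a hypothetical one-byte tag scheme (not hash4j's actual encoding) that makes a string element and an integer element impossible to confuse could look like this:

#include <stdint.h>
#include <string.h>

enum { TAG_STRING = 1, TAG_INT = 2 }; // illustrative type tags

// Append a tagged string: tag byte, characters, one-byte length (sketch only).
static size_t put_tagged_string( uint8_t* const buf, size_t pos, const char* const s )
{
    const size_t len = strlen( s );

    buf[ pos++ ] = TAG_STRING;
    memcpy( buf + pos, s, len );
    pos += len;
    buf[ pos++ ] = (uint8_t) len;

    return( pos );
}

// Append a tagged integer: tag byte, 8 value bytes (native byte order; sketch only).
static size_t put_tagged_int( uint8_t* const buf, size_t pos, const int64_t v )
{
    buf[ pos++ ] = TAG_INT;
    memcpy( buf + pos, &v, sizeof( v ));

    return( pos + sizeof( v ));
}

int main( void )
{
    uint8_t buf[ 32 ];
    size_t pos = 0;

    // With tags, the string "gh" and the integer 2 can never produce the
    // same bytes, so the collision from the example above disappears.
    pos = put_tagged_string( buf, pos, "gh" );
    pos = put_tagged_int( buf, pos, 2 );
    (void) pos;
    return( 0 );
}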
I think the additional overhead of a hash value stack would be much lower than the overhead associated with storing string lengths and types (for polymorphic objects), and with buffering, of course. Discrete-incremental capability may be something new in high-performance database hashing since it is fast, universal, and type-agnostic.
I think type information is also necessary for the incremental approach in case of polymorphic types. Since, as you mentioned, Komihash might be the only algorithm that has the properties required for this incremental approach, I still think it is better to use the current approach by default. However, I see the potential and will consider adding this incremental approach as an alternative in the future. I will keep this issue open in the meanwhile.
You mentioned that there are statistical tests that prove that Komihash produces equally good hashes when it is run incrementally (e.g. when only one byte is added in each hashing round). Are such tests already part of smhasher?
No, it is not in SMHasher. Here's the code you can try if you have SMHasher: simply change the komihash() call to komihash_di() in the SMHasher code (its hash identifier differs, though).
#include <stdint.h>
#include <stddef.h>
#include "komihash.h" // single-header C implementation providing komihash()

// Discrete-incremental variant: hashes the message one byte at a time,
// feeding each round's 64-bit result into the next round as the seed.
static inline uint64_t komihash_di( const void* const Msg, const size_t MsgLen,
    const uint64_t UseSeed )
{
    if( MsgLen == 0 )
    {
        // Zero-length input: perform a single data-less hashing round.
        char tmp;
        return( komihash( &tmp, 0, UseSeed ));
    }

    uint64_t Hash = UseSeed;

    for( size_t i = 0; i < MsgLen; i++ )
    {
        Hash = komihash( (const uint8_t*) Msg + i, 1, Hash );
    }

    return( Hash );
}