Open avaneev opened 2 years ago
I've noticed that your Java implementation of komihash uses 4-byte values when hashing a string (hashCharsToLong). This looks like a slow approach. Why is that? You can hash using 8-byte values.

But anyway, please take note that I've found and tested an "incremental" hashing approach with komihash; it's described on the project page. You may consider switching your hash streamer to this incremental approach. It does not require pre-buffering and is thus quite a lot more efficient for strings, and even for small values like char, int, long.

Also note that for register usage efficiency it's preferable to save the r1l, r2l, r3l, r4l multiplication results into the corresponding Seed1..4 values, and then XOR them with the Seed5-8 values.

To adopt the "incremental" hashing you'll need to implement a single "static" hashing function, without loading and storing the internal state. You'll just need to store a single "lastHash" value. It will likely be much more efficient than your current implementation.
This is due to Java's UTF-16-like in-memory representation of strings, which uses 2 bytes per character.
Also note that for register usage efficiency it's preferable to save r1l, r2l, r3l, r4l multiplication results into corresponding Seed1..4 values, and then XOR them with Seed5-8 values.
I will do some benchmarks to check if this makes a difference. This is an optimization, which the Java compiler could probably do automatically.
Thanks. Didn't notice the char values are UTF-16.
The incremental approach might be an interesting alternative. If I find the time I will do an alternative implementation and look at the benchmarks.
I am not an expert in designing hash functions, but I am wondering if this incremental approach might lead to worse hash quality. My gut feeling is that it is better to serialize an object and then apply a hash function once than to apply the hash function recursively on chunks. After all, there must be a reason why most hash functions use an internal state that is greater than 8 bytes.
The quality is fine. Komihash has 128-bit "state space", so collisions do not happen. Anyway, I've tested the approach in SMHasher without issues.
You can optimize incremental hashing for char, byte, short, long value inputs quite a bit - instead of calling a single static hash function you can add "truncated" implementations for specific input lengths.
What I mean is that only 64 bits of information is transferred from one chunk to another with the incremental approach. If everything is hashed at once, there is much more information that can somehow interact due to the larger internal state of the hash function.
Well, look at the hash value as a "starting point". The next hashing round moves this "starting point" to another position. This may not work with other hashes, but it works with komihash due to its qualities. Whether you hash in a single pass or split the input into separate incremental byte-wide inputs, the hash quality remains the same; I've tested this.
@avaneev, I have implemented your suggestions (see #50), but could not see any performance improvement in my benchmarks. I assume that the Java compiler does this kind of optimization automatically.
Understood. Anyway, it looks cleaner that way I think.
reopening this issue as it was unintentionally closed
@avaneev, after implementing the streaming variant (see https://github.com/avaneev/komihash/issues/8), do you think the incremental approach is still worth considering? I tend to keep the current approach, which is also safer for other hash algorithms that do not have the required properties for the incremental approach.
The discrete-incremental approach is very useful for database hashing, not only because it is faster, but also because it implicitly separates the data items. For example, if you hash two string-pairs - "aaa"+"bbbb" and "aaab"+"bbb" - in a usual streamed manner, you'll get two identical hashes. Discrete-incremental hashing will produce two different hashes in this case.
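As a minimal C sketch of this difference (assuming the single-header komihash.h from the reference implementation is available, and using the komihash( msg, len, seed ) signature that appears later in this thread):

#include <stdio.h>
#include <stdint.h>
#include "komihash.h" // single-header reference C implementation, assumed available

int main( void )
{
    // Usual streamed hashing: "aaa"+"bbbb" and "aaab"+"bbb" both concatenate
    // to the same byte sequence "aaabbbb", so the hashes are identical.
    const uint64_t s1 = komihash( "aaabbbb", 7, 0 );
    const uint64_t s2 = komihash( "aaabbbb", 7, 0 );

    // Discrete-incremental hashing: the hash of the first field seeds the
    // hash of the second field, so the field boundary affects the result.
    const uint64_t d1 = komihash( "bbbb", 4, komihash( "aaa", 3, 0 ));
    const uint64_t d2 = komihash( "bbb", 3, komihash( "aaab", 4, 0 ));

    printf( "streamed: %016llx %016llx\n", (unsigned long long) s1, (unsigned long long) s2 );
    printf( "discrete: %016llx %016llx\n", (unsigned long long) d1, (unsigned long long) d2 );
    return( 0 );
}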
For example, if you hash two string-pairs - "aaa"+"bbbb" and "aaab"+"bbb" in a usual streamed manner, you'll get two identical hashes.
This is why hash4j provides methods that append the lengths of variable-length fields to make hashes unique. See for example: https://github.com/dynatrace-oss/hash4j/blob/17bfe55a574e48d3f7697fa2fea6a9a8cfe07cbc/src/main/java/com/dynatrace/hash4j/hashing/HashSink.java#L287-L306
Yes, I understand. But the approach is not ideal, because the length is a binary value, and in some cases, if you append another independent binary value after the string, a collision may happen anyway.
Using the binary representation after serializing the object avoids hash collisions by definition. Appending the lengths of all variable length fields of an object is equivalent to serialization, since the object can be reconstructed from this information. Therefore, the hash4j approach should be fine.
I think this leaves room for programming error. Besides that, appending the string's length is unnecessary overhead compared to the discrete-incremental approach.
For nested data structures like a list of lists (e.g. List<List<Integer>>) the incremental approach is not straightforward either. Either one needs another hasher instance or one has to add the lengths of the inner lists.
For nested structures, it will be enough to add a zero-length hashing round Hash = komihash( &tmp, 0, Hash ); for each nesting level. It's a data-less round, and it's very fast, at least in C.
Do you mean that {{"ab", "cd"}, {"ef", "gh"}} would be hashed as komihash << "ab" << "cd" << "" << "ef" << "gh", where the << operator denotes a hashing round and "" indicates a zero-length hashing round?
But then {{"ab", "cd", "", "ef", "gh"}} with an empty string in the middle would give a hash collision.
It should be more like komihash << "" << "ab" << "cd" << "" << "ef" << "gh".
But you are right, this is not perfect. Unfortunately, there seems to be no good solution, since if you put array lengths instead of empty strings, the same counter-example can be given.
So, a sub-hasher is probably needed that will look like: komihash << (komihash << "ab" << "cd") << (komihash << "ef" << "gh"). From a performance point of view, it's as fast as adding zero-lengths. But anyway, a counter-example can still be given, except this will involve random numbers, so it won't look obvious.
In my proposal I've assumed zero-terminated strings, so their lengths are non-zero even when they are empty. But that's also not perfect, because individual zero binary bytes can also happen in a hash stream.
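A rough C sketch of the sub-hasher idea above, under the same komihash( msg, len, seed ) assumption: each inner list is hashed discrete-incrementally on its own, and the resulting 64-bit values are fed as data into the outer chain.

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include "komihash.h" // single-header reference C implementation, assumed available

// Sub-hasher: hash one inner list of strings discrete-incrementally.
static uint64_t hash_inner( const char** s, const size_t n, const uint64_t seed )
{
    uint64_t h = seed;

    for( size_t i = 0; i < n; i++ )
    {
        h = komihash( s[ i ], strlen( s[ i ]), h );
    }

    return( h );
}

// Outer hash of {{"ab","cd"},{"ef","gh"}}: each inner hash value is fed
// as an 8-byte data item into the outer discrete-incremental chain.
static uint64_t hash_nested_example( const uint64_t seed )
{
    const char* inner1[] = { "ab", "cd" };
    const char* inner2[] = { "ef", "gh" };
    uint64_t h = seed;
    uint64_t t;

    t = hash_inner( inner1, 2, seed );
    h = komihash( &t, sizeof( t ), h );
    t = hash_inner( inner2, 2, seed );
    h = komihash( &t, sizeof( t ), h );

    return( h );
}

int main( void )
{
    printf( "%016llx\n", (unsigned long long) hash_nested_example( 0 ));
    return( 0 );
}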
But you are right, this is not perfect. Unfortunately, there seems to be no good solution, since if you put array lengths instead of empty strings, the same counter-example can be given.
When lengths (also those of the strings) are always appended, it should be fine: {{"ab", "cd"}, {"ef", "gh"}} would lead to a binary representation corresponding to ['a', 'b', 2, 'c', 'd', 2, 2, 'e', 'f', 2, 'g', 'h', 2, 2, 2], whereas {{"ab", "cd", "", "ef", "gh"}} would give ['a', 'b', 2, 'c', 'd', 2, 0, 'e', 'f', 2, 'g', 'h', 2, 5, 1].
Of course, for tiny strings there is a lot of overhead, but in general it is a safe strategy which works with just a single hasher instance.
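A hedged C sketch of this length-appending encoding (single-byte lengths and list sizes purely for illustration; hash4j's actual encoding lives in HashSink), producing the byte sequence above and hashing it in one pass:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include "komihash.h" // single-header reference C implementation, assumed available

// Append a string's characters followed by its length (one byte here; sketch only).
static size_t put_string( uint8_t* const buf, size_t pos, const char* const s )
{
    const size_t len = strlen( s );

    memcpy( buf + pos, s, len );
    buf[ pos + len ] = (uint8_t) len;

    return( pos + len + 1 );
}

int main( void )
{
    uint8_t buf[ 64 ];
    size_t pos = 0;

    // {{"ab","cd"},{"ef","gh"}} -> 'a','b',2,'c','d',2,2,'e','f',2,'g','h',2,2,2
    pos = put_string( buf, pos, "ab" );
    pos = put_string( buf, pos, "cd" );
    buf[ pos++ ] = 2; // size of the first inner list
    pos = put_string( buf, pos, "ef" );
    pos = put_string( buf, pos, "gh" );
    buf[ pos++ ] = 2; // size of the second inner list
    buf[ pos++ ] = 2; // size of the outer list

    // The whole serialized object is then hashed in a single pass.
    printf( "%016llx\n", (unsigned long long) komihash( buf, pos, 0 ));
    return( 0 );
}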
So, a sub-hasher is probably needed that will look like: komihash << (komihash << "ab" << "cd") << (komihash << "ef" << "gh")
This is what I meant by "another hasher instance". This strategy requires creating further instances, which also comes with some overhead (especially in Java). Hashing a nested data structure requires as many hasher instances as there are nesting levels.
I do not think you need "another instance". A simple hash value stack will suffice, because the hashing function uses local variables only.
Your example above collides with {"ab", "cd", 2, "ef", "gh", 2, 2}, so comparably it's also problematic, even if just theoretical.
I do not think you need "another instance". A simple hash value stack will suffice, because the hashing function uses local variables only
Agree, but this could be difficult to realize in Java without additional overhead or object allocations.
Your example above collides with {"ab", "cd", 2, "ef", "gh", 2, 2}, so comparably it's also problematic, even if just theoretical.
This is different, as this is a list of some polymorphic type (able to represent integers and strings). If you hash a polymorphic object you always have to include the type information. Then there will be no collision.
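For illustration, a hypothetical one-byte tag scheme (not hash4j's actual encoding) that makes a string element and an integer element impossible to confuse could look like this:

#include <stdint.h>
#include <string.h>

enum { TAG_STRING = 1, TAG_INT = 2 }; // illustrative type tags

// Append a tagged string: tag byte, characters, one-byte length (sketch only).
static size_t put_tagged_string( uint8_t* const buf, size_t pos, const char* const s )
{
    const size_t len = strlen( s );

    buf[ pos++ ] = TAG_STRING;
    memcpy( buf + pos, s, len );
    pos += len;
    buf[ pos++ ] = (uint8_t) len;

    return( pos );
}

// Append a tagged integer: tag byte, 8 value bytes (native byte order; sketch only).
static size_t put_tagged_int( uint8_t* const buf, size_t pos, const int64_t v )
{
    buf[ pos++ ] = TAG_INT;
    memcpy( buf + pos, &v, sizeof( v ));

    return( pos + sizeof( v ));
}

int main( void )
{
    uint8_t buf[ 32 ];
    size_t pos = 0;

    // With tags, the string "gh" and the integer 2 can never produce the
    // same bytes, so the collision from the example above disappears.
    pos = put_tagged_string( buf, pos, "gh" );
    pos = put_tagged_int( buf, pos, 2 );
    (void) pos;
    return( 0 );
}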
I think the additional overhead of a hash value stack would be much lower than the overhead associated with storing string lengths and types (for polymorphic objects), and with buffering, of course. Discrete-incremental capability may be something new in high-performance database hashing since it is fast, universal, and type-agnostic.
I think type information is also necessary for the incremental approach in case of polymorphic types. Since, as you mentioned, Komihash might be the only algorithm that has the properties required for this incremental approach, I still think it is better to use the current approach by default. However, I see the potential and will consider adding this incremental approach as an alternative in the future. I will keep this issue open in the meanwhile.
You mentioned that there are statistical tests that prove that Komihash produces equally good hashes when it is run incrementally (e.g. when only one byte is added in each hashing round). Are such tests already part of smhasher?
No, it is not in SMHasher. Here's the code you can try if you have SMHasher: simply change the komihash() call to komihash_di() in the SMHasher code (its hash identifier differs, though).
#include <stdint.h>
#include <stddef.h>
#include "komihash.h" // single-header C implementation providing komihash()

// Discrete-incremental variant: hashes the message one byte at a time,
// feeding each round's 64-bit result into the next round as the seed.
static inline uint64_t komihash_di( const void* const Msg, const size_t MsgLen,
    const uint64_t UseSeed )
{
    if( MsgLen == 0 )
    {
        // Zero-length input: perform a single data-less hashing round.
        char tmp;
        return( komihash( &tmp, 0, UseSeed ));
    }

    uint64_t Hash = UseSeed;

    for( size_t i = 0; i < MsgLen; i++ )
    {
        Hash = komihash( (const uint8_t*) Msg + i, 1, Hash );
    }

    return( Hash );
}