Nicoshev / rapidhash

Very fast, high quality, platform-independent hashing algorithm.

Streaming extension #3

Open tdeenes opened 1 month ago

tdeenes commented 1 month ago

@Nicoshev Do you plan to add streaming capability in a similar vein to XXH64_update? My use case would be to calculate the hash of a file incrementally, e.g., like this hashFile example.
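
For reference, the incremental pattern I mean, using xxhash's real streaming API (hash_file is just an illustrative name; a rapidhash equivalent would presumably look similar):

#include <stdint.h>
#include <stdio.h>
#include <xxhash.h>

uint64_t hash_file(FILE *f)
{
    XXH64_state_t *state = XXH64_createState();
    XXH64_reset(state, 0);                      /* seed = 0 */
    char buf[4096];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        XXH64_update(state, buf, n);            /* feed one chunk at a time */
    uint64_t h = XXH64_digest(state);
    XXH64_freeState(state);
    return h;
}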

trapexit commented 1 month ago

If rapidhash is like xxhash, it really can't do streaming. The value, at least of xxhash, will change depending on block size. But of course that's for @Nicoshev to comment on authoritatively.

Nicoshev commented 1 month ago

I think it is possible to achieve. The complexity of doing so derives from whether we require the final hash to be equal to the result of hashing the whole buffer at once.

If hashes can be different, then we can just call the hash function repeatedly. We would need to test ways to maintain a main seed that does not cause the hash quality to diverge as calls go by.

If the final hash needs to be equal to hashing the buffer at once, we need to maintain a state containing: the total bytes hashed so far, the number of outstanding bytes not yet hashed, and 3 seeds. With that information we can hash chunks of 48 bytes independently and then have a separate rapidhash_finalize function that computes the last bytes. This finalize function will either call regular rapidhash, if at most 16 bytes were requested to be processed, or the piece of code that comes after the loop otherwise. The most challenging part of this approach is finding a way to reuse code while guaranteeing that no performance penalty is introduced to the main hash function.
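
As a minimal sketch, that state could look something like this (the names are illustrative; nothing here exists in rapidhash.h yet):

#include <stddef.h>
#include <stdint.h>

/* Hypothetical streaming state following the description above. */
typedef struct {
    uint64_t seed, see1, see2;  /* the 3 seeds */
    uint64_t total_len;         /* total bytes hashed so far */
    uint8_t  buf[96];           /* outstanding, not-yet-hashed bytes */
    size_t   buf_len;           /* number of outstanding bytes */
} rapidhash_stream_t;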

Maybe the best way is to encapsulate the compact loop code in a function and call it from the main rapidhash function. The function should get inlined; but if not, utilizing the compact loop already trades off performance, so we would only be hurting the case that already tends to be the less performant one. The update function would call that function if the outstanding bytes are at least 96, always leaving at least 48 bytes unprocessed. rapidhash_internal needs to be modified to strip out the first line that modifies the seed. This line could be placed in rapidhash_withSeed. Then, the finalize function would just need to call rapidhash_internal, and I believe it would work.
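
A hedged sketch of that update step, using the rapidhash_stream_t above (rapidhash_compact_loop is an assumed name for the encapsulated compact loop):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed signature for the encapsulated compact-loop function. */
void rapidhash_compact_loop(const uint8_t *p, size_t len,
                            uint64_t *seed, uint64_t *see1, uint64_t *see2);

void rapidhash_stream_update(rapidhash_stream_t *st, const void *data, size_t len)
{
    const uint8_t *p = (const uint8_t *)data;
    st->total_len += len;
    while (len) {
        /* Top up the 96-byte staging buffer. */
        size_t n = sizeof(st->buf) - st->buf_len;
        if (n > len) n = len;
        memcpy(st->buf + st->buf_len, p, n);
        st->buf_len += n; p += n; len -= n;
        /* Hash only once 96 bytes are outstanding, and keep the last 48
           unprocessed so the finalize step always has input left. */
        if (st->buf_len == sizeof(st->buf) && len) {
            rapidhash_compact_loop(st->buf, 48,
                                   &st->seed, &st->see1, &st->see2);
            memmove(st->buf, st->buf + 48, 48);
            st->buf_len = 48;
        }
    }
}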

Cheers, Nicolas

tdeenes commented 1 month ago

My understanding is that xxhash does almost the same as what you described (see here), and due to the overhead, it keeps the implementation of the streaming logic separate.

trapexit commented 1 month ago

Sorry. I meant wyhash. Not xxhash. My bad.

oertl commented 1 month ago

@trapexit hash4j includes a streaming version of WyHash and XXH3 in Java. We also plan to add a streaming version of RapidHash as soon as there is an official release (see #4).

Techcable commented 3 weeks ago

Due to the design of Rust's std::hash::Hasher trait, this is necessary for a Rust port.

haberman commented 2 weeks ago

If hashes can be different, then we can just call the hash function repeatedly. We would need to test ways to maintain a main seed that does not cause the hash quality to diverge as calls go by.

I am interested in this use case. Will hash quality degrade if I call:

  uint64_t hash = RAPID_SEED;
  hash = rapidhash_withSeed(buf1, len1, hash);
  hash = rapidhash_withSeed(buf2, len2, hash);
  hash = rapidhash_withSeed(buf3, len3, hash);
  // etc.

I don't need this to yield the same hash as the concatenated buffer would have, but I do want the result to have the same quality.

amateur80lvl commented 2 weeks ago

I implemented a streaming version for my own needs. It's a bit simplified, given that I work with specific data and don't have to worry about the len < 16 case. But this len is what actually prevents implementing streaming the right way: it is involved in the hash calculations, but when we deal with streaming we often don't know the length. What if we omit it? I did not notice any difference in quality, but I'm not an expert in this area. And if we do that, it will just be a globally incompatible version, which should be named accordingly. Rapidhash2, probably :)

oertl commented 2 weeks ago

@amateur80lvl this can be solved by using a small buffer. For example, the streaming Wyhash implementation in hash4j uses a fixed-size buffer of 56 bytes. I have not yet worked on it, but I expect it will work similarly for Rapidhash, as it is derived from Wyhash.

amateur80lvl commented 2 weeks ago

@oertl, if you know len in advance, a streaming API is not always necessary and can be worked around with memory-mapped files, for example (sketched below). As for real streaming, len is an obstacle:

https://github.com/Nicoshev/rapidhash/blob/588978411df8683777429f729be5213eb1bfd5f3/rapidhash.h#L245
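
For concreteness, the mmap workaround could look like this (POSIX; error handling omitted; rapidhash() here is the existing one-shot function):

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include "rapidhash.h"

/* Hash a whole file in one rapidhash() call; the length is known up front,
   so no streaming API is needed. */
uint64_t hash_file_mmap(const char *path)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    uint64_t h = rapidhash(p, (size_t)st.st_size);
    munmap(p, st.st_size);
    close(fd);
    return h;
}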

oertl commented 2 weeks ago

@amateur80lvl, you are right. I overlooked this line, which prevents a streaming implementation. The same problem existed with Wyhash (see https://github.com/wangyi-fudan/wyhash/issues/121).

amateur80lvl commented 2 weeks ago

@oertl basically, I always know the size of the data in my apps and can't imagine someone needing a hash of indefinite data. But... scanning C strings twice is annoying.

Update:

I'll leave it here just in case. This is my version from before I removed len, so if fed with pairs of uint64, the result was the same as produced by rapidhash. For a general-purpose version, one would need to incorporate the len < 16 case from the original code instead of padding. Also, the buffer should probably be changed to uint8_t, with all those rapid_read calls brought back.

#include <stddef.h>
#include <stdint.h>

/* rapid_mix, rapid_mum, RAPID_SEED and the RAPID_SECRET_* constants are
   adapted from the corresponding helpers and secrets in rapidhash.h. */

typedef uint64_t  AAAType_Hash;

typedef struct {
    uint64_t seed;
    uint64_t see1;
    uint64_t see2;
    uint64_t buffer[6];
    size_t len;
    int buf_size;
} AAAHashContext;

//#define TRACE(...)  printf(__VA_ARGS__)
#define TRACE(...)

/* len must be the total input length, known up front (the limitation
   discussed above). */
void aaahash_init_context(AAAHashContext* ctx, size_t len)
{
    ctx->seed = RAPID_SEED ^ rapid_mix(RAPID_SEED ^ RAPID_SECRET_0, RAPID_SECRET_1) ^ len;
    ctx->see1 = ctx->seed;
    ctx->see2 = ctx->seed;
    ctx->len = len;
    ctx->buf_size = 0;
}

/* Feed one 64-bit word; every 6 words (48 bytes) are mixed as one chunk,
   mirroring rapidhash's main loop. */
void aaahash_add_uint64(AAAHashContext* ctx, uint64_t data)
{
    if (ctx->buf_size == 6) {
        ctx->buf_size = 0;

        ctx->seed = rapid_mix(ctx->buffer[0] ^ RAPID_SECRET_0, ctx->buffer[1] ^ ctx->seed);
        ctx->see1 = rapid_mix(ctx->buffer[2] ^ RAPID_SECRET_1, ctx->buffer[3] ^ ctx->see1);
        ctx->see2 = rapid_mix(ctx->buffer[4] ^ RAPID_SECRET_2, ctx->buffer[5] ^ ctx->see2);
        TRACE("AAA round seed=%llx\n", (unsigned long long) ctx->seed);
    }
    ctx->buffer[ctx->buf_size++] = data;
}

/* Mix any remaining buffered words and produce the final hash,
   mirroring rapidhash's tail handling. */
AAAType_Hash aaahash_finish(AAAHashContext* ctx)
{
    ctx->seed ^= ctx->see1 ^ ctx->see2;

    // padding instead of handling len < 16
    while (ctx->buf_size < 2) {
        TRACE("PAD %d\n", ctx->buf_size);
        ctx->buffer[ctx->buf_size++] = 0;
    }

    if (ctx->buf_size > 2) {
        ctx->seed = rapid_mix(ctx->buffer[0] ^ RAPID_SECRET_2, ctx->buffer[1] ^ ctx->seed ^ RAPID_SECRET_1);
        TRACE("AAA %d seed=%llx\n", ctx->buf_size, (unsigned long long) ctx->seed);
        if (ctx->buf_size > 4) {
            ctx->seed = rapid_mix(ctx->buffer[2] ^ RAPID_SECRET_2, ctx->buffer[3] ^ ctx->seed);
            TRACE("AAA %d seed=%llx\n", ctx->buf_size, (unsigned long long) ctx->seed);
        }
    }

    uint64_t a = ctx->buffer[ctx->buf_size - 2] ^ RAPID_SECRET_1;
    uint64_t b = ctx->buffer[ctx->buf_size - 1] ^ ctx->seed;
    TRACE("AAA a=%llx b=%llx\n", (unsigned long long) a, (unsigned long long) b);
    rapid_mum(&a, &b);

    return rapid_mix(a ^ RAPID_SECRET_0 ^ ctx->len, b ^ RAPID_SECRET_1);
}

So, everyone can derive a streaming version that suits their needs. If I were the author of rapidhash, I'd probably reject any extensions to avoid polluting the codebase.

Nicoshev commented 2 weeks ago

@amateur80lvl is right, the total hashed size in bytes is used to initialize the seed, and this may not be known beforehand when doing stream hashing.

Removing that '^len' causes the hash to not pass SMHasher3. When that xor is not present, appending or prepending 0s to the hashed buffer does not cause every bit to change with near 50% probability. This results in a notably higher collision rate when prepending or appending 0s.

Appending 0s is actually more likely to happen in a streaming scenario, where input is added at the end of already-hashed data. So we need to address this property correctly; I'll think about how we can do it without knowing the final length beforehand.

Nicoshev commented 2 weeks ago

@haberman Calling rapidhash multiple times using the previous result as seed should be mostly fine. Quality may drop a bit, but from the testing I have done, the collision probability should remain below 1/(2^63).

I'll try to do further testing to confirm this theory.

tdeenes commented 2 weeks ago

@Nicoshev I think many of us would already benefit from a rapidhash version that provides an interface for known len (or a rapidhash_file function directly). So if a real streaming solution turns out to be more challenging, even this pseudo-streaming extension would be helpful.

trapexit commented 2 weeks ago

Can people step back and articulate the reason for wanting a streamable version? wyhash and rapidhash are good for short data sets, for hashing a string in a data structure, for example. If the algo doesn't work for longer data sets, then it doesn't work. I don't see the value in trying to shoehorn it into a use case. Decisions made for speed, low collision probability, and good distribution on short arrays of data may be incompatible with something intended for more traditional uses.

oertl commented 2 weeks ago

If a hash algorithm is not streamable and you need to hash more complicated objects, they must be serialized to a byte buffer first. This may require the allocation of a byte buffer or using a static one. Even worse, sometimes you don't know the size in advance, such that the byte buffer must grow during serialization or the total length must be determined in a previous iteration. With a streamable algorithm, the fields and nested objects can be directly fed into the hash algorithm without additional memory allocations, copying data, or multiple roundtrips. Consider hashing an object with multiple null-terminated string fields as a simple example.
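
For example, with a hypothetical streaming API along the lines sketched earlier in this thread (none of these functions exist in rapidhash.h), an object with two NUL-terminated string fields could be hashed field by field:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical API; names are assumptions, not part of rapidhash.h. */
typedef struct rapidhash_stream_t rapidhash_stream_t;
void     rapidhash_stream_update(rapidhash_stream_t *st, const void *p, size_t n);
uint64_t rapidhash_stream_finalize(rapidhash_stream_t *st);

struct person { const char *first_name; const char *last_name; };

/* st must already be initialized. Each field is fed directly, with no
   serialization buffer; the NUL terminators are included so that
   ("ab", "c") and ("a", "bc") hash differently. */
uint64_t hash_person(rapidhash_stream_t *st, const struct person *p)
{
    rapidhash_stream_update(st, p->first_name, strlen(p->first_name) + 1);
    rapidhash_stream_update(st, p->last_name,  strlen(p->last_name) + 1);
    return rapidhash_stream_finalize(st);
}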

trapexit commented 2 weeks ago

and you need to hash more complicated objects

But that's the question / point. It isn't designed for that. It's not intended to be a replacement for generic hash functions. And it wouldn't be at all surprising if, by changing it to be like them, you'd lose the advantages it has for small strings... its main use, afaik.

There is nothing today that keeps you from making it "streamable". It takes (key, len, seed) which is all you need. The problem is that the len impacts the output. But that will be consistent so long as the input is consistent.

oertl commented 2 weeks ago

It isn't designed for that.

Rapidhash is the official successor to Wyhash, which can be implemented to work with streams. Therefore, it is not unreasonable to expect the same from Rapidhash.

trapexit commented 2 weeks ago

which can be implemented to work with streams.

I see no wyhash_update(). It was added for one or two releases and then removed.

oertl commented 2 weeks ago

To be clear, I call an algorithm streamable when it can be implemented for streams and only requires a small fixed-size buffer to get the same hash as copying the entire stream into a byte array and applying the reference implementation. In this sense, the current implementation of Rapidhash is not streamable, while Wyhash and other competing algorithms like Komihash, Farmhash, XXH3, Murmur3, or Polymurhash are streamable.

trapexit commented 2 weeks ago

How is it that wyhash is streamable? There are numerous tickets open on that very topic. I linked to the latest release. There is no start, update, or finalize function, and as pointed out in the ticket I linked to, the algo has to be changed to make it work generically. That would need to be a different version or name, as it would lead to different results. And those results would have to be checked against the original for quality.

oertl commented 2 weeks ago

Please see hash4j for a streaming version of Wyhash implemented in Java that is compatible with the latest release wyhash_final4.

Nicoshev commented 1 week ago

If the '^len' operation is removed from the seed initialization routine, and the processing of variable 'b' on line 292 is changed to 'b ^= seed ^ len;', the function becomes streamable and still passes both SMHasher and SMHasher3.

It nevertheless becomes a little worse in short-input speed, and I cannot yet confirm that it is also of higher quality than wyhash.

I do not know if it is worth taking a hit in speed, and possibly quality, just to make the function streamable.
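
Sketched as before/after excerpts (paraphrasing rapidhash.h rather than quoting it; the secret constants are named as in the snippet earlier in this thread):

/* Before: len is folded into the seed up front, so it must be known
   before the first byte is hashed. */
seed ^= rapid_mix(seed ^ RAPID_SECRET_0, RAPID_SECRET_1) ^ len;
/* ... main loop and tail ... */
b ^= seed;

/* After: the seed no longer depends on len; len enters only at
   finalization, so hashing can start before the length is known. */
seed ^= rapid_mix(seed ^ RAPID_SECRET_0, RAPID_SECRET_1);
/* ... main loop and tail ... */
b ^= seed ^ len;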

thynson commented 1 week ago

Due to the design of Rust's std::hash::Hasher trait, this is necessary for a Rust port.

Nope, according to the doc:

Nor can you assume that adjacent write calls are merged, so it's possible, for example, that

    hasher.write(&[1, 2]);
    hasher.write(&[3, 4, 5, 6]);

and

    hasher.write(&[1, 2, 3, 4]);
    hasher.write(&[5, 6]);

end up producing different hashes.

So it's valid to implement a std::hash::Hasher via self.seed = rapidhash_withSeed(slice, self.seed); in Rust; and from a security perspective, I personally feel such an implementation resists crafted tuples that lead to collisions.

thynson commented 1 week ago

If the '^len' operation is removed from the seed initialization routine, and the processing of variable 'b' on line 292 is changed to 'b ^= seed ^ len;', the function becomes streamable and still passes both SMHasher and SMHasher3.

It nevertheless becomes a little worse in short-input speed, and I cannot yet confirm that it is also of higher quality than wyhash.

I do not know if it is worth taking a hit in speed, and possibly quality, just to make the function streamable.

I think it's worth the change as long as it passes SMHasher and SMHasher3.

Alternatively, we could provide a streaming variant. And for hash tables, the construction self.seed = rapidhash_withSeed(slice, self.seed); should perform well (or even better, since buffering can be eliminated entirely), although I'm not sure if there is any quality degradation.

trapexit commented 1 week ago

1) If the performance of the hash (in any dimension) is worse, then the existing hash algo should not be replaced. The original intent and usage is for small strings/data, and that is what it should stay focused on, IMO.

2) Unless a streaming variant is faster or has other better attributes compared to other hashes (which can't be established yet, since no algo has been proposed), there is no point to a streaming variant. What's the point of having an algo if it doesn't excel at something?

tansy commented 1 week ago

Because the hash is

using len to initialize the seed

there is no way to make it streaming, as one doesn't know the length of the data beforehand, despite claims that you do.