Support for save/load on hash internal state to allow resumable hashing

thenickdude commented 3 years ago

In my project I'm hashing some very large files using AWS Lambda, which has a fixed maximum execution time. So partway through the file when I run out of execution time, I need to be able to save the state of the hash to disk, start a new Lambda, and resume that state in the new Lambda to continue hashing the file.

So far I have implemented this just for SHA1/CRC32, which are the hashes I'll be using.

Is this a concept that you're interested in merging? If so I could add support for more/all hashes.

thenickdude commented 3 years ago

after that, WASMInterface could take over the responsibility of slicing and rewriting the memory region corresponding to the hash state.

Is that possible? I think the state object will have its own internal padding and alignment which would be hard for the JavaScript code to predict? (This is why I manually serialised each field instead of writing the entire struct in one go)

Daninet commented 3 years ago

Yeah it's possible. At WASM targets, alignment of fields are decided on compile-time and it will be consistent across all platforms. Also the binaries are compiled once for each version, so the same version of 'hash-wasm' will always have the same memory layout for the state buffer. It's very unlikely that the alignment will change between versions but still, I don't think we would need intercompatibility between different versions of the 'hash-wasm' library. We can highlight this in the readme.

thenickdude commented 3 years ago

Could the padding bytes be uninitialised memory which leaks state from some previous execution though? It seems safer not to include any padding bytes in the serialised form.

Daninet commented 3 years ago

WASM memory is always initialized to zero before the instance is created. Padding remains padding even during execution so it cannot happen that a previous hash uses a different padding for the same struct. If somebody is concerned about leaking out state then he can delete all references to the library instance in JS. That will destroy the whole WASM instance including with the associated memory area.

thenickdude commented 3 years ago

Makes sense. I'll update the PR with these suggestions and start adding more supported hashes

thenickdude commented 3 years ago

Is this new interface looking good now?

The C code now supplies the size of its state with an exported "STATE_SIZE" constant, and returns a pointer to a buffer of this size from a new Hash_GetState() function. WASMInterface takes care of the rest. (Still only implemented for SHA1 and CRC32 so far)

Daninet commented 3 years ago

Yeah, good job! 👍 I'm waiting for the rest of the algorithms before merging it. Maybe the unit tests could be generalized a bit and moved to "api.test.ts". There is an example there which cycles over all "createXXX" functions.

thenickdude commented 3 years ago

Okay, I've implemented the other hashes too now, moved the tests into a pair of generic tests in api.test.ts, and added a section to the Readme.

The C hash functions for Whirlpool, xxhash32 and xxhash64 have had their loose global state variables moved into a struct. For xxhash64 I double checked using the benchmarking tool that this didn't have any negative impact.

createHMAC() still throws when .save() or .load() is called on it. We might not want to support these operations on this one because the saved state can leak the signing key, which would be a disaster.

thenickdude commented 3 years ago

We're not supporting compatibility between states saved on different builds of hash-wasm, which makes sense because otherwise you'd have to nail down the internal state of the hash functions, and then the implementations couldn't be improved.

But I just realised that this will prevent me from deploying a new version of my function unless I can be absolutely sure there are no jobs currently in progress, because updated .load()ers could silently give the wrong hashes if they read a state that was still in flight after being written by an old .save() version.

So I'm thinking I should add a hash of the WASM directory to the saved state, which is then verified on state load? This way we can reliably throw an exception when trying to load a state created by a different build. The hash can be calculated during the build step, so it won't really slow this down.

thenickdude commented 3 years ago

I've added a commit which implements this compatible-state-verification feature now.

Daninet commented 3 years ago

I believe this detection could be improved. It's highly unlikely that the alignment of the fields will change between versions so I think it's too restrictive to bail out the state loading every time when there is a smaller change in one of the binaries. To address the issue, I suggest generating a hash of the state, individually for each algorithm. make_json.js should be able to add a new field to output JSON files containing the hash of the saved state after hashing a long predefined constant. Also, to keep things efficient I think it's more than enough to use the first 4 bytes of the SHA1 hash output instead of the full hash.

thenickdude commented 3 years ago

It would be relatively easy in that situation to end up with state bytes that never get touched during the hash of the predefined constant, for example the high bytes of length fields. These bytes could be filled with undefined padding values in one hash version, then in a later hash version which increases the size of one of those fields (from uint32_t to uint64_t, say) those uninitialised padding values could suddenly affect the behaviour of the hash without being detected by your scheme.

thenickdude commented 3 years ago

I think a better idea would be to investigate reproducible-build flags for Clang so the WASM binaries don't change if the binary doesn't get edited, and then go for the per-function 4-byte hash idea you suggested (hashed over the compiled WASM binary).

Daninet commented 3 years ago

Yeah, it would be hard to come up with a good sequence of predefined constants, especially with more complicated hash functions like BLAKE3. What about hashing the WASM binaries individually then? Hashing all files together does not make sense to me, because it's more complex but not really different from using the package version from package.json as a differentiator.

thenickdude commented 3 years ago

Hashing all files together does not make sense to me, because it's more complex but not really different from using the package version from package.json as a differentiator.

Hashing the files accounts for people who build hash-wasm themselves with different Clang versions, generating different output without bumping the package version.

Daninet commented 3 years ago

Hashing the files accounts for people who build hash-wasm themselves with different Clang versions, generating different output without bumping the package version.

Good point. I think we both agree that hashing the binaries individually is the best solution. :)

Daninet commented 3 years ago

With the current clang flags + docker-based setup the builds should be already reproducible. This was an important feature from the beginning to have a transparent and reproducible build process.

thenickdude commented 3 years ago

Okay, I'll work on including the per-function code hashes.

thenickdude commented 3 years ago

I've implemented this now!

Daninet / hash-wasm

Support for save/load on hash internal state to allow resumable hashing #17