WebAssembly / stringref

Other
37 stars 2 forks source link

New instruction `string.hash` #60

Open Liedtke opened 1 year ago

Liedtke commented 1 year ago

We have another instruction that is added to v8 for our performance experiments:

The reasoning for this instruction is that any JavaScript engine already requires hashed strings, so with stringrefs there is already space reserved in the object for it and there are fast engine-internal implementations available handling different kinds of internal string representations etc. Most languages require hashes for data structures and with stringrefs they can't add a hash property to strings, so keeping an application-defined hash together with the string is difficult / impacts performance.

dcodeIO commented 1 year ago

Wondering: Would it make sense to extend this concept so any kind of host reference can be hashed / reuse what only an engine in control of the references can do? Say, if one wants to implement a Set or Map in WebAssembly, and the key there is an anyref (that might also be a stringref) for example, then there is currently no good way to reliably generate a hash, since there is no numeric pointer and the contents are not necessarily known. Right now seems that one would import JS Set or Map to tap into the engine's hashing, but that won't be portable?

jakobkummerow commented 1 year ago

Extending built-in hashes to other objects could indeed make sense, but for WasmGC objects is most likely post-MVP material. It could well be that for stringrefs it'll be a post-MVP topic as well; for now we are just enabling experimentation.

Regarding application-defined-ness, it gets even worse: V8's hashes are seeded, and V8 allows embedders to randomize the hash seed on startup, which means that the same string will have a different hash when the program runs in a different process. AFAIK Chrome does not make use of this feature, but Node does (to protect against certain kinds of DoS attacks). It's not clear to me whether this would be a problem for Wasm use cases; it's also not clear to me what could be done differently if it is determined to be a problem: changing the internal hashing strategy would be a regression for use cases that want the randomization, storing an internal and a Wasm-exposed hash separately (1) would be nasty or impossible to implement and (2) would undo most of the efficiency benefits of making string.hash a built-in operation. So this is certainly something that needs to be thought through before standardizing string.hash.

eqrion commented 1 year ago

SpiderMonkey does not have a pre-computed hash stored on most JS strings. It is computed on demand and then cached through other means.

Also, is this the first time in the web platform we'd directly expose a platform provided hash code for JS strings? If so, this seems like something that may need to be floated with TC39, and also exposed through the String prototype.

jakobkummerow commented 1 year ago

@eqrion:

V8 also computes string hashes on demand; we certainly shouldn't create an expectation or specification that string.hash is always just field load.

I agree with the concern about exposing internal hashes.

Feel free to drag your feet on implementing this instruction. We don't even know yet how helpful it'll be. We'll report back when we have performance results from experimenting!

wingo commented 1 year ago

I think it may make sense to define the hash function. The reason is that you would like to be able to generate static hash dispatch tables for dispatch over a known set of strings:

(br_table $L0 $L1 ... $LN-1
   (i32.rem_u (string.hash (local.get $str)) (i32.const N))

But of course you will only be able to do that if you know the hash function at compile-time.