WebAssembly / js-string-builtins

JS String Builtins
Other
9 stars 7 forks source link

Substrings: slices vs. copies #1

Open jakobkummerow opened 1 year ago

jakobkummerow commented 1 year ago

The WebAssembly.String.substring operation in the draft spec doc creates a string S2 that's a substring of another string S1. V8 originally implemented this by reusing its existing machinery for creating string slices, i.e. S2 would only store a pointer to S1, a start offset, and a length, making the substring operation itself very cheap because it doesn't need to copy any characters, and making S2 very memory-efficient (especially when it's long, so that many characters are shared with S1).

Like almost any engineering technique, this approach is a tradeoff with pros and cons. In this case, its drawback is that a relatively short string (S2) can be keeping an otherwise-unreachable other string (S1) alive, which might be much bigger. We have just encountered a real-world use case where this happens a lot and has detrimental effects on memory consumption: user-space JSON parsing. Any string values in the decoded data keep the entire JSON source string in memory when they are represented as string slices. (Note that the specifics of the JSON format are irrelevant, the same would happen for any other string-based serialization format where string values occur as substrings of the serialized data, JSON just happens to be a very common case of this.) For comparison, JavaScript doesn't run into this problem, because the implementation of JSON.parse never creates sliced strings, because it can make a very reasonable assumption that the JSON source string is meant to be thrown away when the parsing operation is complete.

As a short-term mitigation, we have just changed V8's implementation of WebAssembly.String.substring to make full copies when the substring is at most half as long as the original string. A possible longer-term engine-side mitigation might be to build a tracking mechanism across the entire managed heap that figures out which long strings are being used by which slices, and can then decide whether converting these slices to copies would be overall beneficial or not. That's fairly heavyweight machinery for solving a rather limited problem though, so while I'd be excited to see this prototyped, I'm not convinced that it would actually be a reasonable choice for engines to spend the engineering hours to build such machinery and, more importantly, the runtime CPU cycles to operate it.

As a proper solution, it would seem reasonable to me to add a way to let Wasm modules explicitly control the desired behavior:

I don't feel strongly about how this will be solved, I'm just saying that I think it would make sense to have some improvement over the initial design sketch in this regard.

Without new operations, copies of string slices can be forced by converting the string to an array and then back to a string (using W.S.toWtf16Array + W.S.fromWtf16Array); engines could conceivably learn to recognize this pattern and shortcut it to avoid the double copy (but not shortcut it too much: if they avoided both copies, that would defeat the point). IMHO this would not be a pretty solution though; it feels similar to dubious JavaScript performance tricks like "temporarily install this object as the prototype of some other dummy object just to make the JS engine to switch its internal representation into a certain mode".

(Same discussion for stringref: https://github.com/WebAssembly/stringref/issues/63)

eqrion commented 11 months ago

Sorry for the delay in responding.

I am unsure about whether it makes sense to allow hinting for 'slice' vs 'copy'. The optimization of forcing an eager copy when copying a small percentage of a large string seems fairly easy to do, so I'd lean towards just expecting engines to do that. It seems like a nice refinement of existing optimizations.

If we had a hint, I would wonder if engines would still do this check (as it's cheap) to see if they want to override the hint for cases where toolchains get it wrong.