WebAssembly / stringref

Other
37 stars 2 forks source link

Substrings: slices vs. copies #63

Open jakobkummerow opened 1 year ago

jakobkummerow commented 1 year ago

The stringview_wtf16.slice operation creates a string S2 that's a substring of another string S1. V8 originally implemented this by reusing its existing machinery for creating string slices, i.e. S2 would only store a pointer to S1, a start offset, and a length, making the slice operation itself very cheap because it doesn't need to copy any characters, and making S2 very memory-efficient (especially when it's long, so that many characters are shared with S1).

Like almost any engineering technique, this approach is a tradeoff with pros and cons. In this case, its drawback is that a relatively short string (S2) can be keeping an otherwise-unreachable other string (S1) alive, which might be much bigger. We have just encountered a real-world use case where this happens a lot and has detrimental effects on memory consumption: user-space JSON parsing. Any string values in the decoded data keep the entire JSON source string in memory when they are represented as string slices. (Note that the specifics of the JSON format are irrelevant, the same would happen for any other string-based serialization format where string values occur as substrings of the serialized data, JSON just happens to be a very common case of this.) For comparison, JavaScript doesn't run into this problem, because the implementation of JSON.parse never creates sliced strings, because it can make a very reasonable assumption that the JSON source string is meant to be thrown away when the parsing operation is complete.

As a short-term mitigation, we have just changed V8's implementation of stringview_wtf16.slice to make full copies when the substring is at most half as long as the original string. A possible longer-term engine-side mitigation might be to build a tracking mechanism across the entire managed heap that figures out which long strings are being used by which slices, and can then decide whether converting these slices to copies would be overall beneficial or not. That's fairly heavyweight machinery for solving a rather limited problem though, so while I'd be excited to see this prototyped, I'm not convinced that it would actually be a reasonable choice for engines to spend the engineering hours to build such machinery and, more importantly, the runtime CPU cycles to operate it.

As a proper solution, it would seem reasonable to me to add a way to let Wasm modules explicitly control the desired behavior:

I don't feel strongly about how this will be solved, I'm just saying that I think it would make sense to have some improvement over the status quo.

Without new instructions, copies of string slices can be forced by converting the string to an array and then back to a string (e.g. using string.encode_wtf16_array + string.new_wtf16_array, or their wtf8 equivalents); engines could conceivably learn to recognize this pattern and shortcut it to avoid the double copy (but not shortcut it too much: if they avoided both copies, that would defeat the point). IMHO this would not be a pretty solution though; it feels similar to dubious JavaScript performance tricks like "temporarily install this object as the prototype of some other dummy object just to make the JS engine to switch its internal representation into a certain mode".

(Same discussion for imported strings: https://github.com/WebAssembly/js-string-builtins/issues/1)

Pauan commented 1 year ago

I am in favor of giving Wasm explicit control. Wasm is low level so it should have control over low-level details like that.

Rust also has a distinction depending on whether the slice copies or not: &str never copies, but String copies.

So the programmer knows whether the string is a copy or a slice, and is able to explicitly convert a slice into a copy (by using the to_string or to_owned methods).

So the prior art is that the distinction is important, and it's important to have manual control over it.