Open jakobkummerow opened 1 year ago
I am in favor of giving Wasm explicit control. Wasm is low level so it should have control over low-level details like that.
Rust also has a distinction depending on whether the slice copies or not: &str
never copies, but String
copies.
So the programmer knows whether the string is a copy or a slice, and is able to explicitly convert a slice into a copy (by using the to_string
or to_owned
methods).
So the prior art is that the distinction is important, and it's important to have manual control over it.
The
stringview_wtf16.slice
operation creates a string S2 that's a substring of another string S1. V8 originally implemented this by reusing its existing machinery for creating string slices, i.e. S2 would only store a pointer to S1, a start offset, and a length, making theslice
operation itself very cheap because it doesn't need to copy any characters, and making S2 very memory-efficient (especially when it's long, so that many characters are shared with S1).Like almost any engineering technique, this approach is a tradeoff with pros and cons. In this case, its drawback is that a relatively short string (S2) can be keeping an otherwise-unreachable other string (S1) alive, which might be much bigger. We have just encountered a real-world use case where this happens a lot and has detrimental effects on memory consumption: user-space JSON parsing. Any string values in the decoded data keep the entire JSON source string in memory when they are represented as string slices. (Note that the specifics of the JSON format are irrelevant, the same would happen for any other string-based serialization format where string values occur as substrings of the serialized data, JSON just happens to be a very common case of this.) For comparison, JavaScript doesn't run into this problem, because the implementation of
JSON.parse
never creates sliced strings, because it can make a very reasonable assumption that the JSON source string is meant to be thrown away when the parsing operation is complete.As a short-term mitigation, we have just changed V8's implementation of
stringview_wtf16.slice
to make full copies when the substring is at most half as long as the original string. A possible longer-term engine-side mitigation might be to build a tracking mechanism across the entire managed heap that figures out which long strings are being used by which slices, and can then decide whether converting these slices to copies would be overall beneficial or not. That's fairly heavyweight machinery for solving a rather limited problem though, so while I'd be excited to see this prototyped, I'm not convinced that it would actually be a reasonable choice for engines to spend the engineering hours to build such machinery and, more importantly, the runtime CPU cycles to operate it.As a proper solution, it would seem reasonable to me to add a way to let Wasm modules explicitly control the desired behavior:
slice_allow_shallow
andslice_force_copy
(illustrative names, to be bikeshedded).I don't feel strongly about how this will be solved, I'm just saying that I think it would make sense to have some improvement over the status quo.
Without new instructions, copies of string slices can be forced by converting the string to an array and then back to a string (e.g. using
string.encode_wtf16_array
+string.new_wtf16_array
, or their wtf8 equivalents); engines could conceivably learn to recognize this pattern and shortcut it to avoid the double copy (but not shortcut it too much: if they avoided both copies, that would defeat the point). IMHO this would not be a pretty solution though; it feels similar to dubious JavaScript performance tricks like "temporarily install this object as the prototype of some other dummy object just to make the JS engine to switch its internal representation into a certain mode".(Same discussion for imported strings: https://github.com/WebAssembly/js-string-builtins/issues/1)