WebAssembly / stringref

Other
37 stars 2 forks source link

[Feature] string.ord or string.lt + string.gt #37

Open MaxGraey opened 2 years ago

MaxGraey commented 2 years ago

I don't think string.eq alone will be enough. For string sorting we also need string.lt (a < b) and string.gt (a > b) operations.

Or ideally replacing string.eq (or at least addition) with the more generalized string.ord which perform lexicographical comparison and return -1 if (a < b), +1 if (a > b) and 0 if (a == b).

WDYT?

wingo commented 2 years ago

Related to #5.

I think the question is really, "enough for what". I'm sympathetic to the use case, but I think that we may be able to get by with something more limited until we have more data indicating that more instructions would be enough (that word again!) of a win that we should standardize them. Three points:

Firstly, note that string.eq is inherently faster for its use case than computing lexicographic sort order; for different-length strings, we know immediately that they are unequal, but we don't know their order until examining their contents. (Of course optimizing compilers could recover the == 0 pattern but it's probably best for this proposal to include more predictable primitives.)

Secondly, I think there may be OK MVP solutions. Just trying to put myself in your shoes here -- If I were implementing a compiler for a language that required lexicographic ordering and I knew that I were targetting the web, I would implement an MVP string.ord by calling out to JS. Easy to implement and runs fast. If I wanted to polyfill in wasm based on the existing stringref MVP, I'd probably call out to an internal helper that makes two iterator views and implements the ordering. I think it's reasonable to hope that escape analysis at the optimizing compiler tier would eventually eliminate the iterator allocations, if iterator view performance ever became a priority. Or, to avoid allocation, I might encode WTF-8 to linear memory (or a cached i8 array) and use memcmp, if that's already part of the runtime.

Finally, as you might start requiring more and more elaborate string algorithms (collation, line breaking, normalization, etc), I'd definitely want to be leaning on some system library, be it another WASI component or JS/Intl. I can see how having string.ord in the stringrefs MVP might be useful but with the broader view I could also see it being quite acceptable from both a performance and maintenance POV in the system's string support library.

MaxGraey commented 2 years ago

I don't think string.ord is as complicated as toUpperCase / toLowerCase. It's a very basic operation just combining the two and could be polyfilled as (a > b) - (a < b). But stringref doesn't provide these operations either. I suggest keeping string.eq but also adding string.ord. Of course the user can create his own implementation of string.ord compare each podepoint individually during iterating whole minimal string. But this would be extremely inefficient. And given that string.ord is likely to be used to sort an array of strings, this would be frustratingly slow. In fact, I see no problem adding string.org to the MVP. It's much easier, considering that many Wasm engines already have similar functionality for JS code

rossberg commented 2 years ago

I believe string ordering on Unicode strings is locale-dependent and generally a pretty complex operation.

MaxGraey commented 2 years ago

string.ord could be locale independent (for local-dependent comparision we need Intl which can be out of scope for now and probably reserve some immediate flag/hint for future). And for latin1 and WTF16 it's mostly similar as memory comparision (which btw also doesn't support in WebAssembly and bulk memory ops). For UTF16/UTF8/WTF8 it may be a little more complicated but still relatively simple. See this article for example: https://icu-project.org/docs/papers/utf16_code_point_order.html