WebAssembly / stringref


Algorithmic complexity of instructions #56

Open eqrion opened 1 year ago

eqrion commented 1 year ago

What is the desired algorithmic complexity of the following instructions?

Disclaimer: I understand that the spec says nothing about the algorithmic complexity of instructions. However, the rough algorithmic complexity of instructions does matter in practice for whether a module can feasibly run on a given implementation.

  1. get_stringview_wtf8
    1. O(n) or O(1)?
    2. If stringref is implemented with a WTF16 encoding, this sounds like it’s O(n) to copy.
    3. If stringref is implemented with a WTF8 encoding, this can be an O(1) no-op.
    4. Is there a fancy way to guarantee O(1) here?
  2. get_stringview_wtf16
    1. O(n) or O(1)?
    2. If stringref is implemented with a WTF16 encoding, this can be an O(1) no-op.
    3. If stringref is implemented with a WTF8 encoding, this sounds like it’s O(n) to copy or to compute breadcrumbs.
    4. Is there a fancy way to guarantee O(1) here?
  3. string.concat
    1. O(n) or O(1)?
    2. Are implementations expected to use ropes to defer concatenation of strings?
  4. string.eq
    1. O(n) or O(1)?
    2. Are implementations expected to use hash-consing to guarantee O(1) comparisons?
  5. string.measure_[utf8/wtf8/wtf16]
    1. O(n) or O(1)?
  6. string.is_usv_sequence
    1. O(n) or O(1)?

It seems like many operations here are either linear or constant time depending on how the host implements stringref: WTF8 or WTF16 backed, with or without ropes, hash-consed constants or not.

If this is the case, I’m concerned that it could lead to future incompatibilities where modules are written assuming an operation is constant time, when it is in fact linear on a host with a different implementation.
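
To make the concern concrete, here is a minimal sketch (using the proposal's string.concat and string.const as discussed in this issue; the exact text-format spelling is an assumption) of a module that builds a string by repeated concatenation. With rope-backed strings each concatenation can be roughly O(1) and the loop roughly O(n); with flat, eagerly copied strings each concatenation is O(length) and the same loop degrades to O(n^2):

(func $repeat (param $chunk stringref) (param $count i32) (result stringref)
    (local $acc stringref)
    ;; start from an empty string constant
    (local.set $acc (string.const ""))
    (block $done
        (loop $again
            (br_if $done (i32.eqz (local.get $count)))
            ;; ~O(1) with ropes, O(|$acc| + |$chunk|) with flat copying
            (local.set $acc (string.concat (local.get $acc) (local.get $chunk)))
            (local.set $count (i32.sub (local.get $count) (i32.const 1)))
            (br $again)
        )
    )
    (local.get $acc)
)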

jakobkummerow commented 1 year ago

In an ideal world, they'd all be O(1), but the real world isn't always ideal :-)

In particular, when a module creates a situation where an encoding change is unavoidable (such as: creating a string from utf8 data, then measuring its wtf16 length, or vice versa), O(n) cost is unavoidable. What we can hope and should aim for is that avoidable O(n) costs are, indeed, avoided.
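
For instance, the following sketch (assuming the proposal's string.new_utf8 with a memory immediate of 0 and string.measure_wtf16; spellings may have shifted) forces a transcode no matter how the engine stores strings, so an O(n) cost is inherent:

(func $utf8_bytes_to_wtf16_length (param $ptr i32) (param $len i32) (result i32)
    ;; decode UTF-8 bytes from linear memory into a stringref, then ask for
    ;; its length in WTF-16 code units; the encoding change makes this O(n)
    (string.measure_wtf16
        (string.new_utf8 0 (local.get $ptr) (local.get $len))
    )
)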

The stringref proposal is very consciously designed around accepting this reality. It aims to give engines the freedom to pick from a variety of fundamental implementation techniques as well as from a large bag of optional optimizations to make as many cases as possible as fast as possible. Lazy conversions and flexible internal representations are key tools there, at least in the long run; to get an implementation off the ground, fairly simple implementations are viable as well. At the same time, it aims to give module producers the means to express their performance expectations. That's why view creation is an explicit step: once operating on a view, modules can safely assume that the complexity of any operation is that which you can reasonably expect from a string that's internally represented by the view's specified encoding (e.g. stringview_wtf16.get_codeunit should be O(1)).
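
As a sketch of that pattern (written here with string.as_wtf16 for what this thread calls get_stringview_wtf16, and a local of view type; exact names and typing rules are assumptions), a module pays for the view once and then treats each per-codeunit access as O(1):

(func $sum_codeunits (param $s stringref) (result i32)
    (local $view stringview_wtf16)
    (local $i i32)
    (local $len i32)
    (local $sum i32)
    ;; view creation: possibly O(n) on a WTF-8-backed engine,
    ;; expected O(1) on a WTF-16-backed one
    (local.set $view (string.as_wtf16 (local.get $s)))
    (local.set $len (stringview_wtf16.length (local.get $view)))
    (block $done
        (loop $next
            (br_if $done (i32.ge_u (local.get $i) (local.get $len)))
            ;; expected O(1) per access once the view exists
            (local.set $sum
                (i32.add
                    (local.get $sum)
                    (stringview_wtf16.get_codeunit (local.get $view) (local.get $i))
                )
            )
            (local.set $i (i32.add (local.get $i) (i32.const 1)))
            (br $next)
        )
    )
    (local.get $sum)
)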

This does lead to a situation where, in many engines, either get_stringview_utf8 or get_stringview_wtf16 may well be O(1) but not both, and engines might not agree with each other in their choice. That's clearly not an "ideal world" situation, but at least it opens up the possibility (or: raises the ceiling) for great performance. That said, in the long run, I can see sufficiently advanced engines supporting both utf8 and wtf16 encodings internally and only switching on demand (or even storing both (lazily created) encodings of strings that need them), so that modules that only use one view type can always expect (amortized) O(1) view creation. (FWIW, we're not doing that in V8 for now, as we haven't seen demand for it yet, and it would be a big project.)

Andy has written up the performance characteristics of the V8 prototype: https://docs.google.com/document/d/1w2jLY7LuMG1grm_u7avtoAqYW1tcvPt4zc5_yNwTRyQ/edit (of course, this isn't a normative part of the stringref proposal).

Pragmatically, I'd recommend starting with whatever string representations/operations you already have in SpiderMonkey, and only implementing additional optimizations/representations over time if the need arises.

Jamesernator commented 1 year ago

Given that the whole point of this proposal is that engines/hosts use different implementations for strings, it's not surprising that different kinds of strings will have different performance characteristics.

Although I do wonder: some languages might well be prepared to compile code that handles either kind of string, in which case an instruction for determining the best type of stringview to use could be useful.

i.e. Something like:

(if
    (i32.eq
        ;; Returns a code for the preferred encoding
        (stringview.preferred)
        ;; Placeholder constant standing in for "UTF-8 preferred"
        (i32.const 0)
    )
    (then
        ;; Use the utf-8 implementation
    )
    (else
        ;; Use other implementation
    )
)

This could even be parameterized to take a stringref, as hosts/engines may well have multiple kinds of stringref internally:

(stringview.preferred (local.get $theString))
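
Building on that idea (purely hypothetical, since no stringview.preferred instruction exists in the proposal), a module could query the preference once at instantiation, cache it in a global, and branch on the cached flag in hot code instead of re-querying:

(global $prefers_utf8 (mut i32) (i32.const 0))

(func $init
    ;; hypothetical instruction: returns a code identifying the engine's
    ;; preferred encoding; 0 is assumed here to mean UTF-8
    (global.set $prefers_utf8
        (i32.eq (stringview.preferred) (i32.const 0))
    )
)

(start $init)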