WebAssembly / component-model

Repository for design and specification of the Component Model
Other
966 stars 79 forks source link

Bounded lists and strings #385

Open cpetig opened 3 months ago

cpetig commented 3 months ago

This extends #181 with the ability to embed bounded strings or lists directly within other structures while maintaining the variable length property.

Some environments require avoiding all memory allocations at runtime. This happens typically in either small embedded environments or safety related products.

To achieve this in a call returning a string or list/vector you typically combine a fixed size buffer with a length indicator into a containing data structure, e.g. see heapless::String or boost::static_string.

Thus I propose to optionally extend the fixed size buffers of #181 as implemented in #384 with a preceding length indicator. This length indicator could be of type u8 for strings/vectors below 256 elements, u16 below 64k and u32 otherwise. Unused elements in the buffer should be considered uninitialized (MaybeUninit in Rust).

This addresses the problem described in my recent comment in #383 .

I propose the WIT syntax string<..32> or list<u32, ..16>, the binary format would need a new type for bounded strings, lists could use a negative len specifier in 0x67 (see #384 ) or a different defvaltype.

lukewagner commented 2 months ago

This is a good idea and I've been considering it too. Assuming we add the readable-/writable-buffer types proposed at the end of #369, are there current use cases you're looking at that would benefit from this over the buffer types? Most use cases I can think of could alternatively use readable-/writable-buffer types just as well, although I can imagine some more complex hypothetical scenarios that couldn't.

cpetig commented 2 months ago

Whether lazy value lowering can support returning into pre-allocated buffers equally well is a good question.

I know that if caller and callee share the same maximum size the string/list can never overflow the provided buffer on the callee side, simplifying the code considerably.

But what really sold me to this solution is the ease of use on the caller API side. E.g. f: func() -> string<..16>; can be used as let value = f(); (value being of type heapless::String<16> instead of separately providing a buffer and handling overflow. I simply have no idea how to call lazy value lifting in Rust or C++ with the same elegance.

lukewagner commented 2 months ago

Yes, good point; bounded lists/strings would have the ideal source bindings for languages with facilities for inline allocation.

In the motivation given in your first post, you mentioned the common pattern in C-style APIs of transferring an arbitrary-sized value by repeatedly calling a function and transferring it out chunk at a time. But for that specific case, I would think the writable-buffer type would be the better choice in that it both: (1) gives the caller control of the buffer size (instead of fixing it in the WIT once and for all), (2) provides a pointer to the linear memory up-front as a parameter which, depending on the host implementation, may remove a copy.

However, that's not the only use case for bounded lists/strings, so I mostly wanted to ask, for prioritization reasons, if you had other specific use cases in mind (in WASI or other WIT interfaces you're looking at).

cpetig commented 2 months ago

My primary motivation for working on this is the (functional safety) requirement of deterministic execution time. The most straightforward solution is with dedicated pre-allocated buffers provided by the caller.

Return value optimization can enable this even via the very friendly return notation (which internally maps to an output parameter for non atomic types).

I feel with #384 ready (and having its own merit) and this being just a small extension this would be the lowest hanging fruit, given that stream and writable-buffer feel much less possible to finalize within this year.

lukewagner commented 2 months ago

Right, doing this as a small addition to #384 makes sense and I think the feature makes sense in general. In terms of scheduling/prioritization, you're right that stream will take a while, but I had been thinking that maybe it'd make sense to work on readable/writable buffers as the very next thing. That being said, fixed+bounded-length lists are definitely the smaller/simpler feature and so we could have them ready somewhat sooner -- but I wonder whether the (embedded and other) use cases that want "caller-supplied buffers" will be satisfied or will simply immediately turn to wanting readable/writable buffers.

And just, concretely, if we did have bounded-length lists, but not yet buffers, to avoid realloc with read, we'd need something like:

  read: func(n: u64) -> result<list<u8; ..128>>;

and I wonder if that hard-coded constant (whatever we choose) will seem too ad hoc and imminently-meant-to-be-deprecated to actually ship as a whole WASI release, knowing that what read really wants is:

  read: func(buf: writable-buffer<u8>) -> result;

OTOH, depending on the timings involved, it could be practically worth it to ship the former before the latter. ¯\_(ツ)_/¯

cpetig commented 2 months ago

To be honest I never had adding limits to WASI in mind when I thought of this feature, but more specialized custom WIT interfaces which are fully controlled by the runtime designer (or used between two components where the interface can be unilaterally defined by a single party).

I get that this intention defeats the purpose of a generic standardized interface, which is exactly what WASI is about. But for now bounded lists seem the best option for user friendly zero allocation interfaces.

lukewagner commented 2 months ago

I see, thanks for explaining. Given that #181 was primarily waiting for implementation feedback before merging and this is a relatively small addition, and you've helpfully created #384 as a rebase of #181, would you like to also add this idea to #384? Given both options, perhaps the WAT syntax we want is (list u8 (eq 4)) and (list u8 (lt 4)) for list<u8; 4> and list<u8; ..4> resp.

cpetig commented 2 months ago

Oh, wow, this sounded like a small thing to do, but I found that the following things make it really large:

Also I went for an additional byte in the 0x67 encoding to distinguish between fixed lists, bounded lists and bounded strings.

You can find the current prototype at https://github.com/cpetig/component-model/tree/bounded-lists . I will continue the work, but it will take longer than originally expected.

lukewagner commented 2 months ago

Oh right, I forgot to get back to strings. Because of the encoding issues, I think perhaps we should not attempt to extend strings. Instead, if an API wants to do this bounded-size thing, it can just pick a particular encoding (as part of the API's contract) and use a list<u8; ..N>. Otherwise, we'd need string<..K>, where K is necessarily the number of Unicode Scalar Values, to imply a K*4-byte buffer which would be a lot more wasteful than a UTF-8-specialized list<u8; ..K>, which simply implies a K-byte buffer.

cpetig commented 2 months ago

I don't think there is a perfect solution for strings due to encoding variation, but taking the maximum length as the number of u8s in utf8 and number of u16s in any of the two utf16 encodings, there is still some possibility of overflow when recoding between components - but the bound has an easy to understand meaning and if both components have the same encoding the bound is identical, so no overflow is possible in the more common case.

At least this was the most sensible solution I came up with.

badeend commented 2 months ago

In addition to an upper bound, could we also add a lower bound, like list<u8, 1..255>?

lukewagner commented 2 months ago

Sounds reasonable to me.

cpetig commented 2 months ago

I understand that from a contract design standpoint a lower bound makes sense, but I can't come up with a practical example where it would be beneficial. @badeend Do you have a specific case in mind?

(different from the obvious lower bound of 1 for lists?)

badeend commented 1 month ago

Indeed to indicate a list may not be empty:


https://github.com/WebAssembly/wasi-io/blob/285a31442f687a5ceee86044f9f2b0a9e2beae21/wit/poll.wit#L34C4-L46C57

    /// This function traps if either:
    /// - the list is empty, or:
    /// - the list contains more elements than can be indexed with a `u32` value.
    ///
    /// A timeout can be implemented by adding a pollable from the
    /// wasi-clocks API to the list.
    ///
    /// This function does not return a `result`; polling in itself does not
    /// do any I/O so it doesn't fail. If any of the I/O sources identified by
    /// the pollables has an error, it is indicated by marking the source as
    /// being ready for I/O.
    @since(version = 0.2.0)
    poll: func(in: list<borrow<pollable>>) -> list<u32>;

Better expressed as:

    poll: func(in: list<borrow<pollable>, 1..>) -> list<u32>;
// or even:
    poll: func(in: list<borrow<pollable>, 1..4294967295>) -> list<u32>;
// :P

https://github.com/badeend/wasi-sockets/blob/8f3b082e6ead0eebbcedf10d8a06d16dffbaa68d/wit/tls.wit#L30-L35

    /// Application-Layer Protocol Negotiation (ALPN) protocol ID.
    ///
    /// ALPN IDs are between 1 and 255 bytes long. Typically, they represent the
    /// binary encoding of an ASCII string (e.g. `[0x68 0x32]` for `"h2"`),
    /// though this is not required.
    type alpn-id = list<u8>;

Better expressed as:

    type alpn-id = list<u8, 1..255>;