**lukewagner** opened this issue 3 months ago
> We could also allow multiple partial reads of the same `validx` so that a single logical `list` value could be read into multiple non-contiguous segments (think `readv()`).
Another use case for partial reads:

For UDP sockets, `recv`/`recvfrom`/`recvmsg` are defined to truncate the received contents to fit the provided buffer. wasi-sockets returns the data as a `list<u8>`, which is potentially larger than the buffer provided to wasi-libc. At the moment, wasi-libc performs the truncation itself.

If the proposed `value.lower` doesn't support partial reads, wasi-libc would still need to perform an intermediate allocation+copy in order to meet the expected truncation semantics.
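As a concrete illustration, here is a minimal Rust sketch (hypothetical names, not the actual wasi-libc code) of the truncation copy that becomes necessary when the full datagram arrives as an owned list but the caller's buffer is smaller:

```rust
/// Sketch of POSIX recv() truncation semantics: copy what fits into the
/// caller's buffer and silently drop the rest. Without partial reads,
/// wasi-libc must first materialize the whole datagram (the intermediate
/// allocation) and then perform this copy.
fn recv_truncating(full_datagram: &[u8], caller_buf: &mut [u8]) -> usize {
    let n = full_datagram.len().min(caller_buf.len());
    caller_buf[..n].copy_from_slice(&full_datagram[..n]);
    n // number of bytes actually delivered to the caller
}
```

With a partial `value.lower`, the host could instead copy at most `caller_buf.len()` bytes straight into the destination, skipping the intermediate allocation entirely.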
Neat idea!
I guess this was partially foreshadowed by adding more and more `i32` handles representing external data to the canonical ABI over the past year (first resources, then recently async callback contexts and stream read/write handles).
Once `MAX_FLAT_PARAMS`/`RESULTS` is exceeded, the total number of extra elements is known at code-generation time to both caller and callee, so either the caller could allocate the memory ahead of the call and pass it via a `retptr` argument, or this is done with a lazy index but without the need to pass the allocation length to the caller. I would prefer the first solution, as it saves one extra call, at the cost of the host needing to allocate storage for too many parameters and results to exported functions at least once in linear memory via `realloc`, since the normal host stack isn't accessible to the guest.
With respect to string lengths with non-matching encoding on both sides: The host could eagerly retrieve the lazy value before passing it on, convert it and then pass the calculated length in the new encoding along.
> The host could eagerly retrieve the lazy value before passing it on, convert it and then pass the calculated length in the new encoding along.
I suspect the special case handling for strings is because of this (from CanonicalABI.md):
> Storing strings is complicated by the goal of attempting to optimize the different transcoding cases. In particular, one challenge is choosing the linear memory allocation size before examining the contents of the string. The reason for this constraint is that, in some settings where single-pass iterators are involved (host calls and post-MVP adapter functions), examining the contents of a string more than once would require making an engine-internal temporary copy of the whole string, which the component model specifically aims not to do.
I.e.: the component model wants to be able to validate+transcode+copy strings in a single pass.
> I.e.: the component model wants to be able to validate+transcode+copy strings in a single pass.
I fear that this would require partial lazy read (and still some sort of transient buffering in the host). My bet was on the host having abundant resources compared to individual components. Perhaps the host should communicate the maximum transcoded string length somehow in advance?
It feels a bit like text files suddenly changing their length on read due to CR/LF recoding.
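The analogy can be made concrete. Two facts make a spec-defined worst-case bound workable: n UTF-16 code units transcode to at most 3n UTF-8 bytes (a BMP character is 1 unit and up to 3 bytes; a surrogate pair is 2 units and exactly 4 bytes), and, as with CR/LF recoding, the actual length after conversion can shrink. A small Rust sketch:

```rust
/// Worst-case UTF-8 byte count for a string of `units` UTF-16 code units:
/// each code unit contributes at most 3 bytes.
fn utf16_to_utf8_worst_case(units: usize) -> usize {
    3 * units
}

/// The CR/LF analogy: recoding line endings changes the byte length.
fn crlf_to_lf(s: &str) -> String {
    s.replace("\r\n", "\n")
}
```

For example, `"héllo"` is 5 UTF-16 code units but only 6 UTF-8 bytes, well under the 15-byte worst case the receiver would reserve up front.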
I was ruminating on a proper (Rust) API for lazy value lowering and realized that what I am really looking for is to remove heap allocations from function calls, so returning a stack-compatible object (e.g. `heapless::Vec<N>`) is the most helpful.

But this would require the code generator to know the maximum number of elements, and the WIT file is the best place to define it. So I guess what I am really looking for is `list<u32, ..16>` or `string<..32>` (with whatever definition of string length is reasonable). These types, like `tuple<>`, become embedded in the normal result storage and, like fixed-length lists (#181), never require a `realloc` call.
Which doesn't mean I'm not a fan of lazy lowering; I just feel that I should prioritize a bounded-container implementation instead.
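A minimal sketch of what such a bounded list could desugar to in bindings (hypothetical `BoundedVec`, playing the role of `heapless::Vec` or a generated type for `list<u32, ..16>`): the capacity is a const generic, so the value is a plain inline array and lowering it never calls `realloc`:

```rust
/// Fixed-capacity vector stored entirely inline; `N` would come from the
/// WIT bound (e.g. 16 for a hypothetical `list<u32, ..16>`).
struct BoundedVec<T: Copy + Default, const N: usize> {
    buf: [T; N],
    len: usize,
}

impl<T: Copy + Default, const N: usize> BoundedVec<T, N> {
    fn new() -> Self {
        Self { buf: [T::default(); N], len: 0 }
    }

    /// Rejects the element instead of reallocating when the bound is hit.
    fn push(&mut self, v: T) -> Result<(), T> {
        if self.len == N {
            return Err(v);
        }
        self.buf[self.len] = v;
        self.len += 1;
        Ok(())
    }

    fn as_slice(&self) -> &[T] {
        &self.buf[..self.len]
    }
}
```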
> Another use case for partial reads:
Assuming that, in the case of a partial read/write, you want to communicate back to the caller how much was read/written, I think then what you want is the buffer types of #369, since they convey this extra semantic information to the caller, and/or, with Preview 3, a `stream`.
> Once `MAX_FLAT_PARAMS`/`RESULTS` is exceeded, the total number of extra elements is known at code generation time to both caller and callee, so either the caller could allocate the memory ahead of the call and pass it via a `retptr` argument, or this is done with a lazy index but without the need to pass the allocation length to the caller. I would prefer the first solution as it saves one extra call.
For lowered imports, the current CABI uses the `retptr` method you mention, and with lazy lowering it would continue to (since there's no `realloc` in any case). It's only for lifted exports that the current CABI does a `realloc`, which we could make lazy. But even better (I forgot to mention this in the OP), we could borrow `task.start` (from the async proposal, allowing it to be used with any function that uses the `lazy` ABI) and then even lifted exports could pass in a pointer to a statically-sized buffer (as an `i32` param to `task.start`) in the `>MAX_FLAT_PARAMS` case.
> I.e.: the component model wants to be able to validate+transcode+copy strings in a single pass.
Yeah, that's the goal.
> Perhaps the host should communicate the maximum transcoded string length somehow in advance?
Agreed! This is what I was overly-briefly alluding to in that paragraph about `string`.
This isn't a fully-fleshed out idea, but it's coming up in #369 and so I thought it might be nice to split out here. I'm tentatively excited about it, though (in that "this is probably what we should've done in the first place... sigh" sort of way).
## Motivation

When lowering a dynamically-sized value (`list` or `string`) or lowering more than `MAX_FLAT_PARAMS`/`RESULTS`, the Canonical ABI specifies calling a wasm-defined allocation function (indicated per import/export via the `realloc` immediate on `canon lift`/`lower`, but usually there's just one, exported with the name `cabi_realloc`) to get the linear memory to lower the `list`/`string` into. This works decently well, but sometimes it's constraining, leading to various ideas for further customizations, with more ideas brewing. But the root problem seems to be the control-flow inversion of the adapter controlling the calls to `realloc`, which limits wasm's control of the order and manner of allocation.

## Idea
In all the places where the CABI currently says to call `realloc` and store the resulting `i32` pointer, instead we could say that the CABI stores a fresh `i32` index that refers to the abstract value-to-be-lowered. The `i32` length (of `list`s and `string`s) is stored eagerly as usual. Control flow is then handed to wasm, passing it this shallowly-lowered tuple of params/results. Wasm can now look at the `i32` length values to see how much to allocate and then acquire that memory however it wants; if there are multiple values-to-be-lowered, wasm can do smart things like batching allocations. Once the destination address is selected, wasm then calls a new built-in to lower the value, e.g.:

```
canon value.lower $t: [validx:i32 dstp:i32] -> []
```
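To make the inverted control flow concrete, here is a host-side Rust sketch (hypothetical names; it models only a flat `list<u8>` and a full, single read) of the table of values-to-be-lowered: lengths are lowered eagerly, the bytes wait in the host until the guest calls the lowering built-in, and lowering the same index twice traps:

```rust
use std::collections::HashMap;

/// Host-side table of values-to-be-lowered, keyed by `validx`.
struct ValTable {
    next: u32,
    pending: HashMap<u32, Vec<u8>>,
}

impl ValTable {
    fn new() -> Self {
        Self { next: 0, pending: HashMap::new() }
    }

    /// Shallow lowering: hand the guest a (validx, len) pair instead of a
    /// pointer, so the guest can decide where (and whether) to allocate.
    fn lift(&mut self, bytes: Vec<u8>) -> (u32, usize) {
        let idx = self.next;
        self.next += 1;
        let len = bytes.len();
        self.pending.insert(idx, bytes);
        (idx, len)
    }

    /// `canon value.lower`: copy into guest-chosen memory, consuming the
    /// index; a second lower of the same index "traps" (modelled as Err).
    /// `dst` must be at least `len` bytes (full single read only).
    fn lower(&mut self, validx: u32, dst: &mut [u8]) -> Result<(), &'static str> {
        let bytes = self.pending.remove(&validx).ok_or("trap: already lowered")?;
        dst[..bytes.len()].copy_from_slice(&bytes);
        Ok(())
    }

    /// `canon finish-call`: reset the table, dropping any unlowered values.
    fn finish_call(&mut self) {
        self.pending.clear();
    }
}
```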
For a `list<list<u8>>`, the first call to `value.lower` will generate N more `validx`s, one for each `list<u8>` element, which the calling wasm again gets to control the allocation and lowering of (and so on, recursively through the type structure).

We could also allow multiple partial reads of the same `validx` so that a single logical `list` value could be read into multiple non-contiguous segments (think `readv()`). Once fully lowered, a `validx` will trap if lowered again.

`string` is the problem child as usual. If the lifting and lowering side agree on `string-encoding`, the lifted number of code units can be passed directly and tagged as being precise. Otherwise, an "approximate length" can be spec-defined (derived from the lifted+lowered `string-encoding` and lifted number of code units) and passed instead, and the wasm code can use repeated partial reads (or perhaps a worst-case allocation and a single read).

At the end of the call (for lifted export functions: when wasm returns; for lowered functions: when wasm calls some `canon finish-call: [] -> []` built-in that triggers the callee's `post-return`), the temporary table of values-to-be-lowered is reset, dropping any unlowered values (optimizing the case of unneeded values).

## Compatibility
To avoid breaking existing Preview 2 component binaries, we could require components to opt into this new ABI via a new `lazy` `canonopt`. In the future, when we make our final breaking binary changes leading up to 1.0, if we think `lazy` is the Best Thing Ever, there should be a simple automatic conversion from non-`lazy` into `lazy` components (generating little type-directed wrapper core functions that contain the `cabi_realloc`+`value.lower` calls), so that we can kill `realloc` and make `lazy` the only option (or not, if having `realloc` is convenient).

## Even better: value forwarding
We could also allow lifting a value-to-be-lowered directly from its `i32` index (using a tag bit to indicate whether the `i32` is a pointer or an index). This would give us the ability to cheaply "forward" a value through an intermediate component, the need for which I've seen come up on a number of occasions, especially when one component lightly wraps another.

(There are some delicate lifetime issues to work out here regarding return values and `post-return`, but I think with the explicit `finish-call`, it can all work out?)

## Relation to caller-supplied buffer types
This is an idea that I think is complementary to the new caller-supplied buffer types proposed in #369; see this comment and the preceding comments for how the buffer types address a subtly different (and more specialized/advanced) use case.
(Apologies in advance for no replies over the next 2 weeks; I'll be on vacation.)