abrown opened this issue 1 year ago
The initial support can be for 1x1 tensors, as you can then provide multiple 1x1 input tensors to a model if you really need to get it to work.
I think the longer-term solution is the idea we discussed, but I want to clarify it slightly, as what you described above isn't quite what I had in mind. I am proposing we handle it the same way most of the ML frameworks handle it. More specifically, most frameworks allow passing a linear array that is re-interpreted based on the shape parameter. You can apply the same mapping to a list<list<u8>>, where each element in the list represents the element being mapped.
A simple example is a 25-element list<list<u8>> that a shape of 5x5 re-interprets as a 5x5 tensor.
I think (3) is an overly strong assumption, and there are currently issues in how we handle this as well. For example, WIT currently does a copy when passing in a tensor, and since we lack a set_input_f32(..., vec<f32>), it will be an unaligned copy that is then passed into embedded libraries that may or may not validate alignment (and, if they do validate, will have to be fixed).
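To make the alignment concern concrete, here is a small Rust sketch; it is not part of any wasi-nn binding, and the function names are made up for illustration. It shows the two ways a backend can turn a raw list<u8> buffer into f32 values: a copying decode that tolerates any alignment, and a zero-copy reinterpretation that only succeeds when the buffer happens to be aligned.

```rust
// Illustrative only, not a wasi-nn API: two ways a backend can read f32
// values out of a raw byte buffer received as list<u8>.
fn f32s_by_copy(raw: &[u8]) -> Vec<f32> {
    // Safe regardless of alignment, but copies: each 4-byte chunk is
    // decoded individually.
    raw.chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

fn f32s_zero_copy(raw: &[u8]) -> Option<&[f32]> {
    // Zero-copy reinterpretation only works if the buffer happens to be
    // 4-byte aligned; otherwise the prefix/suffix slices are non-empty and
    // the backend must fall back to copying (or reject the input).
    let (prefix, floats, suffix) = unsafe { raw.align_to::<f32>() };
    if prefix.is_empty() && suffix.is_empty() {
        Some(floats)
    } else {
        None
    }
}

fn main() {
    let raw = 1.5f32.to_le_bytes().to_vec();
    assert_eq!(f32s_by_copy(&raw), vec![1.5]);
    // Whether this succeeds depends entirely on how `raw` was allocated.
    let _maybe_aligned = f32s_zero_copy(&raw);
}
```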
I am proposing we handle it the same way most of the ML frameworks handle it. More specifically, most frameworks allow passing a linear array that is re-interpreted based on the shape parameter. You can apply the same mapping to a list<list<u8>>, where each element in the list represents the element being mapped.
I agree that tensor-data should be a linear array (as it currently is: list<u8>) and I even documented how it could use row-major ordering (here). The tensor-type is what is used to "reinterpret" the tensor-data. So nothing new there.

The problem with changing the type of tensor-data to list<list<u8>> is that it only applies to two-dimensional tensors. For a 5x5x5 tensor, tensor-data would need a new type, list<list<list<u8>>>. And so on down the line. But, IIUC, this kind of "multiple possible types" for tensor-data is not possible in WIT.

Did I understand what you were getting at correctly? If so, we probably need to look at some of the other options above instead of changing the type of tensor-data.
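For reference, the row-major mapping mentioned above works the same for any rank, which is why one flat list<u8> plus a dimensions list covers the 5x5x5 case without any nested type. A minimal Rust sketch (the struct and function here are illustrative only, not spec or binding code):

```rust
/// Illustrative only: a tensor as the spec models it today, a flat byte
/// buffer plus a dimensions list that tells the consumer how to interpret it.
struct Tensor {
    dimensions: Vec<u32>, // e.g. [5, 5, 5]
    data: Vec<u8>,        // flat, row-major element bytes
}

/// Row-major offset of an element (one coordinate per dimension),
/// measured in elements, not bytes.
fn row_major_offset(dimensions: &[u32], index: &[u32]) -> usize {
    index
        .iter()
        .zip(dimensions)
        .fold(0usize, |acc, (&i, &dim)| acc * dim as usize + i as usize)
}

fn main() {
    // A 5x5x5 tensor of f32: 125 elements, 500 bytes, one flat buffer.
    let t = Tensor {
        dimensions: vec![5, 5, 5],
        data: vec![0u8; 5 * 5 * 5 * 4],
    };
    // Element (2, 3, 4) lives at element offset 2*25 + 3*5 + 4 = 69,
    // i.e. byte offset 69 * 4 = 276 for f32.
    let offset = row_major_offset(&t.dimensions, &[2, 3, 4]);
    assert_eq!(offset, 69);
    assert_eq!(offset * 4, 276);
    assert_eq!(t.data.len(), 125 * 4);
}
```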
Tensors are n-dimensional and can have several discerning factors beyond type and shape. In GPU memory environments, there may also commonly be strides. Images and video have channel and depth formats describing the shape. Leaving tensor-data as a linear array is simple and efficient (as long as we also solve the lack of bindings to avoid copies).

The current tensor-type does not supply any additional metadata. Channel format, for example, would have to be implicit, with the author selecting appropriately for framework and device, unless the model is built format-agnostic or the tensor metadata exposes the desired shape, which could then imply the format.
That is not what I was getting at at all in your 5x5x5 tensor example. The simplest way I can put it is that list<list<u8>> is still a flat, row-major list where each inner list holds one element's bytes, so the shape re-interprets it the same way it re-interprets list<u8> today.
Mostly agree with @shschaefer. The channel and depth formats determine the shape, but once the shape is determined, that determines the number of bytes in the linear array. The encoding and arrangement of those bytes may be impacted by what the model expects, but as you point out, we lack sufficiently detailed tensor metadata to convey that information to the underlying system. This convinces me that we should just leave the API as is for now and leave it up to each implementation to decide how it wants to encode that information into the input tensor.
This implicitly means we won't be able to add things like range checks (i.e., if the shape is 5x5x5, then the array should contain 125 elements), because there is no guarantee the number of bytes is connected to the shape of the tensor.
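To spell out the range check being given up here, a rough Rust sketch (the helper is hypothetical, not part of the spec): for fixed-size element types the expected byte count follows from the shape, but for a variable-length bytes type there is nothing to compute.

```rust
// Sketch of the range check discussed above (illustrative, not spec text):
// for fixed-size element types the expected byte length follows directly
// from the shape, so a host could validate the buffer up front.
fn expected_byte_len(dimensions: &[u32], bytes_per_element: Option<usize>) -> Option<usize> {
    let elements = dimensions.iter().product::<u32>() as usize;
    // For f32 this is Some(4), for u8 Some(1), etc. For a variable-length
    // `bytes` element type there is no per-element size, so no check is possible.
    bytes_per_element.map(|size| elements * size)
}

fn main() {
    // 5x5x5 f32 tensor: 125 elements * 4 bytes = 500 bytes expected.
    assert_eq!(expected_byte_len(&[5, 5, 5], Some(4)), Some(500));
    // 5x5x5 `bytes` tensor: the shape alone says nothing about the byte count.
    assert_eq!(expected_byte_len(&[5, 5, 5], None), None);
}
```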
Transformer networks for LLMs take input sequences with a fixed length, so in that regard our current wasi-nn spec is sufficient. However, the data preprocessing part, where text of arbitrary length is tokenized with padding and truncation, is not covered in the spec. We had similar issues with image classification, converting an image to a tensor and needing some helper functions/metadata. We talked about options of incorporating that into the spec, or leaving it to implementers in an SDK. Maybe we should revisit this topic?
I think I have a new suggestion to add to this thread: why don't we use tensor-dimensions to describe the lengths of the bytes arrays laid out end-to-end in the tensor-data field?

The problem here is how to communicate the lengths of these variable-length bytes tensors to the backend framework. Just to recap, we've discussed several ideas for this:
1. do not support bytes tensors; this does not seem like a valid option since we expect to support LLMs
2. encode the lengths of the bytes arrays within the tensor-data field: this introduces a new encoding/decoding protocol that I expect we would need to specify in the spec and implement in various places — not great.
3. add set_input_* methods for each data type; this means that a user would use set_input_f32(data: list<f32>) to create a tensor of f32s and would use set_input_bytes(data: list<list<u8>>) to create a tensor of bytes. @geekbeast likes this approach since he feels it also solves another problem, alignment; it's a large overhaul of the API surface, though, and there could be usability and performance implications to think through
4. fill tensor-dimensions with the maximum bytes length; if we need a 42-byte bytes tensor, we place the 42 bytes in tensor-data and use tensor-dimensions: [1, 42] (1x42). The problem with this shows up when multiple bytes arrays need to fit into one tensor (e.g., batching?): if we have three of them and 42 is the longest array, we would write tensor-dimensions: [3, 42] (3x42) and then tell users to pad each of the smaller arrays with some bytes — not great for several reasons.

I would like to propose a new, fifth option, based on the fourth:
use tensor-dimensions but change its meaning slightly for bytes: for other types, tensor-dimensions would represent the dimensions of a matrix (MxNx...), but for bytes, tensor-dimensions would represent the size of each byte array contained in the tensor-data field. We document that, for bytes, users should lay out their bytes arrays end-to-end in one long list<u8> attached to tensor-data, then construct a tensor-dimensions based on the size of each of those arrays. For example:
- a bytes array of length 42: tensor-dimensions: [42] and tensor-data: [<42 bytes>]
- three bytes arrays of lengths 3, 42, and 17: tensor-dimensions: [3, 42, 17] and tensor-data: [<3 bytes> <42 bytes> <17 bytes>]
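To illustrate what this fifth option would look like from the caller's side, here is a rough Rust sketch; the pack_bytes_tensor helper is hypothetical (not part of any existing binding) and just shows the end-to-end layout plus the per-item lengths carried in tensor-dimensions:

```rust
// Hypothetical caller-side helper for the fifth option: lay the byte arrays
// out end-to-end and record each array's length in the dimensions field.
fn pack_bytes_tensor(items: &[&[u8]]) -> (Vec<u32>, Vec<u8>) {
    let dimensions = items.iter().map(|item| item.len() as u32).collect();
    let data = items.iter().flat_map(|item| item.iter().copied()).collect();
    (dimensions, data)
}

fn main() {
    let items: [&[u8]; 3] = [b"abc", b"hello world", b"x"];
    let (dimensions, data) = pack_bytes_tensor(&items);
    // tensor-dimensions: [3, 11, 1]; tensor-data: the 15 bytes end-to-end.
    assert_eq!(dimensions, vec![3, 11, 1]);
    assert_eq!(data.len(), 15);

    // The consumer recovers each item by walking the dimensions over the data.
    let mut offset = 0;
    for len in &dimensions {
        let item = &data[offset..offset + *len as usize];
        offset += item.len();
    }
    assert_eq!(offset, data.len());
}
```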
As @mingqiusun said, LLMs take fixed-size tokens in a sequence. This is only for supporting frameworks that have a string or object dtype for the tensor.
Your new suggestion only supports 1xN-dimensional tensors, not arbitrary shapes. It's probably still possible to embed that information into the shape field; for example, say the first entry in the shape is the number of real tensor dimensions, followed by those dimensions, and the rest is mapped as the byte lengths in row-major order (see the sketch below). But this means you will need language bindings to make it reasonable to use, as that is some complicated logic to construct those calls.
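To show how much bookkeeping that shape-field encoding would push onto bindings, here is a Rust sketch of the scheme described above; the encode_shape helper is purely illustrative, not a proposal for the spec:

```rust
// Purely illustrative: the shape field carries [rank, d1..dn, len_1..len_k],
// where k = d1 * ... * dn is the number of `bytes` items and each len_i is
// that item's byte length in row-major order.
fn encode_shape(real_dims: &[u32], item_lens: &[u32]) -> Option<Vec<u32>> {
    let expected_items: u64 = real_dims.iter().map(|&d| d as u64).product();
    if expected_items != item_lens.len() as u64 {
        return None; // one length entry is required per element
    }
    let mut shape = Vec::with_capacity(1 + real_dims.len() + item_lens.len());
    shape.push(real_dims.len() as u32);
    shape.extend_from_slice(real_dims);
    shape.extend_from_slice(item_lens);
    Some(shape)
}

fn main() {
    // A 2x2 tensor of `bytes` whose four items have lengths 3, 42, 17, and 9
    // would be described as [2, 2, 2, 3, 42, 17, 9], which already needs
    // careful reading: this is the complexity being objected to.
    assert_eq!(
        encode_shape(&[2, 2], &[3, 42, 17, 9]),
        Some(vec![2, 2, 2, 3, 42, 17, 9])
    );
}
```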
It might be better to do nothing than introduce this much mapping complexity for what is probably a less common dtype.
Quick correction on (3): I think we could solve just this issue with a minimal amount of complexity by adding list<list<u8>>.
@squillace, this seems most relevant to the ONNX backend. You all probably have a better idea of how frequently the string dtype is used by models. Do you see the need for this additional complexity to shoehorn in non-fixed-size dtypes?
Usually, an LLM model expects input tensors in fixed shapes such as [batch, sequence, feature]. This maps well to our current spec for tensors. Maybe what is needed is a helper function such as text2tensor? But the challenge is that this process of conversion is highly customizable, not exposed by all frameworks and hard to standardize.
LLMs are not the justification for adding this type unless you are trying to support ggml string input or wrapper models around LLMs. This still reduces to the question of which dtypes we should support. Without any guiding principles, this seems like a fairly arbitrary decision that should not be made until a compelling use case demonstrates otherwise.
Complicated encoding schemes shoehorned into the current ABI add a lot of baggage to the spec and mean they have to be supported for a very long time. Better to do nothing than to lock in some brittle, limited pattern forever.
@geekbeast's use case was the original motivation for this issue; since he feels like he can make do without a new type, let's park this issue until someone absolutely needs it. Like he mentions above, we don't want to lock in some "complicated encoding scheme," so some caution is warranted here. If anyone does end up looking at this in the future, my current take is that options 3 (set_input_*) and 5 (tensor-dimensions: MxNx...) seem the best bets, though neither is perfect.
Some models accept tensors whose items are bytes. In order to add these to enum tensor-type, we need to figure out how to represent these bytes items as tensor data, which is currently a u8 array: https://github.com/WebAssembly/wasi-nn/blob/747d8dfb389e061eeb52a518aafe777dbf70bdda/wit/wasi-nn.wit#L44

Imagine the situation where a model's input is a 1x10 tensor of bytes; this means 10 byte arrays need to be stored in the tensor data section. Unfortunately, these byte arrays could all be of different sizes; how should the specification handle this? Some options:
- state that only 1x1 tensors are possible with bytes, or something of that nature
- require the byte arrays to share a fixed length N, making the tensor 1x10xN

There might be other options — let's discuss them in this issue. @geekbeast has floated the idea that tensor data should be represented as a list<list<u8>>: this way we can use the WIT/WITX type system for encoding each of the lengths of the bytes arrays. This has some problems: (1) what about tensors with more dimensions? We don't know how many list<...> wrappers we need. (2) This representation doesn't fit other tensor types well: e.g., we don't need to know that f32 is a 4-byte list<u8>. (3) Coercing tensors into a specific WIT/WITX type could involve some copying; ideally we just want to be able to pass some pre-existing bytes (e.g., from a decoded image) as tensor data without additional overhead.

Your feedback is appreciated to figure this out!
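To make problems (1) and (2) concrete, here is a small Rust sketch; neither type below is the actual wasi-nn API, they are just illustrative analogues of the two representations being compared.

```rust
// Illustrative Rust analogues of the two representations under discussion;
// neither is the actual wasi-nn API.

// Today: one flat byte buffer works for every element type and every rank;
// the shape and element type reinterpret it.
type FlatTensorData = Vec<u8>;

// The proposal: each inner vector is one element's bytes. Problem (1) in the
// text above: it is unclear how many levels of nesting higher-rank tensors
// would need. Problem (2): for fixed-size types like f32 the nesting is pure
// overhead, since every 4-byte element gets its own allocation.
type NestedTensorData = Vec<Vec<u8>>;

fn main() {
    let nested: NestedTensorData = vec![
        1.0f32.to_le_bytes().to_vec(),
        2.0f32.to_le_bytes().to_vec(),
    ];
    // The same two f32 values as 8 contiguous bytes in the flat form.
    let flat: FlatTensorData = nested.iter().flatten().copied().collect();
    assert_eq!(flat.len(), 8);
}
```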