WebAssembly / wasi-nn

Neural Network proposal for WASI

Add a `bytes` tensor type #55


abrown commented 1 year ago

Some models accept tensors whose items are bytes. In order to add these to enum tensor-type, we need to figure out how to represent these bytes items as tensor data, which is currently a u8 array:

https://github.com/WebAssembly/wasi-nn/blob/747d8dfb389e061eeb52a518aafe777dbf70bdda/wit/wasi-nn.wit#L44

Imagine the situation where a model's input is a 1x10 tensor of bytes; this means 10 byte arrays need to be stored in the tensor data section. Unfortunately, these byte arrays could each be of a different size; how should the specification handle this?

There are several possible options — let's discuss them in this issue. @geekbeast has floated the idea that tensor data should be represented as a list<list<u8>>: this way we can use the WIT/WITX type system to encode the length of each bytes array. This has some problems: (1) what about tensors with more dimensions? We don't know how many list<...> wrappers we need. (2) This representation doesn't fit other tensor types well: e.g., we don't need to know that an f32 is a 4-byte list<u8>. (3) Coercing tensors into a specific WIT/WITX type could involve some copying; ideally we just want to be able to pass some pre-existing bytes (e.g., from a decoded image) as tensor data without additional overhead.
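
To make problem (3) concrete, here is a minimal Rust sketch (hypothetical function names, not the actual wasi-nn bindings) contrasting the flat representation with the nested one:

```rust
// Current representation: tensor data is one flat byte buffer, so a
// pre-existing buffer (e.g., decoded image pixels) can be passed as-is.
fn as_flat_tensor_data(pixels: &[u8]) -> &[u8] {
    pixels // no copy needed
}

// Nested representation: every element must be wrapped in its own
// byte list, forcing an allocation and a copy per element.
fn as_nested_tensor_data(values: &[f32]) -> Vec<Vec<u8>> {
    values
        .iter()
        .map(|v| v.to_le_bytes().to_vec()) // a 4-byte list<u8> per f32
        .collect()
}

fn main() {
    let pixels = vec![0u8; 224 * 224 * 3];
    assert_eq!(as_flat_tensor_data(&pixels).len(), pixels.len());
    assert_eq!(as_nested_tensor_data(&[1.0, 2.0]).len(), 2);
}
```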

Your feedback is appreciated as we figure this out!

geekbeast commented 1 year ago

The initial support can be for 1x1 tensors, since you can then provide multiple 1x1 input tensors to a model if you really need to get it to work.

I think the longer-term solution is the idea we discussed, but I want to clarify it slightly, as what you described above isn't quite what I had in mind. I am proposing we handle it the same way most of the ML frameworks handle it. More specifically, most frameworks allow passing a linear array that is re-interpreted based on the shape parameter. You can apply the same mapping to a list<list>, where each element in the list represents one tensor element being mapped.

A simple example is a 25-element list<list> being mapped to a 5x5 tensor by row-major mapping. The only thing that WIT needs to preserve is the length of the linear array. This has the benefit of being in line with how other frameworks handle mapping dimensions, and it fits all tensor types well (especially if we expand set_input so that it has forms for primitive types).
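
As a rough illustration of this row-major mapping (a hypothetical sketch, not spec or binding code), the index of an element in the linear array can be computed from the shape:

```rust
// Row-major mapping: element (row, col) of an MxN tensor lives at
// index `row * N + col` in the linear array; this generalizes to
// any rank by folding over the dimensions.
fn row_major_index(shape: &[usize], coords: &[usize]) -> usize {
    assert_eq!(shape.len(), coords.len());
    coords.iter().zip(shape).fold(0, |acc, (&c, &dim)| {
        assert!(c < dim);
        acc * dim + c
    })
}

fn main() {
    let shape = [5, 5]; // a 25-element linear array viewed as 5x5
    assert_eq!(row_major_index(&shape, &[0, 0]), 0);
    assert_eq!(row_major_index(&shape, &[2, 3]), 13); // 2 * 5 + 3
    assert_eq!(row_major_index(&shape, &[4, 4]), 24);
}
```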

I think (3) is an overly strong assumption, and there are currently issues in how we handle this as well. For example, WIT currently performs a copy when passing in a tensor, and since we lack a set_input_f32(..., vec<f32>), it will be an unaligned copy that is then passed into embedded libraries that may or may not validate alignment (and if they do validate, they will have to fix it up).
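
A minimal Rust sketch of the alignment hazard being described (hypothetical helper; actual embedder code will differ):

```rust
// `align_to` splits a byte slice into an unaligned prefix, an aligned
// middle, and an unaligned suffix; a non-empty prefix or suffix means
// the buffer cannot be viewed directly as f32s and must be copied
// into an aligned buffer instead.
fn try_view_as_f32(bytes: &[u8]) -> Option<&[f32]> {
    let (prefix, floats, suffix) = unsafe { bytes.align_to::<f32>() };
    if prefix.is_empty() && suffix.is_empty() {
        Some(floats)
    } else {
        None // unaligned: caller must realign (i.e., copy)
    }
}
```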

abrown commented 1 year ago

> I am proposing we handle it the same way most of the ML frameworks handle it. More specifically, most frameworks allow passing a linear array that is re-interpreted based on the shape parameter. You can apply the same mapping to a list<list>, where each element in the list represents one tensor element being mapped.

I agree that tensor-data should be a linear array (as it currently is: list<u8>) and I even documented how it could use row-major ordering (here). The tensor-type is what is used to "reinterpret" the tensor-data. So nothing new there.

The problem with changing the type of tensor-data to list<list<u8>> is that it only applies to two-dimensional tensors. For a 5x5x5 tensor, tensor-data would need a new type, list<list<list<u8>>>. And so on down the line. But, IIUC, this kind of "multiple possible types" for tensor-data is not possible in WIT.
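
The same problem can be shown in Rust terms (hypothetical type aliases for illustration): each tensor rank would require a different nested type, while a single flat buffer plus a dimensions field covers any rank:

```rust
// Hypothetical nested representations: every rank needs its own
// Rust (and WIT) type, so one `tensor-data` field cannot cover all.
type Rank2Data = Vec<Vec<u8>>;      // e.g., 5x5
type Rank3Data = Vec<Vec<Vec<u8>>>; // e.g., 5x5x5
// ...and so on for every additional dimension.

// By contrast, the current flat representation covers any rank,
// with `tensor-dimensions` supplying the shape separately:
type TensorData = Vec<u8>;
type TensorDimensions = Vec<u32>;
```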

Did I understand what you were getting at correctly? If so, we probably need to look at some of the other options above instead of changing the type of tensor-data.

shschaefer commented 1 year ago

Tensors are n-dimensional and can have several distinguishing factors beyond type and shape. In GPU memory environments, strides are also common. Images and video have channel and depth formats describing the shape. Leaving tensor-data as a linear array is simple and efficient (as long as we also solve the lack of bindings to avoid copies).

The current tensor-type does not supply any additional metadata. Channel format, for example, would have to be implicit, with the author selecting appropriately for framework and device, unless the model is built format-agnostic or the tensor metadata exposes the desired shape, which could then imply the format.

geekbeast commented 1 year ago

That is not at all what I was getting at in your 5x5x5 tensor example. The simplest way I can put it is that list<list> is the linear array for the bytes type.

Mostly agree with @shschaefer. The channel and depth formats determine the shape, but once the shape is determined, that determines the number of bytes in the linear array. The encoding and arrangement of those bytes may be impacted by what the model expects, but as you point out, we lack sufficiently detailed tensor metadata to convey that information to the underlying system. This convinces me that we should just leave the API as-is for now and leave it up to each implementation to decide how it wants to encode that information into the input tensor.

This implicitly means we won't be able to add checks like range checks (i.e., if the shape is 5x5x5 then the array should hold 125 elements), because there is no guarantee the number of bytes is connected to the shape of the tensor.
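
For comparison, here is what such a range check looks like for fixed-size dtypes (a hypothetical sketch; the point above is that no analogous check exists for variable-length bytes):

```rust
/// Hypothetical validation for fixed-size dtypes: the expected buffer
/// length is the product of the dimensions times the element size.
fn validate_len(dims: &[u32], elem_size: usize, data: &[u8]) -> bool {
    let elems: usize = dims.iter().map(|&d| d as usize).product();
    data.len() == elems * elem_size
}

fn main() {
    // A 5x5x5 f32 tensor must carry 125 * 4 = 500 bytes.
    assert!(validate_len(&[5, 5, 5], 4, &vec![0u8; 500]));
    // For `bytes` there is no fixed element size, so no such check.
}
```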

mingqiusun commented 1 year ago

Transformer networks for LLMs take input sequences with a fixed length, so in that regard our current wasi-nn spec is sufficient. However, the data preprocessing step, where text of arbitrary length is tokenized with padding and truncation, is not covered in the spec. We had similar issues with image classification around converting an image to a tensor and the need for some helper functions/metadata. We talked about options for incorporating that into the spec, or leaving it to implementers in SDKs. Maybe we should revisit this topic?

abrown commented 1 year ago

I think I have a new suggestion to add to this thread: why don't we use tensor-dimensions to describe the lengths of the bytes arrays laid out end-to-end in the tensor-data field?

The problem here is how to communicate the lengths of these variable-length bytes tensors to the backend framework. Just to recap, we've discussed several ideas for this:

  1. do not allow bytes tensors; this does not seem like a valid option since we expect to support LLMs.
  2. encode the lengths of the bytes arrays within the tensor-data field; this introduces a new encoding/decoding protocol that I expect we would need to specify in the spec and implement in various places — not great.
  3. add new set_input_* methods for each data type; this means a user would call set_input_f32(data: list<f32>) to create a tensor of f32s and set_input_bytes(data: list<list<u8>>) to create a tensor of bytes. @geekbeast likes this approach since he feels it also solves another problem, alignment; it's a large overhaul of the API surface, though, and there could be usability and performance implications to think through.
  4. use the tensor-dimensions with the maximum bytes length; if we need a 42-byte bytes tensor, we place the 42 bytes in tensor-data and use tensor-dimensions: [1, 42] (1x42). The problem with this shows up when multiple bytes arrays need to fit into one tensor (e.g., batching?): if we have three of them and 42 is the longest array, we would write tensor-dimensions: [3, 42] (3x42) and then tell users to pad each of the smaller arrays with filler bytes — not great for several reasons.

I would like to propose a new, fifth option, based on the fourth:

  5. use tensor-dimensions but change its meaning slightly for bytes: for other types, tensor-dimensions would represent the dimensions of a matrix (MxNx...), but for bytes, tensor-dimensions would represent the size of each byte array contained in the tensor-data field. We document that, for bytes, users should lay out their bytes arrays end-to-end in one long list<u8> attached to tensor-data, then construct tensor-dimensions from the size of each of those arrays. For example (see the sketch after these examples):

    • for a single bytes array of length 42: tensor-dimensions: [42] and tensor-data: [<42 bytes>]
    • for three bytes arrays of lengths 3, 42, and 17: tensor-dimensions: [3, 42, 17] and tensor-data: [<3 bytes> <42 bytes> <17 bytes>]
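
A minimal sketch of this fifth option (hypothetical helper, not a proposed API) that derives tensor-dimensions and tensor-data from a set of byte arrays:

```rust
/// Hypothetical construction of option 5: lay the bytes arrays out
/// end-to-end and record each array's length in the dimensions field.
fn encode_bytes_tensor(arrays: &[&[u8]]) -> (Vec<u32>, Vec<u8>) {
    let dimensions: Vec<u32> = arrays.iter().map(|a| a.len() as u32).collect();
    let data: Vec<u8> = arrays.iter().flat_map(|a| a.iter().copied()).collect();
    (dimensions, data)
}

fn main() {
    let a: &[u8] = &[1, 2, 3];
    let b: &[u8] = &[9; 42];
    let c: &[u8] = &[7; 17];
    let (dims, data) = encode_bytes_tensor(&[a, b, c]);
    assert_eq!(dims, vec![3, 42, 17]);      // tensor-dimensions: [3, 42, 17]
    assert_eq!(data.len(), 3 + 42 + 17);    // arrays laid out end-to-end
}
```
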
geekbeast commented 1 year ago

As @mingqiusun said, LLMs take fixed-size tokens in a sequence. This is only for supporting frameworks that have a string or object dtype for the tensor.

geekbeast commented 1 year ago

Your new suggestion only supports 1xN dimensional tensors, not arbitrary shapes. It's probably still possible to embed that information into the shape field: for example, say the first entry in shape is the length of the tensor dimensions, followed by the tensor dimensions themselves, with the rest mapped as the lengths of the bytes arrays in row-major order. But this means you will need language bindings to make it reasonable to use, as that is some complicated logic for constructing those calls.
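
To illustrate the mapping complexity being described, a purely hypothetical self-describing encoding of the shape field might look like this:

```rust
/// Hypothetical self-describing shape encoding for a bytes tensor:
/// [rank, dim_1 .. dim_rank, len_1 .. len_k], where k (the number of
/// byte arrays) is the product of the dims, in row-major order.
fn encode_shape(dims: &[u32], byte_lens: &[u32]) -> Vec<u32> {
    let mut shape = Vec::with_capacity(1 + dims.len() + byte_lens.len());
    shape.push(dims.len() as u32); // first entry: length of the dims
    shape.extend_from_slice(dims);
    shape.extend_from_slice(byte_lens); // one length per element
    shape
}

fn main() {
    // A 2x2 bytes tensor whose four elements are 3, 42, 17, and 5 bytes:
    assert_eq!(
        encode_shape(&[2, 2], &[3, 42, 17, 5]),
        vec![2, 2, 2, 3, 42, 17, 5]
    );
    // Every producer and consumer must implement this mapping correctly,
    // which is the complexity the comment above warns about.
}
```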

It might be better to do nothing than introduce this much mapping complexity for what is probably a less common dtype.

geekbeast commented 1 year ago

Quick correction on (3): I think we could solve just this issue with a minimal amount of complexity by adding list<list> or list, but my preference is to make no changes, as it isn't required for generative AI and it's not clear that those dtypes are broadly used in the ecosystem. Making no change seems in line with what the other solutions are attempting to accomplish, but avoids introducing additional complexity for something we may not even want to support. I also think the energy going into the bytes discussion would be more fruitfully spent on tensor metadata along the lines of what @shschaefer brought up.

@squillace This seems most relevant to the ONNX backend. You all probably have a better idea of how frequently the string dtype is used by models. Do you all see the need for this additional complexity to shoehorn in non-fixed-size dtypes?

mingqiusun commented 1 year ago

Usually, an LLM expects input tensors in fixed shapes such as [batch, sequence, feature]. This maps well to our current spec for tensors. Maybe what is needed is a helper function such as text2tensor? But the challenge is that this conversion process is highly customizable, not exposed by all frameworks, and hard to standardize.
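
A rough sketch of what such a helper might do (a hypothetical text2tensor over already-tokenized ids; real tokenizers are model-specific, which is exactly the standardization challenge mentioned above):

```rust
/// Hypothetical `text2tensor` step: pad or truncate token ids to a
/// fixed sequence length so the result fits a fixed-shape tensor.
fn text2tensor(tokens: Vec<u32>, seq_len: usize, pad_id: u32) -> Vec<u32> {
    let mut ids = tokens;
    ids.truncate(seq_len);       // truncate long inputs
    ids.resize(seq_len, pad_id); // pad short inputs
    ids
}

fn main() {
    // Assume some upstream, model-specific tokenizer produced these ids.
    let ids = text2tensor(vec![101, 7592, 102], 8, 0);
    assert_eq!(ids, vec![101, 7592, 102, 0, 0, 0, 0, 0]); // shape: [1, 8]
}
```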

geekbeast commented 1 year ago

LLMs are not the justification for adding this type unless you are trying to support ggml string input or wrapper models around LLMs. This still reduces to the question of which dtypes we should support. Without any guiding principles this seems like a fairly arbitrary decision that should just be made, until a compelling use case demonstrates otherwise.

Complicated encoding schemes that shoehorn this into the current ABI add a lot of baggage to the spec and mean they have to be supported for a very long time. Better to do nothing than to lock in some brittle, limited pattern forever.

abrown commented 1 year ago

@geekbeast's use case was the original motivation for this issue; since he feels he can make do without a new type, let's park this issue until someone absolutely needs it. As he mentions above, we don't want to lock in some "complicated encoding scheme," so some caution is warranted here. If anyone does end up looking at this in the future, my current take is that options 3 (set_input_*) and 5 (reusing tensor-dimensions) seem the best bets, though neither is perfect.