Samsung / ONE

On-device Neural Engine

[luci] Shape inference for operators whose inputs have dynamic shape #13697

Open jinevening opened 3 weeks ago

jinevening commented 3 weeks ago

What

Let's support shape inference for operators whose inputs have dynamic shape.

I've made a list of operators to support dynamic-shaped LLM inference.

First milestone (for token gen model)

| Op | Assignee |
| --- | --- |
| concat | @jinevening |
| transpose | Already supported |
| batchmatmul | @zetwhite |
| div | @jinevening |
| add | @jinevening |
| softmax | @jinevening |

Second milestone (for the whole (prompt parsing, token gen) model)

| Op | Assignee |
| --- | --- |
| mul | @qsunki |
| fully_connected | @Hanjin-Choi |
| rsqrt | @pcs1265 |
| quantize | @kyeong8139 |
| reshape | @jongwonyang |
| stridedslice | @qsunki |
| neg | @Hanjin-Choi |
| logistic | @jongwonyang |

Others (for any other issues)

| Op | Assignee |
| --- | --- |
| pad | @icodo98 |
| range | @kyeong8139 |

Why

Emerging models (e.g., LLMs) require dynamic shape tensors, mainly to deal with varying sequence lengths. However, the current shape inference logic for circle does not handle dynamic shapes correctly. Let's extend the existing shape inference rules to handle dynamic shape tensors.

jinevening commented 3 weeks ago

To anyone interested in this issue: I've heard that SSAFY(Samsung Software Academy For Youth) mentors (@shs-park and @zetwhite) are interested in the second milestone of this issue. Please do not assign yourself to the second milestone.

zetwhite commented 3 weeks ago

I assigned myself to batchmatmul to understand this issue better.

jinevening commented 3 weeks ago

Before implementing shape inference logic for each operator, we have to answer the question below.

Can we assume an unknown dimension is valid?

Let's consider a concat operator with two inputs.

[1,4,3,1] [1,4,?,1]

If the concat axis is 2, we can simply say that the output shape will be [1,4,?,1].

But if the concat axis is 1, what would the result be? I think there are three choices: (1) [1,8,3,1] (2) [1,8,?,1] (3) Stop shape inference.

(1) assumes ? is 3, which is the only valid value according to the semantics of concat. (2) just propagates the unknown dimension to the corresponding dimension of the output. (3) seems too drastic to me; we can actually infer the output shape.

I'd like to go with (1), because a proper backend (which can see the actual value of ?) should raise an error for a wrong input. For example, if a user gives a wrongly shaped tensor (? != 3), an error message like "non-axis dimension of all inputs must be the same" will be printed. So the only possible shape of the output is [1,8,3,1].
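Choice (1) can be sketched as a small shape-inference routine. This is a hypothetical illustration, not luci's actual code; `None` stands for an unknown (`?`) dimension, and `infer_concat_shape` is an invented name:

```python
def infer_concat_shape(shapes, axis):
    """Infer a concat output shape under choice (1): assume an unknown
    non-axis dimension equals the known one from a sibling input."""
    rank = len(shapes[0])
    out = []
    for d in range(rank):
        dims = [s[d] for s in shapes]
        known = [v for v in dims if v is not None]
        if d == axis:
            # axis dimension: sum of input dims; unknown if any is unknown
            out.append(sum(known) if len(known) == len(dims) else None)
        else:
            # non-axis dimensions of all inputs must agree
            if known and len(set(known)) != 1:
                raise ValueError(
                    "non-axis dimension of all inputs must be the same")
            # choice (1): take the known value; stay unknown only if
            # no input knows this dimension
            out.append(known[0] if known else None)
    return out
```

With the example above, `infer_concat_shape([[1, 4, 3, 1], [1, 4, None, 1]], axis=1)` yields `[1, 8, 3, 1]`, while `axis=2` yields `[1, 4, None, 1]`.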

@Samsung/one_compiler Please leave comments about this issue if you have any other suggestions.


@mhs4670go I found that xla also goes with the approach (1). PTAL c1 in https://github.com/tensorflow/tensorflow/blob/v2.17.0/third_party/xla/xla/service/shape_inference.cc#L280-L292

seanshpark commented 3 weeks ago

Please leave comments about this issue if you have any other suggestions.

I'm OK with @jinevening 's conclusion.

Maybe @Samsung/one_onert can give how it is treated in runtime?

jinevening commented 3 weeks ago

I've heard that shape inference is done after an actual input is given. @hseok-oh @ragmani Or, does onert have inference logic for dynamic shape?

seanshpark commented 3 weeks ago

(1) [1,8,3,1] (2) [1,8,?,1]

Have you checked torch or tensorflow for this case?

jinevening commented 3 weeks ago

I found that xla also goes with the approach (1)

XLA is a compiler for tensorflow. I didn't check torch.

ragmani commented 3 weeks ago

@jinevening

I've heard that shape inference is done after an actual input is given. @hseok-oh @ragmani Or, does onert have inference logic for dynamic shape?

Yes, onert has inference logic that sequentially infers the output shapes of nodes, starting from the first dynamic tensor. An actual input can be a candidate for the first dynamic tensor.
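The sequential propagation described above might look roughly like the sketch below. This is a hypothetical stand-in, not onert's actual code; `propagate_shapes` and the tuple layout are invented for illustration:

```python
def propagate_shapes(ops, input_shapes):
    """Infer each node's output shape in topological order, once the
    actual (now static) input shapes are known.

    ops: list of (output_name, infer_fn, input_names) tuples, a toy
         stand-in for per-op shape inference rules.
    input_shapes: dict mapping graph input names to their shapes.
    """
    shapes = dict(input_shapes)
    for output_name, infer_fn, input_names in ops:
        # every input shape is already determined by the time we get here
        shapes[output_name] = infer_fn([shapes[n] for n in input_names])
    return shapes
```

For example, an elementwise add whose output shape equals its first input's shape propagates as `propagate_shapes([("sum", lambda s: s[0], ["a", "b"])], {"a": [1, 4], "b": [1, 4]})`.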

mhs4670go commented 3 weeks ago

FYI, here is the tflite conversion example.

Codes

```python
import tensorflow as tf

# x1's last dimension is dynamic (None); x2 is fully static
# (except for the batch dimension, which Keras always leaves dynamic).
x1 = tf.keras.Input(shape=(2, None))
x2 = tf.keras.Input(shape=(2, 3))
concat = tf.keras.layers.Concatenate(axis=1)([x1, x2])

model = tf.keras.models.Model(inputs=[x1, x2], outputs=concat)

converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

with open('model.tflite', 'wb') as f:
    f.write(tflite_model)
```

TFL Dump

```
Operands: T(subgraph index : tensor index) TYPE (shape) (shape_signature) B(buffer index) (variable) OperandName
T(0:0) FLOAT32 (1, 2, 1) (-1, 2, -1) B(1) serving_default_input_1:0
T(0:1) FLOAT32 (1, 2, 3) (-1, 2, 3) B(2) serving_default_input_2:0
T(0:2) FLOAT32 (1, 4, 3) (-1, 4, 3) B(3) PartitionedCall:0
```
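Note that in the dump above, `shape_signature` records dynamic dimensions as -1, while the static `shape` field materializes them as 1 (e.g. `(-1, 2, -1)` vs `(1, 2, 1)`). A minimal sketch of that mapping (`signature_to_static_shape` is an invented helper name, not a TFLite API):

```python
def signature_to_static_shape(shape_signature):
    # TFLite convention: a dynamic dimension is -1 in shape_signature
    # and appears as a placeholder 1 in the static shape field
    return [1 if d == -1 else d for d in shape_signature]
```

Also note that the output's signature is `(-1, 4, 3)`: the converter resolved the unknown non-axis dimension to 3, i.e. the same approach (1) discussed above.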
ragmani commented 3 weeks ago

@jinevening @seanshpark

Please leave comments about this issue if you have any other suggestions.

I'm OK with @jinevening 's conclusion.

Maybe @Samsung/one_onert can give how it is treated in runtime?

~In your concat example, onert throws an exception.~ onert supports the concat op with dynamic shape only when all dimensions except the concat axis dimension are equal. https://github.com/Samsung/ONE/blob/8188194d4f4ba135227370ea260400fee0c206e9/runtime/onert/core/src/util/ShapeInference.cc#L312

However, I'm not sure if the compiler treats it the same way as onert. In onert, all shapes of a node's inputs have already been determined (static) by the time its outputs are inferred. So onert can support the above concat example even if an input has an unknown dimension: if the inferred dimension matches, onert works well; if it does not, onert throws an exception.

seanshpark commented 3 weeks ago

From @mhs4670go 's example, with a small change to @jinevening 's case,

two inputs with a concat axis of 1,

this becomes

@mhs4670go , am I correct?

mhs4670go commented 3 weeks ago

@seanshpark Yes. That's right.

glistening commented 3 weeks ago
  1. I don't fully understand how this issue is related to onert. It is about luci. Why is luci going to support dynamic-shaped LLM inference?
    • a) to check the correctness of a generated circle? Or
    • b) to generate circle?
      • Is luci's inferred shape used to generate circle? (I am guessing luci is used to determine shapes when generating circle.)
  2. For LLMs, especially transformer-based models, only a small part of the model has a dynamic shape, which increases by 1 for every token generated. I don't think we need to put much effort into handling every possible shape.
glistening commented 3 weeks ago

I talked with @jinevening offline.

By "support dynamic-shaped LLM inference", @jinevening means

① Generating circle (FrontEnd) + ② Executing circle (Runtime)

At first, I assumed LLM inference meant circle execution, and luci has luci-interpreter. Thus, I thought this issue aimed to run circle models using luci-interpreter.

In short,

luci (not luci-interpreter) is needed to generate circle. onert will run the generated circle.