
Let's find reference model for RNN support #8747

Open chunseoklee opened 2 years ago

chunseoklee commented 2 years ago

For milestone in https://github.com/Samsung/ONE/projects/9#card-79474017

Candidate 1

one-cmds pytorch (or ONNX) LSTM op import fails (#8217)

Candidate 2

based on https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/lite/examples/experimental_new_converter/Keras_LSTM_fusion_Codelab.ipynb

We can generate other RNN models like SimpleRNN, LSTM, and GRU. Here is example code to generate one with GRU:

```python3
# !pip install tensorflow==2.7.0
import numpy as np
import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(28, 28), name='input'),
    tf.keras.layers.GRU(20, time_major=False, return_sequences=True),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax, name='output')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()

run_model = tf.function(lambda x: model(x))
# This is important, let's fix the input size.
BATCH_SIZE = 1
STEPS = 28
INPUT_SIZE = 28
concrete_func = run_model.get_concrete_function(
    tf.TensorSpec([BATCH_SIZE, STEPS, INPUT_SIZE], model.inputs[0].dtype))

# model directory.
MODEL_DIR = "keras_lstm"
model.save(MODEL_DIR, save_format="tf", signatures=concrete_func)

converter = tf.lite.TFLiteConverter.from_saved_model(MODEL_DIR)
tflite_model = converter.convert()

with open('model.tflite', 'wb') as f:
  f.write(tflite_model)
```
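
As a quick sanity check (a minimal sketch, not part of the original script), the converted model.tflite can be run with the TFLite interpreter on a random input:

```python3
import numpy as np
import tensorflow as tf

# Load the converted flatbuffer and run one random input through it
interpreter = tf.lite.Interpreter(model_path='model.tflite')
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

dummy = np.random.rand(1, 28, 28).astype(np.float32)
interpreter.set_tensor(input_details[0]['index'], dummy)
interpreter.invoke()

output = interpreter.get_tensor(output_details[0]['index'])
print(output.shape)  # expected: (1, 10)
```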

Candidate 3

based on a PyTorch tutorial:

chunseoklee commented 2 years ago

Considering the objective in https://github.com/Samsung/ONE/projects/9 ("RNN Model with single while loop of non-dynamic tensor"), a while loop is required.
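
To check that the converted model actually contains a WHILE op, one option (a sketch; it assumes TF 2.7+, where tf.lite.experimental.Analyzer is available) is to dump the op list of the flatbuffer:

```python3
import tensorflow as tf

# Prints an op-by-op breakdown of model.tflite, including control-flow ops
# such as WHILE and the cond/body subgraphs they reference
tf.lite.experimental.Analyzer.analyze(model_path='model.tflite')
```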

seanshpark commented 2 years ago

With the above PyTorch tutorials I could prepare a simple encoder model with the script below:

```python3
import torch
import torch.onnx
import onnx

torch.manual_seed(1)

class SimpleEncoder(torch.nn.Module):
    def __init__(self, hidden_size, n_layers=1):
        super(SimpleEncoder, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size

        self.lstm = torch.nn.LSTM(hidden_size, hidden_size, n_layers)

    def forward(self, input_seq, hidden=None):
        outputs, hidden = self.lstm(input_seq, hidden)
        return outputs, hidden

n_layers = 1
hidden_size = 16
encoder = SimpleEncoder(hidden_size, n_layers)

# input: (seq_len, batch, input_size)
inputs = torch.randn(1, 2, hidden_size)
print("inputs =", inputs)

# initial hidden/cell states: (num_layers, batch, hidden_size)
h0 = torch.randn(n_layers, 2, hidden_size)
c0 = torch.randn(n_layers, 2, hidden_size)
outputs, (hn, cn) = encoder(inputs, (h0, c0))
print("outputs =", outputs)
print("hn =", hn)
print("cn =", cn)

input_names = ["input", "h0", "c0"]
output_names = ["output", "hn", "cn"]

torch.onnx.export(encoder,
                  (inputs, (h0, c0)),
                  "simple_encoder_01.onnx",
                  input_names=input_names,
                  output_names=output_names)

# Save a copy of the model with inferred shapes for easier inspection
def save_with_shape(fname, fnamewsi):
    model = onnx.load(fname)
    model_si = onnx.shape_inference.infer_shapes(model)
    onnx.save(model_si, fnamewsi)

save_with_shape("simple_encoder_01.onnx", "simple_encoder_01_si.onnx")
```
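
To double-check the exported ONNX graph (a minimal sketch, assuming onnxruntime is installed and that h0/c0 were captured as graph inputs by the export above):

```python3
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("simple_encoder_01.onnx")

# Shapes match the tensors used for tracing: (seq_len, batch, hidden_size)
feed = {
    "input": np.random.randn(1, 2, 16).astype(np.float32),
    "h0": np.random.randn(1, 2, 16).astype(np.float32),
    "c0": np.random.randn(1, 2, 16).astype(np.float32),
}
output, hn, cn = sess.run(["output", "hn", "cn"], feed)
print(output.shape, hn.shape, cn.shape)
```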

(image attached)

chunseoklee commented 2 years ago

In Candidate 2, by replacing GRU with LSTM, we get a model without a WHILE op and with UnidirectionalSequenceLSTM, which is a built-in TFLite operation: model_LSTM_keras.zip
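
For reference, a minimal sketch of the LSTM variant of the Candidate 2 model (only the recurrent layer changes from the script above):

```python3
import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(28, 28), name='input'),
    # LSTM instead of GRU: the TFLite converter fuses this layer into a
    # single UnidirectionalSequenceLSTM op instead of lowering it to WHILE
    tf.keras.layers.LSTM(20, time_major=False, return_sequences=True),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax, name='output')
])
```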

chunseoklee commented 1 year ago

cc @ragmani

I tried to obtain a fully quantized (w8a8) model as follows, but failed to get a fully quantized one.

```python3
# !pip install tensorflow==2.7.0
import numpy as np
import tensorflow as tf

def representative_dataset():
    for _ in range(100):
      data = np.random.rand(1, 28, 28)
      yield [data.astype(np.float32)]

model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(28, 28), name='input'),
    tf.keras.layers.GRU(20, time_major=False, return_sequences=True),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax, name='output')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()

run_model = tf.function(lambda x: model(x))
# This is important, let's fix the input size.
BATCH_SIZE = 1
STEPS = 28
INPUT_SIZE = 28
concrete_func = run_model.get_concrete_function(
    tf.TensorSpec([BATCH_SIZE, STEPS, INPUT_SIZE], model.inputs[0].dtype))

# model directory.
MODEL_DIR = "keras_lstm"
model.save(MODEL_DIR, save_format="tf", signatures=concrete_func)

converter = tf.lite.TFLiteConverter.from_saved_model(MODEL_DIR)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
tflite_model = converter.convert()

with open('model_q8.tflite', 'wb') as f:
  f.write(tflite_model)
```

Here is a netron snapshot:

![image](https://user-images.githubusercontent.com/4862887/191427854-2149689a-5860-457b-8448-2f5559084ac4.png)

which contains Dequantize and Quantize ops around the While operation. It seems that TFLiteConverter does not support quantization for the While op.
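
As an experiment (a sketch, not from the original comment): forbidding the float fallback should make the limitation explicit, because conversion then fails instead of silently wrapping the While op in Quantize/Dequantize pairs. This continues the script above (MODEL_DIR and representative_dataset as defined there):

```python3
# Same converter setup as above, but restricted to int8 builtin ops only.
# NOTE a sketch: if WHILE has no quantized kernel, convert() is expected
# to raise an error rather than emit a partially-quantized model.
converter = tf.lite.TFLiteConverter.from_saved_model(MODEL_DIR)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
tflite_model = converter.convert()
```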

ragmani commented 1 year ago

I tried to quantize the body subgraph but failed.

The above exception was the direct cause of the following exception:

```
Traceback (most recent call last):
  File "quantize.py", line 33, in <module>
    optimized_circle = onecc.optimize(circle, options=optimize_options)
  File "/usr/local/lib/python3.8/dist-packages/onecc-0.1.0+220921195027-py3.8.egg/onecc/commands/optimize/__init__.py", line 44, in optimize
  File "/usr/local/lib/python3.8/dist-packages/onecc-0.1.0+220921195027-py3.8.egg/onecc/cli/onecc.py", line 64, in invoke
onecc.errors.CommandError: Error while running command:

$ /usr/bin/onecc optimize --input_path /tmp/onecc_afwww81g/model_body.0.import.circle --output_path /tmp/onecc_afwww81g/model_body.0.import.0.opt.circle --fuse_add_with_tconv --fuse_add_with_fully_connected --fuse_batchnorm_with_conv --fuse_batchnorm_with_tconv --fuse_batchnorm_with_dwconv --fuse_activation_function --fuse_instnorm --fold_dequantize --fold_densify --substitute_padv2_to_pad --substitute_splitv_to_split --substitute_squeeze_to_reshape --resolve_customop_add --resolve_customop_batchmatmul --resolve_customop_max_pool_with_argmax --resolve_customop_splitv --transform_min_max_to_relu6 --transform_min_relu_to_relu6 --replace_non_const_fc_with_batch_matmul

[EXIT CODE] 255
[STDOUT]
[STDERR]
circle2circle: ERROR: loco::must_cast() failed to cast: PN4luci11CircleConstE
```

Try re-running the command from the command line.

If you see the same error message from the command line, You are ready report an issue to: https://github.com/Samsung/ONE/issues.

When reporting an issue, please make sure you attach the below information.

  1. Installed one-compiler version (can be found with dpkg-query -s one-compiler)
  2. Full command and the necessary files to reproduce the error

Here are the scripts and the body subgraph model to reproduce.

<Details>

- Create a tflite model with a while op
```python3
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

# Load a dataset
(ds_train, ds_test), ds_info = tfds.load(
    'mnist',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,
    with_info=True,
)

# Build a training pipeline
def normalize_img(image, label):
  """Normalizes images: `uint8` -> `float32`."""
  return tf.cast(image, tf.float32) / 255., label

ds_train = ds_train.map(
    normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
ds_train = ds_train.cache()
ds_train = ds_train.shuffle(ds_info.splits['train'].num_examples)
ds_train = ds_train.batch(128)
ds_train = ds_train.prefetch(tf.data.AUTOTUNE)

# Build an evaluation pipeline
ds_test = ds_test.map(
    normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
ds_test = ds_test.batch(128)
ds_test = ds_test.cache()
ds_test = ds_test.prefetch(tf.data.AUTOTUNE)

model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(28, 28), name='input'),
    tf.keras.layers.GRU(20, time_major=False, return_sequences=True),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax, name='output')
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(0.001),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

model.fit(
    ds_train,
    epochs=1,
    validation_data=ds_test,
)

model.summary()

run_model = tf.function(lambda x: model(x))
# This is important, let's fix the input size.
BATCH_SIZE = 1
STEPS = 28
INPUT_SIZE = 28
concrete_func = run_model.get_concrete_function(
    tf.TensorSpec([BATCH_SIZE, STEPS, INPUT_SIZE], model.inputs[0].dtype))

converter = tf.lite.TFLiteConverter.from_concrete_functions([concrete_func])
# NOTE Do not set converter.optimizations. It converts weights in the model to
#      quantized int8. "onecc" throws errors when quantizing models because
#      "onecc" does not support quantizing models that already have quantized
#      weights.
tflite_model = converter.convert()

tflite_path='model.tflite'
with open(tflite_path, 'wb') as f:
  f.write(tflite_model)
```

- Quantize the body graph

```python3
import numpy as np
import tensorflow_datasets as tfds

# Load a dataset
(ds_train, ds_test), ds_info = tfds.load(
    'mnist',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,
    with_info=True,
)
train_images = [image for image, label in ds_train]

import onecc
import onecc.experimental.auto

quantized_circle_path = 'model_body.q8.circle'
body_tflite_path = 'model_body.tflite'
dtype = 'uint8'

# Get default options (experimental feature)
import_options = onecc.experimental.auto.get_import_options(model='tflite', backend='tv2')
optimize_options = onecc.experimental.auto.get_optimize_options(model='tflite', backend='tv2')
quantize_options = onecc.experimental.auto.get_quantize_options(model='tflite', backend='tv2')

# Prepare representative dataset for quantization
# TODO get random sample
representative_dataset = [
    (np.array(i).astype(np.int32),
     np.array(i).astype(np.int32),
     np.random.rand(1, 20).astype(np.float32),
     train_images[i].numpy().reshape(28, 1, 28).astype(np.float32))
    for i in range(5)
]

# Import, optimize, and quantize the model
circle = onecc.import_tflite(body_tflite_path, options=import_options)
optimized_circle = onecc.optimize(circle, options=optimize_options)
quantized_circle = onecc.quantize(optimized_circle,
                                  dataset=representative_dataset,
                                  quantized_dtype=dtype,
                                  options=quantize_options)

# Save the generated model
quantized_circle.save(quantized_circle_path)
```



[model_body.zip](https://github.com/Samsung/ONE/files/9625865/model_body.zip)

</Details>
seanshpark commented 1 year ago

circle2circle: ERROR: loco::must_cast() failed to cast: PN4luci11CircleConstE

@ragmani, please share the input .circle file that was used for /usr/bin/onecc optimize

ragmani commented 1 year ago

Here is the input .circle file: model_body.0.import.zip

seanshpark commented 1 year ago

For testing, using model_body.cfg:

```
one-optimize -C model_body.cfg
```

model_body.cfg:

```
[one-optimize]
input_path=model_body.0.import.circle
output_path=model_body.0.import.0.opt.circle
fuse_add_with_tconv=True
fuse_add_with_fully_connected=True
fuse_batchnorm_with_conv=True
fuse_batchnorm_with_tconv=True
fuse_batchnorm_with_dwconv=True
fuse_activation_function=True
fuse_instnorm=True
fold_dequantize=True
fold_densify=True
substitute_padv2_to_pad=True
substitute_splitv_to_split=True
substitute_squeeze_to_reshape=True
resolve_customop_add=True
resolve_customop_batchmatmul=True
resolve_customop_max_pool_with_argmax=True
resolve_customop_splitv=True
transform_min_max_to_relu6=True
transform_min_relu_to_relu6=True
replace_non_const_fc_with_batch_matmul=True
```

ragmani commented 1 year ago

The model seems to have dynamic tensors that are outputs of the Slice op.

(image attached)

ragmani commented 1 year ago

I tried to quantize the body model after removing dynamic tensors.

```
[EXIT CODE] 255
[STDOUT]
[STDERR]
/usr/share/one/bin/record-minmax: ERROR: Wrong number of inputs.
```


- Cut model

```bash
$ echo "0-18 20-21 23 25" > opcode.txt
$ python3 tools/tflitefile_tool/select_operator.py -g 2 model.tflite opcode.txt model_body.tflite
```

chunseoklee commented 1 year ago

@ragmani https://github.com/Samsung/ONE/files/9630977/model_body.0.import.0.opt.zip consists of two graphs.

seanshpark commented 1 year ago

circle2circle: ERROR: loco::must_cast() failed to cast: PN4luci11CircleConstE

Direct reason: loco::NodeShape infer_slice(const luci::CircleSlice *node) fails.

(image attached)

The Slice input is a Concat, which is not Const; currently we only support Const inputs.

ragmani commented 1 year ago

The Slice input is a Concat, which is not Const; currently we only support Const inputs.

Thanks for your kind response. If the Slice input is not Const, the Slice op produces a dynamic output. So, in this issue, it would be better to proceed by quantizing the model with the Slice ops removed, as in https://github.com/Samsung/ONE/issues/8747#issuecomment-1255822642
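
For illustration (a sketch, not from the thread): in TensorFlow, a Slice whose size input is a runtime tensor rather than a constant has an output shape that cannot be inferred statically, which is exactly what makes the tensor dynamic:

```python3
import tensorflow as tf

@tf.function(input_signature=[tf.TensorSpec([2], tf.int32)])
def f(size_vec):
    x = tf.zeros([28, 28])
    # `size` comes from a runtime tensor, so the result shape is unknown
    return tf.slice(x, begin=[0, 0], size=size_vec)

cf = f.get_concrete_function()
print(cf.outputs[0].shape)  # (None, None): a dynamic tensor
```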

ragmani commented 1 year ago

I tried to quantize the body model after removing dynamic tensors.

  • Error message ... /usr/share/one/bin/record-minmax: ERROR: Wrong number of inputs.

It's my mistake: I tried to quantize the model with the wrong representative inputs.

seanshpark commented 1 year ago

```
onecc quantize \
  --input_path model_body.0.import.0.opt.circle \
  --output_path model_body.0.import.0.opt.0.q.circle \
  --granularity channel --quantized_dtype uint8
```

This gave me:

```
Recording 0'th data
Recording 1'th data
Recording 2'th data
Recording finished. Number of recorded data: 3
circle_quantizer: ERROR: Wrong data type detected in while/add_5
```

ragmani commented 1 year ago

I tried to proceed with quantizing the model, but I got another error: error_wrong_data_type_detected_in_while-add_5.zip

```
/usr/bin/onecc quantize --input_path model_body.0.import.0.opt.circle --output_path model_body.0.import.0.opt.0.q.circle --granularity channel --quantized_dtype uint8 --input_data dataset.0.h5
Recording 0'th data
Recording 1'th data
Recording finished. Number of recorded data: 2
circle_quantizer: ERROR: Wrong data type detected in while/add_5
```

seanshpark commented 1 year ago

while/add_5 is int32 type... ping @jinevening

ragmani commented 1 year ago

@jinevening Please take a look at https://github.com/Samsung/ONE/issues/8747#issuecomment-1255986000

jinevening commented 1 year ago

Ah, sorry. I missed the comment. I'm working on supporting int32 operators in the quantizer.

Please note that int32 operators will not be quantized but left as-is, so the backend will receive int32 operators.

jinevening commented 1 year ago

https://github.com/Samsung/ONE/pull/9805 will resolve the problem.

ragmani commented 1 year ago

@jinevening Thanks for your help. I checked that it works well.

ragmani commented 1 year ago

I compiled the model, but almost half of the body graph was cut by removing the part that couldn't be compiled to run on the trix backend. I'll try to test the compiled model with the trix backend.

This is the model in circle version. gru_body_model.zip

Scripts

- Create a tflite model with a while op

```python3
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

# Load a dataset
(ds_train, ds_test), ds_info = tfds.load(
    'mnist',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,
    with_info=True,
)

# Build a training pipeline
def normalize_img(image, label):
  """Normalizes images: `uint8` -> `float32`."""
  return tf.cast(image, tf.float32) / 255., label

ds_train = ds_train.map(
    normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
ds_train = ds_train.cache()
ds_train = ds_train.shuffle(ds_info.splits['train'].num_examples)
ds_train = ds_train.batch(128)
ds_train = ds_train.prefetch(tf.data.AUTOTUNE)

# Build an evaluation pipeline
ds_test = ds_test.map(
    normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
ds_test = ds_test.batch(128)
ds_test = ds_test.cache()
ds_test = ds_test.prefetch(tf.data.AUTOTUNE)

model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(28, 28), name='input'),
    tf.keras.layers.GRU(20, time_major=False, return_sequences=True),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax, name='output')
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(0.001),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

model.fit(
    ds_train,
    epochs=1,
    validation_data=ds_test,
)

model.summary()

run_model = tf.function(lambda x: model(x))
# This is important, let's fix the input size.
BATCH_SIZE = 1
STEPS = 28
INPUT_SIZE = 28
concrete_func = run_model.get_concrete_function(
    tf.TensorSpec([BATCH_SIZE, STEPS, INPUT_SIZE], model.inputs[0].dtype))

converter = tf.lite.TFLiteConverter.from_concrete_functions([concrete_func])
# NOTE Do not set converter.optimizations. It converts weights in the model to
#      quantized int8. "onecc" throws errors when quantizing models because
#      "onecc" does not support quantizing models that already have quantized
#      weights.
tflite_model = converter.convert()

tflite_path='model.tflite'
with open(tflite_path, 'wb') as f:
  f.write(tflite_model)
```

- Cut only the body graph

```bash
$ echo "1-2 4-16 23" > opcode.txt
$ python3 tools/tflitefile_tool/select_operator.py -g 2 model.tflite opcode.txt model_body.tflite
```

- Quantize the body graph

```python3
import numpy as np
import tensorflow as tf
'''
import tensorflow_datasets as tfds

# Load a dataset
(ds_train, ds_test), ds_info = tfds.load(
    'mnist',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,
    with_info=True,
)
train_images = [image for image, label in ds_train]
'''

import onecc
import onecc.experimental.auto

quantized_circle_path = 'model_body.q8.circle'
body_tflite_path = 'model_body.tflite'
dtype = 'uint8'

# Get default options (experimental feature)
import_options = onecc.experimental.auto.get_import_options(model='tflite', backend='tv2')
optimize_options = onecc.experimental.auto.get_optimize_options(model='tflite', backend='tv2')
quantize_options = onecc.experimental.auto.get_quantize_options(model='tflite', backend='tv2')

# Prepare representative dataset for quantization
# TODO get random sample
#representative_dataset = [
#    (np.array(i).astype(np.int32),
#     np.random.rand(1, 20).astype(np.float32),
#     train_images[i].numpy().reshape(28, 1, 28).astype(np.float32))
#    for i in range(5)
#]
representative_dataset = [
    (np.random.rand(1, 20).astype(np.float32) * 255,
     np.random.rand(1, 28).astype(np.float32) * 255)
    for i in range(5)
]

# Import, optimize, and quantize the model
circle = onecc.import_tflite(body_tflite_path, options=import_options)
optimized_circle = onecc.optimize(circle, options=optimize_options)
quantized_circle = onecc.quantize(optimized_circle,
                                  dataset=representative_dataset,
                                  quantized_dtype=dtype,
                                  options=quantize_options)

# Save the generated model
quantized_circle.save(quantized_circle_path)
```

ragmani commented 1 year ago

I've heard from @ejjeong that we can consider using the model below. https://github.sec.samsung.net/AIP/NPU_Compiler/blob/8b4825a9a83826b79ec75ece8fc40ff1716b7ff3/res/Collab/Issue/13310/caption_image.ptmex#L45

It is a model that has already been proven to run after unrolling. However, there are two issues with running the model on onert:

  1. Is there any way to convert rnn onnx model to circle model without unrolling?
  2. Is there any way to cut rnn circle model?
ragmani commented 1 year ago

I made a tvn file of the model in https://github.com/Samsung/ONE/issues/8747#issuecomment-1260829755 and tried to run it manually. It works well.

model_body.q8.zip

```
$ BACKENDS=trix /usr/bin/nnfw-test/Product/out/bin/nnpackage_run model_body.q8 --load:raw model_body.q8/input_0.tv2b --dump:raw output.tv2b -w 10 -r 100
Package Filename model_body.q8
output.tv2b.0 is generated.
===================================
MODEL_LOAD   takes 1.741 ms
PREPARE      takes 10.608 ms
EXECUTE      takes 1.262 ms
- MEAN     :  1.262 ms
- MAX      :  5.274 ms
- MIN      :  0.782 ms
- GEOMEAN  :  1.134 ms
===================================
```
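
The dumped output is a raw byte buffer; here is a minimal sketch for inspecting it (assuming the model's single output tensor is uint8, matching the quantized_dtype used above):

```python3
import numpy as np

# nnpackage_run dumps one raw file per output tensor
data = np.fromfile('output.tv2b.0', dtype=np.uint8)
print(data.size, data[:16])
```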