bytecodealliance / wasm-micro-runtime

WebAssembly Micro Runtime (WAMR)

WASI-NN should not apply input quantization #2611

Open · CIPop opened this issue 11 months ago

CIPop commented 11 months ago

Currently, the TFLite wasi-nn implementation quantizes the input whenever a quantization scale and zero-point are present (https://github.com/bytecodealliance/wasm-micro-runtime/blob/main/core/iwasm/libraries/wasi-nn/src/wasi_nn_tensorflowlite.cpp#L323).

This results in poor inference results with ssd_mobilenet_v1_1_metadata_1.tflite (direct download link).
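
For reference, the conversion that wasi_nn_tensorflowlite.cpp performs at that line is the standard affine quantization step. A minimal sketch of the behavior (paraphrased for clarity, not the exact WAMR code) is:

#include <stddef.h>
#include <stdint.h>

// Sketch of the internal wasi-nn/TFLite input handling: every float element
// the caller provides is mapped to uint8 with the tensor's quantization
// parameters, regardless of how the caller prepared the data.
static void
quantize_input(const float *input_f, uint8_t *input_q, size_t n,
               float scale, float zero_point)
{
    for (size_t i = 0; i < n; i++)
        input_q[i] = (uint8_t)(input_f[i] / scale + zero_point);
}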


The SSD mobilenet v1.1 model has the following input details:

import numpy as np
import tensorflow as tf
i = tf.lite.Interpreter(model_path="ssd_mobilenet_v1_1_metadata_1.tflite")
i.allocate_tensors()
input_details = i.get_input_details()[0]
input_details
{'name': 'normalized_input_image_tensor',
 'index': 175,
 'shape': array([  1, 300, 300,   3], dtype=int32),
 'shape_signature': array([  1, 300, 300,   3], dtype=int32),
 'dtype': numpy.uint8,               # <-- note the uint8 input dtype
 'quantization': (0.0078125, 128),
 'quantization_parameters': {'scales': array([0.0078125], dtype=float32),
  'zero_points': array([128], dtype=int32),
  'quantized_dimension': 0},
 'sparsity_parameters': {}}

The model works well without the RGB input (300x300x3 uint8_t) being quantized. (See my bug at https://github.com/joonb14/TFLiteDetection/issues/1 for a full Jupyter Notebook example.) When I try to apply quantization (either in Python or by running the input through wasi-nn) I get very poor results.

To work around this issue, I had to apply the inverse function when creating the input tensor:

// Taken from the model's input_details:
#define QUANTIZATION_SCALE 0.0078125
#define QUANTIZATION_ZERO_POINT 128.0

// in create_input(...)

    for (int i = 0; i < input.elements; ++i)
    {
        // WAMR / wasi-nn bug: the model does not expect quantized data, but
        // wasi-nn quantizes internally regardless:
        //     it[i] = (uint8_t)(input_tensor_f[i] / scale + zero_point);
        // Reverse that internal quantization up front:
        input.input_tensor[i] = ((float)data[i] - QUANTIZATION_ZERO_POINT) * QUANTIZATION_SCALE;
    }

    return input;
}

With the above workaround, I get exactly the same (good) results in both Python and when running with iwasm (wasi-nn enabled).

I'm confused by https://www.tensorflow.org/lite/performance/post_training_integer_quant#run_the_tensorflow_lite_models, which states that if input_details['dtype'] == np.uint8, quantization should be applied to the input (which is what wasi-nn does)...

tonibofarull commented 11 months ago

Hi, if I visualize the model with Netron I get the following: [Netron screenshot of the input tensor's quantization properties]. As you can see, the quantization section of the input indicates that the original distribution of your data is [-1, 1) (float/double). Because the model has been trained with that range of values, it expects values in [-1, 1) so that the inferencer can quantize them, that is, convert them from [-1, 1) to [0, 255] (uint8):

-1 * (1 / 0.0078125) + 128 = 0
0.9921875 * (1 / 0.0078125) + 128 = 255

In the Python inferencer, quantization is not done automatically, so it expects the user to do it. As you note in https://github.com/joonb14/TFLiteDetection/issues/1, you are right: the processor there lacks the preprocessing step that quantizes the input tensor. However, because you loaded the images as uint8 instead of in the range of values the model was trained on, the quantization was implicit, by coincidence.

On the other hand, in wasi-nn the quantization is done internally, so it expects the original range of values. In your workaround you have transformed [0, 255], a data distribution that does not correspond to the one the model was trained on, into [-1, 1), which is the valid one. This way, the model can perform the transformation to the correct range of values by itself.

Note that wasi-nn expects values in the range the model was trained with. Any other assumption is (in most cases) wrong, since the distribution of the data at inference time should match the one used for training (there are always exceptions).
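
To make the failure mode concrete, here is a quick check (a standalone sketch using the same formula quoted in the workaround above, not WAMR code) of what happens when already-quantized [0, 255] values are pushed through that internal conversion a second time, versus the normalized [-1, 1) values the runtime expects:

#include <stdio.h>

int main(void)
{
    const float scale = 0.0078125f;   // from the model's input_details
    const float zero_point = 128.0f;

    // Feeding a raw pixel value (already in [0, 255]) straight into wasi-nn
    // means it gets quantized again and lands far outside the uint8 range:
    float raw = 200.0f;
    printf("raw %.0f -> %.0f (does not fit in uint8)\n",
           raw, raw / scale + zero_point);               // 200/0.0078125 + 128 = 25728

    // Normalizing to [-1, 1) first gives a value the internal quantization
    // maps back into a valid uint8:
    float normalized = (raw / 255.0f) * 2.0f - 1.0f;     // ~0.5686
    printf("normalized %.4f -> %.0f (valid uint8)\n",
           normalized, normalized / scale + zero_point); // ~201
    return 0;
}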

tonibofarull commented 11 months ago

Note also that if we wanted the users themselves to quantize the values, we would have 2 options:

  1. wasi-nn needs a way to pass the scale and offset information to them, which is not possible at this time (see the hypothetical sketch after this list).
  2. The user knows the values in advance and hard-codes them.
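
A purely hypothetical illustration of option 1: nothing like the function below exists in wasi-nn or WAMR today, so the name, struct, and signature are invented only to show what "passing the scale and offset to the user" could look like:

#include <stdint.h>

// HYPOTHETICAL: not part of wasi-nn or WAMR. A host call like this would let
// the guest query an input tensor's quantization parameters and decide
// whether to quantize the data itself.
typedef struct {
    float scale;        // e.g. 0.0078125 for ssd_mobilenet_v1
    int32_t zero_point; // e.g. 128
} quantization_params;

// Imagined signature only; would return a wasi-nn style error code.
int
get_input_quantization(uint32_t exec_context, uint32_t input_index,
                       quantization_params *out);
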
abrown commented 11 months ago

@tonibofarull, I just skimmed this issue but perhaps the information that you're looking to pass on could be done if we added a new metadata feature to wasi-nn? I've been looking for examples where this would be useful. Would the metadata need to be attached to the tensor or the graph or the context?

tonibofarull commented 11 months ago

The metadata is already in the model, at least in the case of TFLite. The problem reported by @CIPop is that wasi-nn expects the input in the range used for training instead of directly asking for the quantized version, which from my point of view is correct. In this case the quantized type happened to be uint8, the same as the image format, but it could have been uint16 or anything else, and then the images would have to be rescaled no matter what.

Perhaps what we can do is allow users to decide whether to quantize manually or let the runtime assume the input is in the range used for training.

CIPop commented 11 months ago

@tonibofarull I just verified that with the expected [-1..1] input range, WASI-NN performs as expected. Thank you for the in-depth explanation!

We could add documentation for this in WAMR's WASI-NN wasi_nn.h.

The pre-processing should be:

  1. Obtain the input - coincidentally uint8_t 300x300x3, values [0..255].
  2. Transform the input to float 300x300x3, values [-1..1].
  3. Based on the model input type (uint8_t), apply the quantization parameters. This transforms the input back to uint8_t 300x300x3, values [0..255].

The important part is that, while the tensors in steps 1 and 3 have the same shape and type, the values are clearly different.

Tested in Python:

# im is the PIL.Image loaded earlier; np, the Interpreter i, and input_details
# come from the snippet at the top of this issue.
res_im = im.resize((300, 300))
np_res_im = np.array(res_im)

# Transform from input RGB [0..255] to [-1, 1]
np_res_im = (np_res_im / 255) * 2 - 1

# From https://www.tensorflow.org/lite/performance/post_training_integer_quant#run_the_tensorflow_lite_models
# Check if the input type is quantized, then rescale input data to uint8
if input_details['dtype'] == np.uint8:
    input_scale, input_zero_point = input_details["quantization"]
    np_res_im = np_res_im / input_scale + input_zero_point

np_res_im = np.expand_dims(np_res_im, axis=0).astype(input_details["dtype"])

# Quantized input [0..255].
print(np_res_im)

Tested in WASI-NN / C:

    for (int i = 0; i < input.elements; ++i)
    {
        // WASI-NN expects non-quantized (normalized) RGB data in [-1..1]
        input.input_tensor[i] = ((float)data[i] / 255) * 2 - 1;
    }

Given @tonibofarull's explanation and the official TFLite quantization documentation, I am now convinced this isn't a WASI-NN / TFLite implementation bug.

This explanation is a bit ambiguous:

Lets assume the expected image is 300x300 pixels, with three channels (red, blue, and green) per pixel. This should be fed to the model as a flattened buffer of 270,000 byte values (300x300x3). If the model is quantized, each value should be a single byte representing a value between 0 and 255.

The second sentence is true only if the model is indeed quantized. I would expect that non-quantized models would accept a flattened buffer of 270,000 float values.

Feel free to close this unless you'd like to keep it open to track the extra metadata API that would allow external quantization.