freedomtan / keras_cv_stable_diffusion_to_tflite

Scripts for converting Keras CV Stable Diffusion to tflite
BSD 3-Clause "New" or "Revised" License

Have you performed inference with the TFLite models? #1

Closed sayakpaul closed 1 year ago

sayakpaul commented 1 year ago

Amazing work here!

I wanted to know if you have performed inference with the converted TFLite models.

freedomtan commented 1 year ago

> Amazing work here!
>
> I wanted to know if you have performed inference with the converted TFLite models.

Yes and no.

sayakpaul commented 1 year ago

I see.

I think if you could come up with a Colab Notebook that shows the end-to-end inference pipeline with the converted models, it would be greatly beneficial to the community.

sayakpaul commented 1 year ago

Also, did you try converting using dynamic-range quantization?

freedomtan commented 1 year ago

> I see.
>
> I think if you could come up with a Colab Notebook that shows the end-to-end inference pipeline with the converted models, it would be greatly beneficial to the community.

Yup, that's doable: borrow the Keras or PyTorch pipeline code, swap in the inference part for the TFLite models, and it's done. I could find some time during the upcoming Lunar New Year holidays to work on it.

sayakpaul commented 1 year ago

Keeping the issue open until then. Thank you so much!

freedomtan commented 1 year ago

> Also, did you try converting using dynamic-range quantization?

Nope. I tried fp16 quantization for Stable Diffusion before; it's trivial. I expect dynamic-range quantization to be trivial too.
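
For reference, the two post-training modes are only a couple of converter flags apart. A minimal sketch, assuming a loaded Keras model `model` (e.g. one of the keras_cv Stable Diffusion submodels), not the repo's exact script:

```python
import tensorflow as tf

def convert(model, fp16=False):
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    # Optimize.DEFAULT alone gives dynamic-range quantization
    # (int8 weights, float activations).
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    if fp16:
        # Restricting supported types to float16 switches to fp16 quantization.
        converter.target_spec.supported_types = [tf.float16]
    return converter.convert()
```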

sayakpaul commented 1 year ago

The only reason I asked is that the resulting models would be smaller than FP16, and it might be less effort than pure int8 quantization. Depending on the speed and size benefits, I think it's worth trying.

freedomtan commented 1 year ago

> Keeping the issue open until then. Thank you so much!

Done:

https://github.com/freedomtan/keras_cv_stable_diffusion_to_tflite/commit/7b44ce2d985c3ceedc8163aba0a2d1781107e2c0

freedomtan commented 1 year ago

> The only reason I asked is that the resulting models would be smaller than FP16, and it might be less effort than pure int8 quantization. Depending on the speed and size benefits, I think it's worth trying.

As expected, conversion using dynamic-range quantization is quite trivial. I didn't verify whether it works, though. https://github.com/freedomtan/keras_cv_stable_diffusion_to_tflite/blob/main/convert_to_tflite_models_with_dynamic_range.py

sayakpaul commented 1 year ago

@freedomtan, which TF version did you use for the above? With TF 2.11.0, the decoder conversion fails without SELECT ops.

freedomtan commented 1 year ago

> @freedomtan, which TF version did you use for the above? With TF 2.11.0, the decoder conversion fails without SELECT ops.

@sayakpaul A TF master branch build from about two weeks ago. A recent tf-nightly from the last couple of weeks should presumably work too.
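
As an aside, if a conversion fails because some ops are not covered by the builtin TFLite kernels, allowing TF select ops is a common workaround (at the cost of needing the Flex delegate at runtime). A sketch, assuming a Keras `decoder` model is in scope:

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_keras_model(decoder)
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,  # use builtin TFLite kernels where possible
    tf.lite.OpsSet.SELECT_TF_OPS,    # fall back to TensorFlow ops otherwise
]
tflite_model = converter.convert()
```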

sayakpaul commented 1 year ago

Hmm. With TF 2.11.0, the inference also fails. I will try it out with tf-nightly: https://colab.research.google.com/gist/sayakpaul/479139ceb0be68234e91f38eb1427c5d/scratchpad.ipynb

freedomtan commented 1 year ago

> Hmm. With TF 2.11.0, the inference also fails. I will try it out with tf-nightly: https://colab.research.google.com/gist/sayakpaul/479139ceb0be68234e91f38eb1427c5d/scratchpad.ipynb

Sorry, I couldn't tell which tensor is wrong from the trace. I guess some of the 13 tensors passed between the two chunks of the diffusion model are not arranged properly. I'll check whether I can do inference with TF 2.11.

freedomtan commented 1 year ago

Dynamic-range quantization seems to work. The results usually differ from the fp32 models, but they look reasonable.

https://github.com/freedomtan/keras_cv_stable_diffusion_to_tflite/blob/main/text_to_image_using_converted_tflite_models_dynamic.ipynb

freedomtan commented 1 year ago

> Hmm. With TF 2.11.0, the inference also fails. I will try it out with tf-nightly: https://colab.research.google.com/gist/sayakpaul/479139ceb0be68234e91f38eb1427c5d/scratchpad.ipynb

It works in my Python 3.8 + TF 2.11.0 + keras_cv 0.4.1 environment.

I guess the problem is the order of the input tensors. With the following snippet:

```python
import tensorflow as tf

# Inspect the input/output tensor details of the two diffusion-model chunks.
i_first = tf.lite.Interpreter('/tmp/sd_diffusion_model_first.tflite')
first_input_details = i_first.get_input_details()
first_output_details = i_first.get_output_details()

for i in first_input_details:
    print(i)

print("")

for o in first_output_details:
    print(o)

print("")

i_second = tf.lite.Interpreter('/tmp/sd_diffusion_model_second.tflite')
second_input_details = i_second.get_input_details()
second_output_details = i_second.get_output_details()

for i in second_input_details:
    print(i)
```
The tensors of my two models are:

{'name': 'serving_default_input_1:0', 'index': 0, 'shape': array([  1,  77, 768], dtype=int32), 'shape_signature': array([ -1,  77, 768], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}
{'name': 'serving_default_input_3:0', 'index': 1, 'shape': array([ 1, 64, 64,  4], dtype=int32), 'shape_signature': array([-1, 64, 64,  4], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}
{'name': 'serving_default_input_2:0', 'index': 2, 'shape': array([  1, 320], dtype=int32), 'shape_signature': array([ -1, 320], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}

{'name': 'StatefulPartitionedCall:7', 'index': 797, 'shape': array([  1,  64,  64, 320], dtype=int32), 'shape_signature': array([ -1,  64,  64, 320], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}
{'name': 'serving_default_input_1:0', 'index': 0, 'shape': array([  1,  77, 768], dtype=int32), 'shape_signature': array([ -1,  77, 768], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}
{'name': 'StatefulPartitionedCall:12', 'index': 2145, 'shape': array([   1,   16,   16, 1280], dtype=int32), 'shape_signature': array([  -1,   16,   16, 1280], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}
{'name': 'StatefulPartitionedCall:5', 'index': 1067, 'shape': array([  1,  32,  32, 320], dtype=int32), 'shape_signature': array([ -1,  32,  32, 320], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}
{'name': 'StatefulPartitionedCall:2', 'index': 437, 'shape': array([  1,  64,  64, 320], dtype=int32), 'shape_signature': array([ -1,  64,  64, 320], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}
{'name': 'StatefulPartitionedCall:9', 'index': 1337, 'shape': array([  1,  32,  32, 640], dtype=int32), 'shape_signature': array([ -1,  32,  32, 640], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}
{'name': 'StatefulPartitionedCall:3', 'index': 1607, 'shape': array([  1,  16,  16, 640], dtype=int32), 'shape_signature': array([ -1,  16,  16, 640], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}
{'name': 'StatefulPartitionedCall:11', 'index': 1877, 'shape': array([   1,   16,   16, 1280], dtype=int32), 'shape_signature': array([  -1,   16,   16, 1280], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}
{'name': 'StatefulPartitionedCall:10', 'index': 1605, 'shape': array([  1,  32,  32, 640], dtype=int32), 'shape_signature': array([ -1,  32,  32, 640], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}
{'name': 'StatefulPartitionedCall:4', 'index': 2147, 'shape': array([   1,    8,    8, 1280], dtype=int32), 'shape_signature': array([  -1,    8,    8, 1280], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}
{'name': 'StatefulPartitionedCall:0', 'index': 434, 'shape': array([   1, 1280], dtype=int32), 'shape_signature': array([  -1, 1280], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}
{'name': 'StatefulPartitionedCall:6', 'index': 2851, 'shape': array([   1,    8,    8, 1280], dtype=int32), 'shape_signature': array([  -1,    8,    8, 1280], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}
{'name': 'StatefulPartitionedCall:8', 'index': 1065, 'shape': array([  1,  64,  64, 320], dtype=int32), 'shape_signature': array([ -1,  64,  64, 320], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}

{'name': 'serving_default_args_0_8:0', 'index': 0, 'shape': array([  1,  32,  32, 640], dtype=int32), 'shape_signature': array([ -1,  32,  32, 640], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}
{'name': 'serving_default_args_0_3:0', 'index': 1, 'shape': array([   1,   16,   16, 1280], dtype=int32), 'shape_signature': array([  -1,   16,   16, 1280], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}
{'name': 'serving_default_args_0_7:0', 'index': 2, 'shape': array([  1,  32,  32, 640], dtype=int32), 'shape_signature': array([ -1,  32,  32, 640], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}
{'name': 'serving_default_args_0_12:0', 'index': 3, 'shape': array([  1,  64,  64, 320], dtype=int32), 'shape_signature': array([ -1,  64,  64, 320], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}
{'name': 'serving_default_args_0_9:0', 'index': 4, 'shape': array([  1,  32,  32, 320], dtype=int32), 'shape_signature': array([ -1,  32,  32, 320], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}
{'name': 'serving_default_args_0_5:0', 'index': 5, 'shape': array([   1,   16,   16, 1280], dtype=int32), 'shape_signature': array([  -1,   16,   16, 1280], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}
{'name': 'serving_default_args_0:0', 'index': 6, 'shape': array([   1,    8,    8, 1280], dtype=int32), 'shape_signature': array([  -1,    8,    8, 1280], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}
{'name': 'serving_default_args_0_6:0', 'index': 7, 'shape': array([  1,  16,  16, 640], dtype=int32), 'shape_signature': array([ -1,  16,  16, 640], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}
{'name': 'serving_default_args_0_10:0', 'index': 8, 'shape': array([  1,  64,  64, 320], dtype=int32), 'shape_signature': array([ -1,  64,  64, 320], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}
{'name': 'serving_default_args_0_11:0', 'index': 9, 'shape': array([  1,  64,  64, 320], dtype=int32), 'shape_signature': array([ -1,  64,  64, 320], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}
{'name': 'serving_default_args_0_2:0', 'index': 10, 'shape': array([   1, 1280], dtype=int32), 'shape_signature': array([  -1, 1280], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}
{'name': 'serving_default_args_0_1:0', 'index': 11, 'shape': array([   1,    8,    8, 1280], dtype=int32), 'shape_signature': array([  -1,    8,    8, 1280], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}
{'name': 'serving_default_args_0_4:0', 'index': 12, 'shape': array([  1,  77, 768], dtype=int32), 'shape_signature': array([ -1,  77, 768], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}
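
For what it's worth, one way to sidestep the ordering issue is to feed the first chunk by tensor name rather than by the positional order of `get_input_details()`. A minimal sketch, where `context`, `latent`, and `t_emb` are placeholder arrays whose roles are inferred from the shapes listed above (text-encoder output, noisy latent, and timestep embedding):

```python
import numpy as np
import tensorflow as tf

# Placeholder inputs shaped per the dump above; in a real pipeline these come
# from the text encoder, the scheduler, and the timestep-embedding computation.
context = np.zeros((1, 77, 768), dtype=np.float32)
latent = np.zeros((1, 64, 64, 4), dtype=np.float32)
t_emb = np.zeros((1, 320), dtype=np.float32)

interp = tf.lite.Interpreter('/tmp/sd_diffusion_model_first.tflite')
interp.allocate_tensors()

feeds = {
    'serving_default_input_1:0': context,
    'serving_default_input_3:0': latent,
    'serving_default_input_2:0': t_emb,
}
for detail in interp.get_input_details():
    # Match each input by its 'name' field so the index order cannot be mixed up.
    interp.set_tensor(detail['index'], feeds[detail['name']])
interp.invoke()

first_outputs = {d['name']: interp.get_tensor(d['index'])
                 for d in interp.get_output_details()}
```
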
sayakpaul commented 1 year ago

I fixed the ordering issue: https://github.com/sayakpaul/Adventures-in-TensorFlow-Lite/blob/master/Stable_Diffusion_to_TFLite.ipynb.

It would also be cool to extend the TFLite models to handle a batch_size greater than 1 when calling generate_image().

freedomtan commented 1 year ago

> I fixed the ordering issue: https://github.com/sayakpaul/Adventures-in-TensorFlow-Lite/blob/master/Stable_Diffusion_to_TFLite.ipynb.
>
> It would also be cool to extend the TFLite models to handle a batch_size greater than 1 when calling generate_image().

TFLite models do support dynamic batch sizes. To use the TFLite models with a batch size > 1, call resize_tensor_input() before calling allocate_tensors().
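
A quick sketch of that, assuming a model whose leading dimension has a dynamic shape signature (-1 in the dumps above) and an illustrative file path:

```python
import tensorflow as tf

batch_size = 2  # any value > 1
interpreter = tf.lite.Interpreter('/tmp/sd_diffusion_model_first.tflite')

# Resize every input's leading (batch) dimension before allocating tensors.
for detail in interpreter.get_input_details():
    new_shape = [batch_size] + list(detail['shape'][1:])
    interpreter.resize_tensor_input(detail['index'], new_shape)

interpreter.allocate_tensors()
```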

sayakpaul commented 1 year ago

Thanks! Fixed the problem in the latest commit to the notebook.

freedomtan commented 1 year ago

@sayakpaul FYI, on my MacBook Pro M1 with 8 GB RAM, inference with the dynamic-range quantized models is much faster than with the 32-bit floating-point ones (under 6 minutes vs. more than 25 minutes).

sayakpaul commented 1 year ago

Maybe that's because your installation was built from source for your machine. It would be interesting to see what happens when you use tf-nightly; that would be a more reasonable comparison in my opinion.

freedomtan commented 1 year ago

@sayakpaul I wrote minimal C++ glue code and tested it on Android devices and macOS.

https://github.com/freedomtan/keras_cv_stable_diffusion_to_tflite/tree/main/cpp_glue_code

sayakpaul commented 1 year ago

@farmaker47