allo- / virtual_webcam_background

Use a virtual webcam background and overlays with body-pix and v4l2loopback
GNU General Public License v3.0

Using quantised model #46

Open Nerdyvedi opened 3 years ago

Nerdyvedi commented 3 years ago

Hi, for some unknown reason the quantised weights do not work, but I found a workaround. By first converting the weights from json to SavedModel format, then converting that to tflite and applying post-training quantization, I was able to get faster results from the quantised model.
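
Roughly, the tflite step looks like this (a minimal sketch only; the paths are placeholders, and it assumes the weights have already been converted from the tfjs json format to a SavedModel directory):

import tensorflow as tf

# Post-training quantization of the converted SavedModel
# (paths are placeholders, not the exact ones used here).
converter = tf.lite.TFLiteConverter.from_saved_model("bodypix_savedmodel")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]  # one option: float16 quantization

tflite_model = converter.convert()
with open("resnet_float_16.tflite", "wb") as f:
    f.write(tflite_model)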

Let me know if I should open a pull request for this.

allo- commented 3 years ago

Yes, I would like that, especially if it increases performance.

You can also have a look at https://github.com/allo-/virtual_webcam_background/issues/40#issuecomment-706603497 and https://github.com/de-code/python-tf-bodypix. I think they would also appreciate help with the quantized models, and I may migrate to that library at some point (when I have the time to test it and see how easy it is to integrate).

de-code commented 3 years ago

With my python-tf-bodypix project, one could load the quant model like this:

python -m tf_bodypix \
    draw-mask \
    --source webcam:0 \
    --show-output \
    --threshold=0.75 \
    --add-overlay-alpha=0.5 \
    --colored \
    --model-path=https://storage.googleapis.com/tfjs-models/savedmodel/bodypix/mobilenet/quant1/075/model-stride16.json

Although I am not seeing any speedup that way; it still seems to be using floats in the model.

I have now pushed my WIP tflite support branch as a PR. It kind of stalled because I didn't have a suitable tflite model at hand.

Perhaps you could share your tflite model?

Nerdyvedi commented 3 years ago

Hi, I have uploaded the resnet50 model with float16 quantization. Please test it. I tested it on a CPU, and inference time decreased by around 25%.

Model

de-code commented 3 years ago

Thank you. I changed my branch to make it work with that model.

This could be tested via:

python -m tf_bodypix \
    draw-mask \
    --source webcam:0 \
    --show-output \
    --threshold=0.75 \
    --add-overlay-alpha=0.5 \
    --colored \
    --output-stride=16 \
    --model-path=/path/to/resnet_float_16.tflite

vs.

python -m tf_bodypix \
    draw-mask \
    --source webcam:0 \
    --show-output \
    --threshold=0.75 \
    --add-overlay-alpha=0.5 \
    --colored \
    --output-stride=16 \
    --model-path=https://storage.googleapis.com/tfjs-models/savedmodel/bodypix/resnet50/float/model-stride16.json

In my brief, non-scientific test it seemed to be slower on my CPU. It might very well be my tflite implementation (it's a bit hacked together). There are probably a few possible optimisations, like not calling get_tensor for every output tensor.
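
Roughly, the idea would be to look up the needed output index once (a sketch only; the 'float_segments' output name is an assumption based on the BodyPix output naming):

import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="resnet_float_16.tflite")
interpreter.allocate_tensors()

# Look up the segmentation output index once, instead of calling
# get_tensor for every output tensor after each invoke.
output_details = interpreter.get_output_details()
segments_index = next(
    d['index'] for d in output_details if 'float_segments' in d['name'])

# ... then, after interpreter.invoke():
# segment_logits = interpreter.get_tensor(segments_index)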

The above commands will print out timings every second. For the model part it seems to be hovering around 430ms (though I have also seen it lower for a few iterations), whereas when loading the tensorflow js model it appears to be around 250ms. (That is then also reflected in the overall fps.)

I would be interested to see what timings you are getting, or what tflite integration you are using.

allo- commented 3 years ago

I need to take some time to test your library and tool with working CUDA. I am not sure, but maybe quantized models are optimized for less accurate but faster GPU pipelines?

de-code commented 3 years ago

I need to take some time to test your library and tool with working CUDA.

Yes, please do, and let me know if there are any issues. There is no developer documentation at the moment, but it should be fairly straightforward (there are make targets).

I am not sure, but maybe quantized models are optimized for less accurate but faster GPU pipelines?

That's quite possible. Although @Nerdyvedi mentioned having tested it on a CPU as well, so it probably also has something to do with the tflite integration.

Nerdyvedi commented 3 years ago

I used the following code to test tflite

Initialising the interpreter:

import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="resnet_float_16.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()

Then, inside the loop, getting the prediction:

interpreter.set_tensor(input_details[0]['index'], sample_image)
interpreter.invoke()
segment_logits = interpreter.get_tensor(211)

I just used it to get the mask

de-code commented 3 years ago

I had one slight performance issue in that I was resizing the tensors. Getting just the float_segments vs. getting all tensors doesn't seem to make a noticeable difference. I am still getting much higher timings.

Are you not resizing the input tensor as well? By default it seems to have a resolution of 769x433 (width x height). Or what internal resolution are you using? (I am using 417x241.)
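
In case it is the input size: a rough sketch of resizing the input tensor before allocation (assuming an NHWC [batch, height, width, channels] layout; 417x241 is the internal resolution mentioned above):

import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="resnet_float_16.tflite")

# Resize the input to 417x241 (width x height) before allocating tensors.
input_index = interpreter.get_input_details()[0]['index']
interpreter.resize_tensor_input(input_index, [1, 241, 417, 3])
interpreter.allocate_tensors()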

What TensorFlow version are you using? (I am using 2.3.1)