Dobiasd / frugally-deep

A lightweight header-only library for using Keras (TensorFlow) models in C++.

Slow Segmentation with MobileNetV3 #405

Closed KevinWang905 closed 1 year ago

KevinWang905 commented 1 year ago

Hi Tobias,

I'm trying to implement a segmentation model based on MobileNetV3 (TensorFlow's mobilenetv3_large, minimalistic variant) with an LR-ASPP segmentation head, which I trained in Python. After converting the model, the forward passes in Python take well under 1 s, but when I load it in C++, a forward pass takes about 8 s. I'm using WSL running Ubuntu 22.04. I'm pretty new to C++ development, so I may have made some compilation mistakes, but I'd love to get your feedback on why this speed discrepancy exists. I've posted the model-conversion and loading outputs below, and I can send you the model JSON as well.
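
For context, my main.cpp follows the usual frugally-deep pattern from the README, roughly like this (the input shape below is just an illustrative placeholder, not the model's real one):

#include <fdeep/fdeep.hpp>
#include <iostream>

int main()
{
    // Load the converted model; load_model also prints the loading/testing
    // timing lines shown in the output below.
    const auto model = fdeep::load_model("fdeep_mnv3_min_e1.json");

    // One forward pass on a dummy input tensor.
    const auto result = model.predict(
        {fdeep::tensor(fdeep::tensor_shape(224, 224, 3),
                       std::vector<float>(224 * 224 * 3, 0.0f))});
    std::cout << fdeep::show_tensors(result) << std::endl;
}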

> python keras_export/convert_model.py mnv3_LRASPP_min_epoch1 fdeep_mnv3_min_e1.json

loading mnv3_LRASPP_min_epoch1
2023-10-18 17:13:07.254822: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE SSE2 SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
1/1 [==============================] - 0s 390ms/step
Forward pass took 0.430125 s.
1/1 [==============================] - 0s 29ms/step
Forward pass took 0.052999 s.
1/1 [==============================] - 0s 29ms/step
Forward pass took 0.058005 s.
Starting performance measurements.
1/1 [==============================] - 0s 29ms/step
Forward pass took 0.057001 s.
> g++ -O3 -DNDEBUG -march=native -msse -msse2 -msse3 -msse4.1 -msse4.2 -mavx -mavx2 main.cpp
> ./a.out

Loading json ... done. elapsed time: 0.036306 s
Building model ... done. elapsed time: 0.032201 s
Running test 1 of 1 ... done. elapsed time: 8.861812 s
Loading, constructing, testing of fdeep_mnv3_min_e1.json took 8.936081 s overall.
model loaded successfully

Appreciate the work you've put into this library. Thanks! Kevin

Dobiasd commented 1 year ago

Hi Kevin,

thanks for the good report.

Your C++ compiler invocation looks OK for speed: the important -O3 and -DNDEBUG flags are there.

Since you have -march=native too, you can drop all the other -m... flags, i.e., end up with

> g++ -O3 -DNDEBUG -march=native main.cpp

Can you give this a try?

If that does not help, could you upload your model (the not-yet-converted version) so I can experiment with it and find the bottleneck?

KevinWang905 commented 1 year ago

Thanks! I had already tried that, and it runs at the same speed. I've sent you an email with a link to my model and some testing code. Let me know if you need anything else.

Dobiasd commented 1 year ago

Thank you. With the model you sent me, I just reproduced the performance problem locally. It's actually even worse on my machine.

I'll investigate and get back to you here.

Dobiasd commented 1 year ago

Profiling (with sysprof) showed that all the CPU time is burned exactly here.
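
(In case you want to reproduce such a profile: one way, assuming a Linux setup with Sysprof installed, is to keep debug info and frame pointers in the optimized build and record the run, e.g. something like the following, then open the resulting capture in the Sysprof GUI.)

> g++ -O3 -DNDEBUG -g -fno-omit-frame-pointer -march=native main.cpp
> sysprof-cli -- ./a.out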

In this MR, I had accidentally introduced an unnecessarily large (very redundant) calculation. :grimacing:

I just fixed it with this commit and released a new version.

Now, a forward pass with your model in frugally-deep is fast (~ 0.075 s on my machine). :tada:
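
In case you want to double-check the timing on your side, here is a minimal sketch (with a dummy input; adjust the shape to the model's real input):

#include <chrono>
#include <fdeep/fdeep.hpp>
#include <iostream>

int main()
{
    const auto model = fdeep::load_model("fdeep_mnv3_min_e1.json");
    const fdeep::tensor input(fdeep::tensor_shape(224, 224, 3),
                              std::vector<float>(224 * 224 * 3, 0.0f));

    // Time a single forward pass with a steady clock.
    const auto start = std::chrono::steady_clock::now();
    const auto result = model.predict({input});
    const auto stop = std::chrono::steady_clock::now();
    static_cast<void>(result); // output not needed for the timing itself

    std::cout << "Forward pass took "
              << std::chrono::duration<double>(stop - start).count()
              << " s" << std::endl;
}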

Thanks a lot for reporting this and providing such a good explanation (plus the example model)! :heart:

KevinWang905 commented 1 year ago

Thank you!