facebookresearch / d2go

D2Go is a toolkit for efficient deep learning
Apache License 2.0

Optimising d2go for Raspberry Pi #99

Open amrahsmaytas opened 3 years ago

amrahsmaytas commented 3 years ago

🚀 Feature

Optimising d2go models for the Raspberry Pi, along with multi-threading to use all four available cores on the Raspberry Pi.

Motivation & Examples

I tried running the QAT-optimised version of d2go on a Raspberry Pi 4, which has an ARM NEON architecture. According to the official repo, however, the QAT backend is only supported for mobile ARM architectures, so running the QAT-optimised model with the "qnnpack" backend on the Raspberry Pi gives an inference time of roughly 3-4 seconds. I want to achieve the speed mentioned in the d2go repo, which is about 50 milliseconds (0.05 seconds) for its pre-trained models (I will look into custom models once that is done).

Can anyone guide me on optimising the QAT model further for ARM NEON (the Raspberry Pi architecture), or suggest another way to optimise the d2go model above to reach the ~50 millisecond speed mentioned in the d2go repo?
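For reference, a minimal sketch of the two knobs being asked about here, selecting the QNNPACK quantized engine and using all four cores; this is illustrative rather than from the original post, and the TorchScript file name is a hypothetical placeholder:

import torch

# Select the QNNPACK quantized engine, which targets ARM CPUs
# (including the Raspberry Pi's).
torch.backends.quantized.engine = 'qnnpack'

# Let intra-op parallelism use all 4 cores of a Raspberry Pi 4.
torch.set_num_threads(4)

# Hypothetical: a TorchScript export of the quantized model.
model = torch.jit.load('d2go_qat_model.pt')
model.eval()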

Describe what the feature would look like, if it is implemented.

If this feature were implemented, inference would consume fewer resources and be lightning fast.

Looking forward to help from the community.

Thanks

P.S.: I was also thinking of experimenting with a (detectron2) PyTorch model → ONNX → TensorFlow → TFLite conversion, but I am not sure whether that would work, and if it did, whether I would get the speed mentioned in the official d2go repo (50 milliseconds). I would like suggestions on this part too.
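For what it's worth, the first hop of that pipeline (PyTorch to ONNX) uses the standard torch.onnx.export API. A hedged sketch follows, assuming a deployment-wrapped model whose forward takes a single tensor; the file names are placeholders, and the later ONNX to TensorFlow to TFLite hops are not shown:

import torch

# Hedged sketch of the first hop (PyTorch -> ONNX) only. 'model' must be a
# deployment-friendly module whose forward takes a single tensor; raw
# detectron2/d2go models take a list of dicts, so in practice you would
# export a traced/wrapped version, not the training-time module.
model = torch.jit.load('d2go_deploy_model.pt')  # hypothetical file
model.eval()

dummy = torch.rand(1, 3, 320, 320)  # fixed input size used for tracing
torch.onnx.export(
    model,
    dummy,
    'd2go_model.onnx',
    opset_version=11,  # a commonly supported opset
    input_names=['image'],
    output_names=['detections'],
)

# The ONNX -> TensorFlow -> TFLite hops would need a separate converter
# (e.g. the onnx-tf package) and are not shown here.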

@zhanghang1989 @petoor

maheshs11 commented 3 years ago

@wat3rBro It is taking 4 seconds to predict a frame on a Raspberry Pi 4.
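For anyone comparing numbers in this thread, a minimal timing sketch, using the model zoo and demo predictor shown in the d2go README; the image path is a placeholder:

import time
import cv2
from d2go.model_zoo import model_zoo
from d2go.utils.demo_predictor import DemoPredictor

# Pretrained model and predictor, as in the d2go README demo.
model = model_zoo.get('faster_rcnn_fbnetv3a_dsmask_C4.yaml', trained=True)
predictor = DemoPredictor(model)

img = cv2.imread('frame.jpg')  # placeholder image path

start = time.perf_counter()
outputs = predictor(img)
print(f'inference took {time.perf_counter() - start:.2f} s per frame')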

aleannox commented 3 years ago

I am using a Raspberry Pi 3B+.

With faster_rcnn_fbnetv3a_dsmask_C4.yaml, inference on one frame took ~10 s. With downscaling from 320 px to 160 px I got down to ~5 s. With downscaling and int8 quantization I got down to ~0.5 s.

For downscaling I used d2go.utils.demo_predictor.DemoPredictor with min_size_test=112 and max_size_test=160. For quantization I followed https://github.com/facebookresearch/d2go/blob/main/demo/d2go_beginner.ipynb. The default quantization engine did not work, so I used

# https://github.com/pytorch/android-demo-app/issues/104
import d2go.model_zoo.model_zoo  # provides get_config()

config = d2go.model_zoo.model_zoo.get_config('faster_rcnn_fbnetv3a_dsmask_C4.yaml')
config.QUANTIZATION.BACKEND = 'qnnpack'  # use the ARM-friendly QNNPACK backend

for saving and

# https://github.com/pytorch/pytorch/issues/29327#issue-518778762
import torch

torch.backends.quantized.engine = 'qnnpack'  # match the backend used when saving

for loading.
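Putting the pieces of this comment together, a hedged end-to-end sketch; the DemoPredictor keyword arguments come from the description above, and the int8 quantization step itself (which follows the linked notebook) is not reproduced here:

import torch
from d2go.model_zoo import model_zoo
from d2go.utils.demo_predictor import DemoPredictor

# Match the quantized engine the model will be run with on the Pi.
torch.backends.quantized.engine = 'qnnpack'

# Float pretrained model; the int8 quantization step would replace this
# float model, following the beginner notebook linked above.
model = model_zoo.get('faster_rcnn_fbnetv3a_dsmask_C4.yaml', trained=True)

# Downscale test inputs: shortest side 112 px, longest side capped at 160 px.
predictor = DemoPredictor(model, min_size_test=112, max_size_test=160)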

amrahsmaytas commented 3 years ago

Hey @aleannox , thanks for sharing :)

Can you also share some more information, such as:

Did you see any accuracy drop after downscaling + int8 quantization?

If yes, how much?

Also, how is your Pi 3B+ running? What do the RAM, memory, and temperature stats look like?

Also, curious to know whether you tried any DL compilers, such as Apache TVM, to optimise d2go for the Pi 3B+ hardware.

Thanks, Satyam.

aleannox commented 3 years ago

Hey @amrahsmaytas,

Sure, happy to share :) Actually, the ~1 s I mentioned earlier was too pessimistic; I see ~0.5 s inference time with downscaling and quantization.

I did not perform a quantitative analysis of accuracy. I am using the model to detect persons, and for this purpose I did not notice a performance drop.

The usage stats of my Pi with inference running are:

  • 200% CPU for the process (of 4 cores)
  • 30% RAM for the process (of 873 MB)
  • 83 °C temperature

I did not try DL compilers; for my purpose, the ~0.5 s is sufficient.

Hope this helps :)

amrahsmaytas commented 3 years ago

Yup, thanks for sharing 😃✌🏻

amrahsmaytas commented 3 years ago

Can you also let me know whether the Raspberry Pi OS you used on your Pi 3B+ is 64-bit or 32-bit?

Do you have information about segmentation too?

And could you please also share the code to my mail greetsatyamsharma@gmail.com? I would be really thankful 😌 😃

Thanks in advance, Satyam

maheshs11 commented 3 years ago

Can you share more details? 64-bit or 32-bit OS? If possible, can you share the code to shivarajmahesh11@gmail.com?

aleannox commented 3 years ago

Hi guys

I am using a 32-bit OS. I have not tried segmentation because I don't need it for my project. And you can find my code here: https://github.com/aleannox/leo/blob/main/vision.py

Cheers

amrahsmaytas commented 3 years ago

Thanks @aleannox 😃

GeorgePearse commented 1 year ago

I don't see faster_rcnn_fbnetv3a_dsmask_C4.yaml in the model zoo, and I've had some trouble training it well. How should it compare to faster_rcnn_fbnetv3g_fpn.yaml?