Unity-Technologies / barracuda-release


Recommended settings for mobile #258

Closed rfilkov closed 2 years ago

rfilkov commented 2 years ago

Hi, based on your experience, do you have any recommendations for player settings or worker type (or any other settings) when running Barracuda inference on mobile devices (Android or iOS) and maximum performance is needed? I'm testing a model that works pretty well on a laptop but is quite slow on Android. I haven't tested it on iOS yet.

AlexRibard commented 2 years ago

Typically that depends on the type of model you're dealing with. Image-based models are much better suited to running on the GPU. Smaller transformer-type models can run efficiently on the CPU. Another thing to consider is where your input/output resides: on the GPU (a texture) or on the CPU (e.g. animation data). In that case you have to weigh whether the cost of syncing CPU/GPU is worth the inference speed gains.
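The advice above can be sketched roughly as follows. This is a hypothetical illustration, not verified against every Barracuda version: a GPU worker when the input already lives in a texture, a Burst (CPU) worker when the input is CPU-side data.

```csharp
using Unity.Barracuda;
using UnityEngine;

public class WorkerSelectionExample : MonoBehaviour
{
    public NNModel modelAsset;     // assumed assigned in the Inspector
    public Texture2D inputTexture; // GPU-resident input

    IWorker worker;

    void Start()
    {
        var model = ModelLoader.Load(modelAsset);

        // Texture input: keep everything on the GPU to avoid a CPU/GPU sync.
        worker = WorkerFactory.CreateWorker(WorkerFactory.Type.ComputePrecompiled, model);

        // CPU-side input (e.g. animation data) would instead favor:
        // worker = WorkerFactory.CreateWorker(WorkerFactory.Type.CSharpBurst, model);

        // Wrapping a texture as a tensor keeps the data GPU-side.
        using (var input = new Tensor(inputTexture, channels: 3))
        {
            worker.Execute(input);
            Tensor output = worker.PeekOutput(); // stays GPU-side until read back
            output.Dispose();
        }
    }

    void OnDestroy() => worker?.Dispose();
}
```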

AlexRibard commented 2 years ago

On mobile, the PixelShader worker can be quite efficient too.

rfilkov commented 2 years ago

Thank you @AlexRibard! I think the IL2CPP backend also increases performance a bit. It may be useful to include these kinds of recommendations in the Barracuda documentation. This particular model has a texture as input and 4 tensors (data) as output. I've currently commented out the copying of the output data and its processing, but it's still quite slow on Android, even with the pixel shader. I can share it if you'd like to take a closer look.
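A model with a texture input and several tensor outputs, as described above, is typically read via `PeekOutput(name)` per output. A minimal sketch (output names are hypothetical here; the real names are listed in `model.outputs` after loading):

```csharp
using Unity.Barracuda;

var model = ModelLoader.Load(modelAsset);
var worker = WorkerFactory.CreateWorker(WorkerFactory.Type.PixelShader, model);

using (var input = new Tensor(inputTexture, channels: 3))
{
    worker.Execute(input);

    // PeekOutput(name) returns a reference to each output tensor without
    // extra copies; defer reading the values (which forces a GPU readback)
    // until they are actually needed.
    foreach (var outputName in model.outputs)
    {
        Tensor t = worker.PeekOutput(outputName);
        // ... process t only when the values are required ...
    }
}
```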

AlexRibard commented 2 years ago

Yes, the C# Burst backend uses IL2CPP. In your case I'd recommend using either the PixelShader or ComputePrecompiled backend. Can you share your model in this thread? That way I can get a better idea.
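The backend choice suggested here can be sketched as picking ComputePrecompiled where compute shaders are supported and falling back to PixelShader otherwise (a sketch; `ValidateType` additionally falls back to a supported type on the current device):

```csharp
using Unity.Barracuda;
using UnityEngine;

// Prefer compute shaders where the device supports them (most modern
// mobile GPUs); PixelShader covers older GLES devices.
var type = SystemInfo.supportsComputeShaders
    ? WorkerFactory.Type.ComputePrecompiled
    : WorkerFactory.Type.PixelShader;
type = WorkerFactory.ValidateType(type);
var worker = WorkerFactory.CreateWorker(type, model);
```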

rfilkov commented 2 years ago

I've set the worker type to Auto, and then do this: `workerType = WorkerFactory.ValidateType(workerType);`. So I suppose the optimal worker type for the platform gets selected. Here is the link to the model.

AlexRibard commented 2 years ago

Thanks. Yeah, ComputePrecompiled/PixelShader will provide the best performance. Your model uses Relu6, which we don't currently fuse with Conv. We've seen more models using it instead of Relu, so we could fuse those to improve performance.

Apart from that, you could split inference over a few frames: https://github.com/Unity-Technologies/barracuda-release/blob/release/2.4.0/Documentation~/ModelExecution.md
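The linked ModelExecution docs describe scheduling execution layer by layer. A sketch of that pattern in a coroutine (the `layersPerFrame` budget is an arbitrary example value to tune per model and device):

```csharp
using System.Collections;
using Unity.Barracuda;
using UnityEngine;

IEnumerator StreamedInference(IWorker worker, Tensor input)
{
    // StartManualSchedule returns an enumerator that advances the
    // execution one layer per MoveNext(); spreading those calls across
    // frames amortizes the inference cost.
    IEnumerator schedule = worker.StartManualSchedule(input);
    int step = 0;
    const int layersPerFrame = 5; // hypothetical budget, tune per device

    while (schedule.MoveNext())
    {
        if (++step % layersPerFrame == 0)
            yield return null; // resume on the next frame
    }

    Tensor output = worker.PeekOutput();
    // ... use the result, then Dispose ...
}
```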

rfilkov commented 2 years ago

@AlexRibard may I ask you to take a look at one more model, when you have some time? Again, it works OK on desktop, but runs at 3-5 fps on Android.

AlexRibard commented 2 years ago

For this one I'd say the same thing: Clip is not fused. We are planning to extend fused activations, but for now only Relu is fused with Conv. Also, the end of this model contains a number of transposes. We do not remove them, to keep the model's output layout unchanged, but if you can remove them it will be good for performance. Apart from that, the Conv channel counts should ideally be divisible by 16 for better mobile performance.

AlexRibard commented 2 years ago

For your case I'd look into model splitting (https://github.com/Unity-Technologies/barracuda-release/blob/release/2.4.0/Documentation~/ModelExecution.md) or reducing the input image size.
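Reducing the input image size can be done GPU-side before building the input tensor. A sketch using standard Unity blitting (the 128x128 target is an arbitrary example; halving each input dimension cuts per-layer convolution work roughly 4x):

```csharp
using Unity.Barracuda;
using UnityEngine;

// Downscale the source image on the GPU, then feed the smaller texture.
RenderTexture small = RenderTexture.GetTemporary(128, 128, 0);
Graphics.Blit(sourceTexture, small); // GPU-side bilinear downscale

using (var input = new Tensor(small, channels: 3))
{
    worker.Execute(input);
}

RenderTexture.ReleaseTemporary(small);
```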

(closing the thread as immediate solutions have been explained, but feel free to comment if you have more questions)

rfilkov commented 2 years ago

Thank you, Alex! Hm, I think this Clip was a Relu before the conversion to ONNX format. And the hint regarding the transposes at the end looks very promising; I only need to find out how to modify the model. Thanks again!

Regarding splitting the model layers: I don't think this would be a good solution. It would only introduce a delay between the real-time image and the inference results. Barracuda mainly uses textures, shaders and buffers for its inference, and these can only be used on the main thread — and PeekOutput() in the same frame.

AlexRibard commented 2 years ago

For the splitting, you would spread execution over just a few frames (1-2 only), so yes, it all depends on your use case.