NVIDIA-AI-IOT / Lidar_AI_Solution

A project demonstrating Lidar-related AI solutions, including three GPU-accelerated Lidar/camera DL networks (PointPillars, CenterPoint, BEVFusion) and the related libs (cuPCL, 3D SparseConvolution, YUV2RGB, cuOSD).

Spconv Engine Implementation #256

Open barrydoooit opened 1 week ago

barrydoooit commented 1 week ago

Hi @hopef !

I found a huge gap in inference latency on Orin for BEVFusion's lidar-scn backbone: about 30 ms when using the engine returned by spconv::load_engine_from_onnx versus about 100 ms with the original PyTorch implementation. I would like to understand how that spconv engine handles the forward pass, so where can we find the implementation of the Spconv Engine class (the one declared in engine.hpp)?

I suppose it lives in libspconv, but I found no trace of this class in traveller59's spconv repo. Was it implemented separately for the Lidar_AI_Solution project, and will it not be released? Looking forward to your reply.
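
For reference, this is roughly how I measure the backbone latency. The sketch below is only a timing harness: the `load_engine_from_onnx` call is commented out, the ONNX path is a placeholder, and the `engine->forward(...)` arguments are assumptions on my side (the real interface is whatever engine.hpp declares); only the CUDA event timing is standard API.

```cpp
// Minimal CUDA-event latency harness (sketch). Replace the run_backbone()
// body with the actual engine->forward(...) call from libspconv's engine.hpp;
// the commented call below is only a placeholder, not the real signature.
#include <cuda_runtime.h>
#include <cstdio>
#include <functional>

static float time_ms(cudaStream_t stream, const std::function<void()>& run,
                     int warmup = 10, int iters = 100) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int i = 0; i < warmup; ++i) run();   // exclude first-launch overhead
    cudaEventRecord(start, stream);
    for (int i = 0; i < iters; ++i) run();
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / iters;                        // average latency per iteration
}

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // auto engine = spconv::load_engine_from_onnx("lidar.backbone.onnx");  // placeholder path
    auto run_backbone = [&]() {
        // engine->forward(/* voxel features, coordinates, stream */);      // placeholder call
    };

    printf("scn backbone latency: %.3f ms\n", time_ms(stream, run_backbone));
    cudaStreamDestroy(stream);
    return 0;
}
```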

hopef commented 1 week ago

I'm sorry, but unfortunately we have no plan to open-source libspconv.so. The libspconv.so in our repository is independent of other implementations, so you will not find this class in third-party repositories. Please feel free to submit any questions you encounter.

BR, Thanks

barrydoooit commented 1 week ago

Hi @hopef, thank you for the clarification. It's a pity to miss the chance to inspect the techniques behind such significant acceleration. Nevertheless, may I ask what allows your custom spconv inference engine to run roughly 3x faster than the PyTorch implementation? In other words, does the speedup come primarily from refined memory management, as in TensorRT, or from algorithmic improvements in the sparse convolution kernels?

hopef commented 1 week ago

First of all, I want to clarify that there is no special or fundamentally different acceleration technology here. The following points might be worth considering:

  1. Which precision is used for inference: FP32, FP16, or INT8? The speed gap between them is very significant.
  2. Reusing the rule book across different layers matters a lot for inference latency (see the sketch after this list).
  3. Memory reuse strategies also help.
  4. Layer fusion has also been applied.
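
To illustrate point 2, here is a conceptual sketch of rule-book caching: sparse-conv layers that share the same sparsity pattern (e.g. submanifold convolutions with the same indice key) can look up a previously built rule book instead of rebuilding the index pairs. All types and names here are made up for illustration; this is not the libspconv implementation.

```cpp
// Conceptual sketch of rule-book reuse across sparse-conv layers.
// All types and names are illustrative, not the libspconv implementation.
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// A rule book pairs input voxel indices with output voxel indices
// for every kernel offset of a sparse convolution.
struct RuleBook {
    std::vector<int32_t> input_indices;
    std::vector<int32_t> output_indices;
    std::vector<int32_t> kernel_offsets;
};

class RuleBookCache {
public:
    // Layers that share a sparsity pattern receive the same cached rule book;
    // only the first such layer pays the cost of building it.
    const RuleBook& get_or_build(const std::string& indice_key,
                                 const std::function<RuleBook()>& build) {
        auto it = cache_.find(indice_key);
        if (it == cache_.end())
            it = cache_.emplace(indice_key, build()).first;
        return it->second;
    }

    // Call once per frame, since a new point cloud changes the sparsity pattern.
    void clear() { cache_.clear(); }

private:
    std::unordered_map<std::string, RuleBook> cache_;
};
```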