In the README, it says "Here we extract two pure nn models from the whole computation graph---pfe and rpn, this is to make it easier for trt to optimize its inference engines, and we use cuda to connect these nn engines."
Is there any repo/doc/link/tutorial that supports this argument? I.e., why is it easier for TRT to optimize its inference engines with two ONNX files rather than one?
TensorRT is developed to optimize NN inference, and the connection part between pfe & rpn (the voxel assigning) involves no NN computation, so I don't think TRT would optimize that part anyway.
Since TensorRT is something of a black box, I'd rather take over the non-NN part of the computation as much as possible, because then you can control things like memory allocation, thread allocation, and so on.
That is why I split the graph into two pure NN parts and connected them using CUDA.
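To make that concrete, here is a minimal sketch of what the CUDA "glue" between the two engines can look like: a scatter kernel that writes the PFE engine's per-pillar features into the dense canvas the RPN engine consumes. The kernel name, tensor shapes, and host wrapper here are my own illustration under assumed layouts, not the repo's actual implementation.

```cuda
#include <cuda_runtime.h>

// pfe_out : [num_pillars, num_features]     features produced by the PFE engine
// coords  : [num_pillars, 2]                (x, y) grid index of each pillar
// bev     : [num_features, grid_y, grid_x]  canvas fed to the RPN engine
__global__ void scatter_to_bev(const float* pfe_out, const int* coords,
                               float* bev, int num_pillars, int num_features,
                               int grid_x, int grid_y)
{
    int pillar = blockIdx.x * blockDim.x + threadIdx.x;
    if (pillar >= num_pillars) return;

    int x = coords[2 * pillar + 0];
    int y = coords[2 * pillar + 1];
    if (x < 0 || x >= grid_x || y < 0 || y >= grid_y) return;

    // Copy this pillar's feature vector into its cell of the BEV canvas.
    for (int f = 0; f < num_features; ++f) {
        bev[(f * grid_y + y) * grid_x + x] = pfe_out[pillar * num_features + f];
    }
}

// Host-side glue (sketch): clear the canvas, scatter, then the RPN engine
// can be enqueued on the same stream, so no host synchronization is needed.
void connect_engines(const float* d_pfe_out, const int* d_coords, float* d_bev,
                     int num_pillars, int num_features,
                     int grid_x, int grid_y, cudaStream_t stream)
{
    cudaMemsetAsync(d_bev, 0,
                    sizeof(float) * num_features * grid_x * grid_y, stream);
    int threads = 256;
    int blocks  = (num_pillars + threads - 1) / threads;
    scatter_to_bev<<<blocks, threads, 0, stream>>>(
        d_pfe_out, d_coords, d_bev,
        num_pillars, num_features, grid_x, grid_y);
}
```

Because this step is just index arithmetic and memory writes, owning it in a plain CUDA kernel lets you pick the layout, stream, and buffer reuse yourself instead of hoping TRT handles a non-NN op well.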