Project-MONAI / MONAI

AI Toolkit for Healthcare Imaging
https://monai.io/
Apache License 2.0

Proposal: TRT Acceleration API improvement #8018

Open borisfom opened 3 months ago

borisfom commented 3 months ago

There are a number of issues with the current TRT acceleration path in MONAI.

Describe the solution you'd like
I would like to have a self-contained recipe in inference.json for TRT conversion that would work for multiple inputs with different data types, and also for cases where only some submodules can be converted to TRT. Ideally, there should be no explicit TRT export pass; conversion would happen on demand during inference and be cached. Persistent TRT engines would be rebuilt if I change the config or update the TRT version. I would like to have a single inference.json for TRT and non-TRT inference, with a TRT on/off switch on the command line.
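
For illustration, with a single config the switch could look like this (the 'trt' flag here is part of the proposal, not an existing bundle option, so this is only a sketch of the intended usage):

    # hypothetical usage: the same inference.json serves both paths, and the proposed
    # top-level "trt" flag is toggled as a config override via the bundle API
    from monai.bundle import run

    run(config_file="configs/inference.json")            # plain PyTorch inference
    run(config_file="configs/inference.json", trt=True)  # same config, TRT acceleration enabled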

Describe alternatives you've considered
I explored a few of the model-zoo models that still need TRT, and I think we can come up with an elegant TRT acceleration solution using extended network definitions in typical inference.json configs, utilizing TRTWrapper (from https://github.com/Project-MONAI/MONAI/pull/7990). network_def would have an optional 'trt' section (or sections) describing which submodules (or the whole net) should be wrapped in TRTWrapper. convert_to_trt() can probably be fixed for multi-input/multi-datatype nets, especially if we add network_data_format to metadata.json, but that would still require an additional conversion pass (which has to be explicitly rerun each time TRT is upgraded or the config changes), etc. TRTWrapper should be way more user-friendly.

In a nutshell, a sample network_def would look like:

    "network_def": {
        "_target_": "SegResNet",
        "spatial_dims": 3,
         ...
        "trt" : [ 
            {
                "submodule": "image_encoder",
                "path": f"{bundle_root}/image_encoder",
                "precision" : "fp16",
                "build_args" : {
                    "builder_optimization_level": 5,
                    "precision_constraints": "obey",
                },
            },
          ...
          ]

So for each net, one can specify the set of submodules that should be wrapped in TRTWrapper (or the whole net if no submodule is given). The “path” basename can probably also be generated automatically from the checkpoint name, but an explicit value would do as well. No other configuration or explicit TRT conversion pass would be needed for export! Generated .plan files would be automatically rebuilt on config file change. We would not need two separate inference.json and inference_trt.json configs - just a top-level ‘trt’ config flag, overridable from the command line.

At workflow initialization time, the above config would result in the replacement of the network's “image_encoder” attribute with:

            import os

            # use the config file's modification time so the cached engine is rebuilt
            # whenever the config changes
            ts = os.path.getmtime(config_file)
            setattr(net, "image_encoder", TRTWrapper(
                getattr(net, "image_encoder"),
                f"{bundle_root}/image_encoder",
                precision="fp16",
                build_args={
                    "builder_optimization_level": 5,
                    "precision_constraints": "obey",
                },
                timestamp=ts,
            ))

binliunls commented 3 months ago

Hi @borisfom, thanks for this wonderful proposal! It's very impressive.
I have three questions about it. Let's discuss them here.

  1. I think we still need an option to reuse an existing exported engine for inference, instead of performing the conversion every time we run inference, since the conversion itself takes a lot of time.
  2. The TRTWrapper way seems like a PyTorch --> ONNX --> TRT (onnx-trt) way to perform the conversion. Can it cover all conversion cases? Is it possible that there are some cases that can only be handled through the PyTorch --> Torch-TensorRT (torch-trt) way? I think we may still need the torch-trt conversion if the onnx-trt way cannot cover all cases.
  3. For the submodule part, I would suggest not adding the trt parameter to every network_def as shown in your nutshell example. The network_def is used for instantiating a network (CNN, RNN, Transformer...). Because a network_def is usually an implementation of a particular algorithm, the contributor of that algorithm may not have any knowledge of TensorRT, so requiring every contributor to provide the trt parameter would be unreasonable. I think a better way is to add a new list-like class that patches submodules into the bundle via a fixed parameter like trt_submodule; then, at bundle initialization, we can replace the submodules if that parameter is provided (see the sketch after this list).
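
A rough sketch of what I mean (the names trt_submodule and apply_trt_patches are just placeholders):

    # rough sketch only: "trt_submodule" and "apply_trt_patches" are placeholder names,
    # illustrating how TRT settings could live outside network_def and be applied at
    # bundle initialization time
    trt_submodule = [
        {"submodule": "image_encoder", "precision": "fp16"},
    ]

    def apply_trt_patches(net, patches, bundle_root):
        for p in patches:
            name = p["submodule"]
            # wrap only the listed submodules; everything else stays plain PyTorch
            setattr(net, name, TRTWrapper(
                getattr(net, name),
                f"{bundle_root}/{name}",
                precision=p.get("precision", "fp16"),
            ))
        return net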

Thanks Bin

borisfom commented 3 months ago

Hi @binliunls! Thanks for the input!

  1. The way the wrapper currently works, it will reuse an existing engine the first time we run inference if it is already there (and has a timestamp later than the config file).
  2. Yes, absolutely we should have a Torch-TRT option (and later, dynamo). I will add it to the existing TRTWrapper PR.
  3. Agree - having the trt section as an optional patch makes better sense.
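
Roughly, the reuse check is just a modification-time comparison (a simplified sketch, not the exact TRTWrapper code):

    import os

    # simplified sketch of the reuse rule: a cached engine is used only if it exists
    # and is newer than the config that produced it; otherwise it is rebuilt
    def engine_is_current(engine_path: str, config_path: str) -> bool:
        return os.path.exists(engine_path) and os.path.getmtime(engine_path) >= os.path.getmtime(config_path)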

borisfom commented 3 months ago

@binliunls: I updated the TRTWrapper PR with a torch-tensorrt option (not tested yet, may need refinements).
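
For reference, the torch-tensorrt path generally looks like this (a generic sketch of the torch_tensorrt API, not the code in the PR; shapes and precision are only examples):

    import torch
    import torch_tensorrt

    # compile the module directly with Torch-TensorRT instead of going through ONNX
    trt_model = torch_tensorrt.compile(
        model,
        inputs=[torch_tensorrt.Input((1, 1, 96, 96, 96), dtype=torch.half)],
        enabled_precisions={torch.half},
    )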

borisfom commented 3 months ago

Here, I implemented the first POC for brats_mri_generative_diffusion from the model zoo: https://github.com/Project-MONAI/model-zoo/pull/620

borisfom commented 3 months ago

@binliunls: I did some refactoring of the TRTWrapper/trt_wrap API after working on a few use cases. Now TRTWrapper is not supposed to replace the original module; instead it is saved on the original module, and the module's forward() is monkey-patched with a method that calls TRTWrapper. This way, wrapping can happen before the checkpoint load (as the model structure, and therefore the state dict, is unchanged). trt_wrap still returns the original model to facilitate use in configs. I have also added a trt_handler class for alternative use of trt_wrap as a handler: check the pathology_nuclei_classification use case (I did implement the handler there, but the default run seems to exit prematurely before it gets to the execution - am I missing some step in the initialization?)
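
In pseudocode, the new scheme is roughly the following (a simplified sketch with placeholder names, not the actual PR code):

    # sketch only: the wrapper is stored on the original module and forward() is redirected
    # through it, so parameters and state dict keys stay unchanged and checkpoints can be
    # loaded before or after wrapping
    def trt_wrap_sketch(module, engine_path, **build_args):
        wrapper = TRTWrapper(module, engine_path, **build_args)
        # store the wrapper outside nn.Module attribute registration to keep the state dict intact
        object.__setattr__(module, "_trt_wrapper", wrapper)
        module.forward = wrapper.forward   # monkey-patch: inference now goes through TRT
        return module                      # the original module is returned, for easy use in configs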

binliunls commented 3 months ago

Hi @borisfom, I have some suggestions for this new implementation:

  1. Could you please decouple the TRT conversion from the inference part? Otherwise, we must run inference after the conversion; however, in some cases we just want to do the conversion (like a case without an inference dataset).
  2. We can retain the inference_trt.json file in the bundle. I saw you modified a few parameters in the inference.json file here. Actually, these modifications can be added to inference_trt.json. Then when you run the inference, you can choose to run the command python -m monai.bundle run --config_file "['configs/inference.json', 'configs/inference_trt.json']". It will merge the two configurations into a new one and run the inference. In this way, we can maintain the original inference configuration.
  3. Could you please provide an example conversion that contains some modules that cannot be converted through TRT and need to fall back to torch, like the vista3d case? This way we can showcase how to set up the parameters for models that contain submodules which cannot be converted to TRT.

Thanks
Bin

borisfom commented 3 months ago

@binliunls :

  1. Well, it's coupled here by design. Decoupling conversion from inference is not very practical, as then one must supply the inputs - like your current conversion scripts do. Fully supporting models with multiple inputs and data types would require more complex parsing, which is extremely error-prone. What is the actual use case where a user needs to convert as a separate step before running inference?
  2. Yes, it's a viable option.
  3. An example of partial conversion can now be found in the Vista2D PR: https://gitlab-master.nvidia.com/dlmed/cell_seg_foundation/-/merge_requests/21/diffs#588492bfd1686e5ddd610962e65408876a2a70ef_89_102 The Vista3D PR also does it in a different form (in infer.py) - I will port it to the model-zoo format once Vista3D is there.