NVIDIA / TensorRT-Model-Optimizer

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization, pruning, and distillation. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
https://nvidia.github.io/TensorRT-Model-Optimizer

Does Model Optimizer work on Windows? #68

Closed. adaber closed this issue 2 months ago.

adaber commented 2 months ago

Hi,

The main GitHub page lists only Linux, but has anyone tested Model Optimizer on Windows?

Thanks!

riyadshairi979 commented 2 months ago

Unofficially, we have used features from modelopt.onnx.quantization on Windows; the pip installation instructions are the same. Some of the modelopt.torch features might work as well.
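
Roughly, the flow we used looks like this (a sketch; file names are placeholders and parameter names follow the docs at the time, so double-check against your installed version):

```python
# Unofficial Windows flow (sketch). Install is the same as on Linux:
#   pip install nvidia-modelopt
import numpy as np

from modelopt.onnx.quantization import quantize

# Calibration data: a NumPy array whose shape matches the model input.
calib = np.load("calib.npy")

quantize(
    onnx_path="model.onnx",
    quantize_mode="int8",
    calibration_data=calib,
    output_path="model.quant.onnx",
)
```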

adaber commented 2 months ago

Thanks for the quick response, @riyadshairi979 !

My plan is to use modelopt.torch.quantization initially (INT8 quantization). I guess I should try it and see if it works.
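
Roughly, I was planning something like this, per the modelopt docs (`model` and `calib_loader` here are placeholders):

```python
import modelopt.torch.quantization as mtq

# Calibration: run a few representative batches through the model.
def forward_loop(model):
    for batch in calib_loader:  # placeholder data loader
        model(batch)

# INT8 PTQ with the default config (per the documented mtq API).
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)
```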

Do you plan to officially support Windows at some point, too?

Thanks!

riyadshairi979 commented 2 months ago

> My plan is to use modelopt.torch.quantization initially (INT8 quantization).

If your deployment runtime is TensorRT, then we recommend using modelopt.onnx.quantization. ONNX exported after INT8 PTQ with modelopt.torch.quantization is not optimal on TensorRT.
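
For example, the recommended path looks roughly like this (a sketch; paths, shapes, and `model`/`calib` are placeholders):

```python
import torch

from modelopt.onnx.quantization import quantize

# Export the torch model to ONNX first...
model.eval()
dummy = torch.randn(1, 3, 512, 512)  # example static shape
torch.onnx.export(model, dummy, "model.onnx", opset_version=17)

# ...then run INT8 PTQ on the ONNX graph.
quantize(
    onnx_path="model.onnx",
    quantize_mode="int8",
    calibration_data=calib,  # NumPy array matching the input shape
    output_path="model.quant.onnx",
)
# model.quant.onnx can then be built into a TensorRT engine, e.g. with trtexec.
```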

> Do you plan to officially support Windows at some point, too?

Yes, we are working on official support for Windows.

adaber commented 2 months ago

Hi @riyadshairi979,

Firstly, thank you for your help!

> If your deployment runtime is TensorRT, then we recommend using modelopt.onnx.quantization. ONNX exported after INT8 PTQ with modelopt.torch.quantization is not optimal on TensorRT.

Thanks for sharing this important information. It will definitely save me time because now I know which sub-package to focus on. I will give it a try and report the results. I may follow up with a question or two regarding this particular sub-package.

> Do you plan to officially support Windows at some point, too?

> Yes, we are working on official support for Windows.

It's great to hear that.

Thanks!

adaber commented 2 months ago

I guess I already have two questions.

1) I assume that Model Optimizer does the calculations on a GPU, but I couldn't find an option that lets me pick which GPU to use when there is more than one in the system. Is there one? (I might've missed it, though.)

2) Does modelopt.onnx.quantization work with dynamic input shapes? I did a quick test and "Tensor shape doesn't match for input" popped up. I can see that ModelOpt's code compares the input_shape and calibration_data sizes, so I assume the input tensor's dynamic dimensions are what caused the assertion error.

Thanks!

riyadshairi979 commented 2 months ago
  1. By default modelopt uses the CPU for calibration; we will provide a way for users to choose the ExecutionProvider (GPU, etc.). The sketch below shows what that selection looks like.
  2. We will add dynamic shape support in the next release.
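
For context, this is what execution provider selection looks like in plain onnxruntime; it is only an illustration, not a modelopt option (that knob does not exist yet):

```python
import onnxruntime as ort

# ONNX Runtime picks the device through execution providers, in order of
# preference; it falls back to CPU if CUDA is unavailable.
sess = ort.InferenceSession(
    "model.quant.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(sess.get_providers())  # shows which providers are actually active
```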
adaber commented 2 months ago

@riyadshairi979 Thanks for the quick response. It's very much appreciated.

Do you happen to know the approximate time frame for adding dynamic shape and GPU support?

Thanks!

riyadshairi979 commented 2 months ago

Our next release, modelopt v0.19, will have dynamic shape and GPU support; it is scheduled for release on 21 Oct 2024.

adaber commented 2 months ago

That's great to hear, @riyadshairi979. Thanks!

riyadshairi979 commented 2 months ago

@adaber do you have a link to a sample ONNX model with dynamic shapes and calibration data to test with? I assume you are interested in dynamic shapes other than the batch dimension. If that's the case, what would the calibration tensor shape look like if the corresponding input tensor has multiple dynamic dimensions, say a shape like [batch_size, 8, dim_2, 16]?

adaber commented 1 month ago

@riyadshairi979

Sorry for the late response. I didn't think you'd post here again, so I didn't check whether the thread had been updated.

I'll try to help as much as I can since we really want to use this tool for model quantization in the future. :)

I use fully convolutional neural networks for semantic segmentation, so my dynamic input shapes are usually [-1, -1, -1, 3] (batchSize x H x W x numChannels) or [-1, 3, -1, -1] (batchSize x numChannels x H x W). I'm not sure I can provide any samples since they belong to the company I work for, but you can just use a UNet model and create simple synthetic samples for semantic segmentation.
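
Here is a minimal synthetic example of how such a model gets its dynamic dimensions (a stand-in conv layer instead of our proprietary networks):

```python
import torch
import torch.nn as nn

# Stand-in for a UNet-style fully convolutional net.
model = nn.Conv2d(3, 1, kernel_size=3, padding=1).eval()

dummy = torch.randn(1, 3, 512, 512)  # batchSize x numChannels x H x W
torch.onnx.export(
    model, dummy, "unet.onnx",
    input_names=["input"],
    output_names=["mask"],
    dynamic_axes={
        "input": {0: "batch", 2: "height", 3: "width"},
        "mask": {0: "batch", 2: "height", 3: "width"},
    },
)
# The exported ONNX input shape becomes [-1, 3, -1, -1].
```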

Dynamic input shape and GPU support are crucial for this tool to be efficient for semantic segmentation model quantization, so please let me know if there is anything else I can help with.

Thanks!