NVIDIA / TensorRT-Model-Optimizer

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillation, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
https://nvidia.github.io/TensorRT-Model-Optimizer

Can we use multiple GPUs while exporting the (diffusers) ONNX model? #96

Open wxsms opened 3 weeks ago

wxsms commented 3 weeks ago

I'm building an SDXL model in float16 on 2x RTX 4090s, so the total GPU memory available is ~48 GB.

However, the script in diffusers/quantization does not seem to be able to use both of them, and it raises an OOM error while exporting the ONNX model.

I tried to export the model on the CPU, but it's too slow.

jingyu-ml commented 3 weeks ago

@wxsms could you try something like this?

        backbone.eval()
        with torch.no_grad():
            modelopt_export_sd(backbone, f"{str(args.onnx_dir)}", args.model, args.format)

And also move the other parts to CPU, like the VAE and CLIP. Please let me know if it works.
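
For reference, a minimal sketch of what that could look like, assuming the standard diffusers StableDiffusionXLPipeline and reusing modelopt_export_sd and args from the diffusers/quantization example script (the checkpoint name below is just a placeholder):

    import torch
    from diffusers import StableDiffusionXLPipeline

    # Load the SDXL pipeline in FP16 (checkpoint name is a placeholder).
    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    )

    # Keep only the backbone (UNet) on the GPU; push the other components
    # to CPU so the ONNX export has as much free VRAM as possible.
    pipe.vae.to("cpu")
    pipe.text_encoder.to("cpu")
    pipe.text_encoder_2.to("cpu")
    torch.cuda.empty_cache()

    backbone = pipe.unet.to("cuda").eval()
    with torch.no_grad():
        # modelopt_export_sd and args come from the example script.
        modelopt_export_sd(backbone, f"{str(args.onnx_dir)}", args.model, args.format)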

wxsms commented 3 weeks ago

> @wxsms could you try something like this?
>
>     backbone.eval()
>     with torch.no_grad():
>         modelopt_export_sd(backbone, f"{str(args.onnx_dir)}", args.model, args.format)
>
> And also move the other parts to CPU, like the VAE and CLIP. Please let me know if it works.

Sadly, it does not work. I managed to export the ONNX model on an A800 and compile it on a 4090.

jingyu-ml commented 2 weeks ago

I'll take a look and get back to you; I have barely tested on the 4090. Just to confirm, can you export the FP16 SDXL on a 4090?

wxsms commented 2 weeks ago

Thank you, I will try it later.

ZhenshengWu commented 3 days ago

Has there been any progress on this issue? I encountered the same problem on an RTX 4090. Eventually, I performed the ONNX model conversion on an A800. Using nvidia-smi, I noticed that the ONNX conversion process requires around 30 GB of VRAM.

Model: SDXL 1.0
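
For anyone who wants to double-check that figure from inside the process rather than via nvidia-smi, here is a small sketch using PyTorch's peak-allocation counter around the export call (again assuming backbone and args from the example script):

    import torch

    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        modelopt_export_sd(backbone, f"{str(args.onnx_dir)}", args.model, args.format)
    # Peak VRAM allocated by PyTorch during the export, in GiB; nvidia-smi
    # will show a somewhat larger per-process total (CUDA context, caches).
    print(f"Peak allocated: {torch.cuda.max_memory_allocated() / 2**30:.1f} GiB")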