NVIDIA / TensorRT-Model-Optimizer

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillation, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
https://nvidia.github.io/TensorRT-Model-Optimizer

Quant Flux-dev OOM on L20 #72

Open hezeli123 opened 1 month ago

hezeli123 commented 1 month ago

How much GPU memory is needed to quantize flux-dev? Can it be offloaded to the CPU when there is not enough GPU memory?

The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: ['station']
100%|██████████| 20/20 [00:19<00:00, 1.01it/s]  (progress bar repeated 7x, one per calibration run)
Traceback (most recent call last):
  File "/cv/TensorRT-Model-Optimizer/diffusers/quantization/quantize.py", line 239, in <module>
    main()
  File "/cv/TensorRT-Model-Optimizer/diffusers/quantization/quantize.py", line 234, in main
    backbone.to("cuda")
  File "/home/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1174, in to
    return self._apply(convert)
  File "/home/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 780, in _apply
    module._apply(fn)
  File "/home/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 780, in _apply
    module._apply(fn)
  File "/home/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 780, in _apply
    module._apply(fn)
  File "/home/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 805, in _apply
    param_applied = fn(param)
  File "/home/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1160, in convert
    return t.to(
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 90.00 MiB. GPU 0 has a total capacity of 44.32 GiB of which 65.25 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 43.87 GiB is allocated by PyTorch, and 41.57 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
/usr/lib/python3.10/tempfile.py:999: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmp0_xw28ay'>
  _warnings.warn(warn_message, ResourceWarning)

jingyu-ml commented 1 month ago

@hezeli123 Yes, you can offload it to the CPU, but the example script uses BF16 precision and I'm not sure the CPU supports that. If it doesn't, you can upcast the model and the dummy input to FP32 before exporting, and, before the calibration, set the trt_high_precision_flag to fp32, or you can just use this config. However, you might see a performance issue if you run trtexec directly on the FP32 quantized model. To get the best performance from quantization, it's recommended to convert the FP32 ONNX model to BF16 precision using ONNX GraphSurgeon or another suitable tool.

Just curious: Can you export the bf16 onnx model on the same GPU?
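A minimal sketch of the FP32-on-CPU export fallback described above, assuming the calibrated backbone is pipe.transformer. This is not the actual quantize.py code path, and make_dummy_inputs is a hypothetical placeholder for however the script builds its example tensors:

```python
import torch
from diffusers import FluxPipeline

# Load the pipeline in BF16, as the example script does; the part that gets
# quantized and exported is the Flux transformer backbone.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
backbone = pipe.transformer

# ... modelopt calibration of `backbone` runs here, as in quantize.py ...

# Instead of backbone.to("cuda") (the call that OOMs in the traceback above),
# upcast to FP32 and keep everything on the CPU for the ONNX export.
backbone = backbone.to(torch.float32).to("cpu").eval()

# make_dummy_inputs() is a hypothetical stand-in for however the script builds
# its example tensors; they must match the model, i.e. FP32 and on the CPU.
dummy_inputs = tuple(
    t.to(torch.float32).to("cpu") if torch.is_floating_point(t) else t.to("cpu")
    for t in make_dummy_inputs()
)
torch.onnx.export(backbone, dummy_inputs, "flux_backbone_fp32.onnx", opset_version=17)
```

The resulting FP32 ONNX file can then be converted to BF16 (e.g. with ONNX GraphSurgeon) before building the TensorRT engine, as recommended above.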

hezeli123 commented 1 month ago

I can export the BF16 ONNX model on the L20 and build the BF16 TRT engine.

hezeli123 commented 1 month ago

Are there any simple tools to build FP8/INT8 TRT engines from the BF16 ONNX model? @jingyu-ml
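One possible route is TensorRT's Python API: parse the quantized ONNX file and set the precision flags when building the engine. This is only a sketch, assuming a recent TensorRT build that exposes BuilderFlag.BF16 and BuilderFlag.FP8 (trtexec offers equivalent precision options); the file names are placeholders:

```python
import tensorrt as trt

# Parse the quantized ONNX file and build an engine, letting TensorRT use BF16
# kernels for unquantized layers and FP8 kernels for the Q/DQ-quantized ones.
logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("flux_backbone.quant.onnx", "rb") as f:  # placeholder file name
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.BF16)  # allow BF16 for layers without Q/DQ
config.set_flag(trt.BuilderFlag.FP8)   # honor the FP8 Q/DQ nodes from modelopt

serialized_engine = builder.build_serialized_network(network, config)
with open("flux_backbone.plan", "wb") as f:
    f.write(serialized_engine)
```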

hezeli123 commented 1 month ago

@jingyu-ml Could the quantization method be applied to flux-schnell?

jingyu-ml commented 1 month ago

Yes, it can be applied to flux-schnell @hezeli123.

Thanks for the info. We are working on the memory issue and will let you know once there's an update.
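For illustration, a sketch of applying the same post-training quantization flow to flux-schnell through modelopt's generic mtq.quantize entry point. The checkpoint id, calibration prompts, and FP8_DEFAULT_CFG recipe are assumptions here, not the exact settings quantize.py uses:

```python
import torch
from diffusers import FluxPipeline
import modelopt.torch.quantization as mtq

# flux-schnell loads exactly like flux-dev; it is a distilled model meant to be
# run with ~4 steps and guidance_scale=0.0.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
).to("cuda")

def forward_loop(transformer):
    # Calibration: run a few prompts through the full pipeline so the quantizers
    # inserted into the transformer observe realistic activations.
    for prompt in ["a photo of a train station", "a cat sitting on a windowsill"]:
        pipe(prompt, num_inference_steps=4, guidance_scale=0.0)

# FP8_DEFAULT_CFG is used here as a stand-in for the recipe quantize.py selects.
mtq.quantize(pipe.transformer, mtq.FP8_DEFAULT_CFG, forward_loop)
```

The main practical difference from flux-dev is that schnell is a distilled, guidance-free model, so the calibration runs use only a few denoising steps with guidance_scale=0.0.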

algorithmconquer commented 3 weeks ago

@hezeli123 @jingyu-ml I ran into the same problem when running "bash build_sdxl_8bit_engine.sh" (FP8, flux-dev) on an L40S.