hezeli123 opened this issue 1 month ago
@hezeli123 Yes, you can offload it to the CPU, but the example script uses BF16 precision, and I'm not sure the CPU supports that. If it doesn't, you can upcast the model and the dummy input to FP32 before exporting, and set the trt_high_precision_flag to fp32 before calibration, or you can just use this config. However, you may hit a performance issue if you run trtexec directly on the FP32 quantized model. To get the best performance from quantization, it's recommended to convert the FP32 ONNX model back to BF16 using ONNX GraphSurgeon or another suitable tool.
Just curious: can you export the BF16 ONNX model on the same GPU?
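For reference, a minimal sketch of the FP32 fallback described above, assuming a FLUX backbone loaded through diffusers. The make_dummy_inputs() helper is hypothetical, a stand-in for whatever dummy tensors the export script constructs; the rest uses standard torch/diffusers calls.

```python
# Minimal sketch of the FP32 fallback described above, assuming a FLUX
# backbone from diffusers. make_dummy_inputs() is a hypothetical helper
# standing in for whatever dummy tensors the export script constructs.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
backbone = pipe.transformer

# Upcast the model and the dummy inputs to FP32 if the device doing the
# export cannot handle BF16.
backbone = backbone.to(torch.float32).eval()
dummy_inputs = {
    name: t.to(torch.float32) if t.is_floating_point() else t
    for name, t in make_dummy_inputs().items()  # hypothetical helper
}

with torch.no_grad():
    torch.onnx.export(
        backbone,
        (dummy_inputs,),  # a trailing dict is passed as keyword arguments
        "flux_backbone_fp32.onnx",
        opset_version=17,
    )
```

The resulting FP32 ONNX model can then be converted back to BF16 (e.g. with ONNX GraphSurgeon, as noted above) before building the TensorRT engine.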
I can export the BF16 ONNX model on an L20, and I can build a BF16 TRT engine.
Are there any simple tools for building FP8/INT8 TRT engines from a BF16 ONNX model? @jingyu-ml
@jingyu-ml Could the quantization method be applied to flux-schnell?
Yes, it can be applied to flux-schnell. @hezeli123
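For anyone trying this on flux-schnell, a minimal sketch of what applying ModelOpt's FP8 recipe to the backbone could look like, assuming the nvidia-modelopt mtq.quantize(model, config, forward_loop) API; the two-prompt calibration loop is illustrative only, and the real quantize.py script drives calibration with a larger prompt set.

```python
# Minimal sketch of FP8 calibration with ModelOpt on flux-schnell,
# assuming the mtq.quantize(model, config, forward_loop) API; the
# two-prompt calibration loop is illustrative only.
import torch
import modelopt.torch.quantization as mtq
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
).to("cuda")

def forward_loop(backbone):
    # Run a few prompts through the pipeline so ModelOpt can collect
    # activation ranges for the inserted quantizers.
    for prompt in ["a photo of a cat", "a train station at night"]:
        pipe(prompt, num_inference_steps=4)

pipe.transformer = mtq.quantize(pipe.transformer, mtq.FP8_DEFAULT_CFG, forward_loop)
```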
Thanks for the info. We are working on the memory issue and will let you know once there's an update.
@hezeli123 @jingyu-ml I hit the same problem when running "bash build_sdxl_8bit_engine.sh" (FP8, flux-dev) on an L40S.
How much GPU memory is needed to quantize flux-dev? Can it be offloaded to the CPU when there isn't enough GPU memory?
```
The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: ['station']
100%|██████████| 20/20 [00:19<00:00, 1.01it/s]   (this line repeated 7 times)
Traceback (most recent call last):
  File "/cv/TensorRT-Model-Optimizer/diffusers/quantization/quantize.py", line 239, in <module>
    main()
  File "/cv/TensorRT-Model-Optimizer/diffusers/quantization/quantize.py", line 234, in main
    backbone.to("cuda")
  File "/home/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1174, in to
    return self._apply(convert)
  File "/home/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 780, in _apply
    module._apply(fn)
  File "/home/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 780, in _apply
    module._apply(fn)
  File "/home/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 780, in _apply
    module._apply(fn)
  File "/home/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 805, in _apply
    param_applied = fn(param)
  File "/home/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1160, in convert
    return t.to(
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 90.00 MiB. GPU 0 has a total capacity of 44.32 GiB of which 65.25 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 43.87 GiB is allocated by PyTorch, and 41.57 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
/usr/lib/python3.10/tempfile.py:999: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmp0_xw28ay'>
  _warnings.warn(warn_message, ResourceWarning)
```
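In case it helps while the memory issue is being worked on, here is a minimal sketch of two workarounds suggested by the log above: the expandable_segments allocator setting from the OOM message, and diffusers' model CPU offload instead of moving the whole backbone to the GPU at once. The model id is illustrative.

```python
# Minimal sketch of two OOM workarounds hinted at in the log above.
# Assumes a diffusers FluxPipeline; the model id is illustrative.
import os

# Must be set before the CUDA caching allocator initializes (i.e. before
# the first CUDA allocation), as suggested in the OOM message.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)

# Instead of moving the whole backbone to the GPU at once (the
# backbone.to("cuda") call that OOMs above), let diffusers move each
# sub-model to the GPU only while it is running.
pipe.enable_model_cpu_offload()
```

Model CPU offload trades speed for memory; whether the calibration loop in quantize.py tolerates it is something the maintainers would need to confirm.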