Open hawkeoni opened 4 months ago
I am doing the same FP8 quantization, but on a Llama-2 34B model using 4xH100, and I am facing the same issue as well.
Saw a similar issue with A10G + Llama 3 8B. Also solved it with a similar trick: manually editing the source code in ammo to move tensors from GPU to CPU.
trt-llm will add the `--device` knob in a coming release; then you can specify `--device cpu` to avoid such OOM issues.
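Until that knob lands, the workaround described above boils down to doing the memory-heavy concatenation in host RAM instead of on the GPU. A rough, generic sketch of that pattern (this is not ammo's actual code; the helper name is made up for illustration):

```python
import torch

def stack_weights_on_cpu(weight_shards):
    """Concatenate a list of CUDA weight tensors without allocating the
    merged result on the GPU: offload each shard to host memory first."""
    host_shards = [w.detach().cpu() for w in weight_shards]
    return torch.cat(host_shards, dim=0)
```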
System Info
/proc/meminfo
Who can help?
@byshiue
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
tl;dr - Mixtral quantization fails on 2xH100 80GB GPUs, and I propose a small fix for it in nvidia-ammo.
Hi! I've been trying to convert Mixtral to FP8 using the latest version of TensorRT-LLM, but I ran into an OOM error:
I'm launching the conversion using the official script from here:
I have 2 H100 GPUs with ~81GB of memory each; the first GPU had 80GB of memory allocated and the second had 40GB, and the conversion failed.
Expected behavior
Successful model conversion
Actual behavior
OOM Failure
Additional notes
I've managed to fix it by going deep into nvidia-ammo. My version is:
I've found that in the file `ammo/torch/export/layer_utils.py`, in the function `_build_stacked_linear`, the tensor concatenation results in an OOM, so I fixed it by moving the tensors to the CPU. That solved the problem, and both of my GPUs had 46GB used.
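For anyone who wants to apply the same edit locally, here is a rough before/after sketch of the kind of change I mean, assuming `_build_stacked_linear` merges the per-expert weights with `torch.cat` (the variable names and shapes below are made up for illustration, not the actual ammo code):

```python
import torch

# Dummy stand-ins for the per-expert weight tensors that get stacked
# (shapes are made up; the real ones come from the Mixtral checkpoint).
weights = [torch.randn(1024, 1024, device="cuda") for _ in range(8)]

# Before: concatenating directly allocates the full stacked tensor on the GPU,
# which is where the OOM happens.
# merged = torch.cat(weights, dim=0)

# After: move each tensor to host memory first and concatenate on the CPU.
merged = torch.cat([w.cpu() for w in weights], dim=0)
```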
I do not have any access to the nvidia-ammo codebase; hopefully this helps everyone running into this issue, and maybe some sort of `.cpu()` fix can be merged into nvidia-ammo for the next release?