Open andakai opened 4 months ago
I built a new image with the latest version from https://github.com/NVIDIA/TensorRT-LLM/pull/1274, following the doc https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/build_from_source.md#option-1-build-tensorrt-llm-in-one-step, and I can see that quantize.py has changed. However, when I run the quantization command below on 2xA100-40G to quantize AquilaChat2-34B, an OOM still occurs:
python ../quantization/quantize.py --model_dir /tmp/AquilaChat2-34B \
--dtype float16 \
--qformat int4_awq \
--awq_block_size 128 \
--output_dir ./quantized_int4-awq \
--calib_size 1
Does the 34B model really need this much memory?
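For a rough sense of scale (my own back-of-envelope estimate, not an official figure), the fp16 weights of a 34B-parameter model alone come to about 34e9 params × 2 bytes ≈ 63 GiB, which you can check with:
python -c "print(f'{34e9 * 2 / 2**30:.0f} GiB of fp16 weights')"
Two A100-40G cards give 80 GiB in total, so once the CUDA context, AWQ calibration activations, and any temporary buffers are added there is very little headroom; an OOM during calibration seems plausible even though the raw weights nominally fit.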
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
System Info
Who can help?
@Tracin
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
I want to quantize AquilaChat2-34B, whose architecture is Llama-like, using the same commands as for Llama (https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama). Specifically, I want to apply int8 KV cache + AWQ quantization, as sketched below.
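For reference, the Llama example README pairs AWQ with an INT8 KV cache through the --kv_cache_dtype flag; a sketch of the command I am aiming for (paths and --calib_size are placeholders, and I am assuming the flag applies the same way to a Llama-like checkpoint such as AquilaChat2):
python ../quantization/quantize.py --model_dir /tmp/AquilaChat2-34B \
--dtype float16 \
--qformat int4_awq \
--awq_block_size 128 \
--kv_cache_dtype int8 \
--output_dir ./quantized_int4-awq-int8kv \
--calib_size 32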
Expected behavior
The model is quantized successfully with the int8 KV cache + AWQ method.
actual behavior
I tried running the commands both inside and outside the container; both runs hit "CUDA OOM". The full log is:
additional notes
I also tried it on 4xA100-40G and still hit CUDA OOM.