NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

[Issue] PTQ INT8 calibration OOM #587

Closed: wjj19950828 closed this 8 months ago

wjj19950828 commented 8 months ago

I'm following this Issue and want to use my own dataset to obtain the SmoothQuant (SQ) scale values.

The scenario is a 70B model with tp=2, and the length of input_ids is not limited, rather than capped at the script's default of 923.

But OOM currently occurs during the model's forward pass. I suspect the calibration running through the HF interface is what causes the OOM.

Is there any solution for this? Thanks~

Specific command:

python3 hf_llama_convert.py -i $IN_MODEL -o $OUT_MODEL -sq 0.8 --tensor-parallelism 2 --storage-type fp16 -p 1
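
For reference, a minimal sketch of the idea being described: capping the per-sample sequence length during calibration keeps the peak activation memory of the HF forward pass bounded. Everything below (MODEL_DIR, MAX_SEQ_LEN, calib_texts) is hypothetical and not part of hf_llama_convert.py; it only illustrates truncating calibration inputs when running the Hugging Face model directly.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_DIR = "/path/to/llama-70b"  # hypothetical path, stands in for $IN_MODEL
MAX_SEQ_LEN = 923                 # the script's default cap; raise with care

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
# fp16 weights sharded across visible GPUs (requires accelerate);
# keeps the 70B model from loading onto a single device.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

calib_texts = ["example calibration text"]  # replace with your own dataset

with torch.no_grad():
    for text in calib_texts:
        # Truncate each sample so a single long sequence cannot blow up
        # activation memory during the forward pass.
        inputs = tokenizer(
            text,
            return_tensors="pt",
            truncation=True,
            max_length=MAX_SEQ_LEN,
        ).to(model.device)
        model(**inputs)

With device_map="auto" plus per-sample truncation, the calibration forward pass is more likely to stay within GPU memory, at the cost of ignoring tokens beyond MAX_SEQ_LEN when computing the scales.
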
wjj19950828 commented 8 months ago

@wm2012011492

Do you have any suggestions? Thanks~

Tracin commented 8 months ago

Hi, which branch did you use?

wjj19950828 commented 8 months ago

@Tracin

I am using the release/0.5.0 branch. Is there a fix for this in 0.6.1?

Tracin commented 8 months ago

@wjj19950828 Please try with version 0.6.0 or later.
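
For example, assuming the release branches follow the same naming pattern as the release/0.5.0 branch mentioned above:

git fetch origin
git checkout release/0.6.1

then rebuild and reinstall TensorRT-LLM from that branch before rerunning the conversion.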