Open ovcharenkoo opened 1 year ago
@ovcharenkoo Thanks for reporting this. Could you share the following information with us first?
Based on that concrete information, we will try to reproduce the issue.
June
Closing this bug because it has gone inactive. Feel free to ask here if you still have a question or issue, and we will reopen it.
Hi June,
Branch: release/0.5.0
GPU: H100
CUDA: 12.2
Driver: 525.147.05
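These were collected from inside the container roughly as follows (a minimal sketch; it assumes torch, tensorrt, and tensorrt_llm all import cleanly in the release image):

```bash
# Report library versions plus GPU/driver info (sketch)
python3 - <<'EOF'
import tensorrt_llm, tensorrt, torch
print("tensorrt_llm:", tensorrt_llm.__version__)
print("tensorrt    :", tensorrt.__version__)
print("torch       :", torch.__version__, "| CUDA", torch.version.cuda)
EOF
nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
```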
The same error remains when trying to do AWQ quantization on the HF Llama-2-7b-chat model:
python quantize.py --model_dir /code/tensorrt_llm/llms/models--meta-llama--Llama-2-7b-chat-hf \
--dtype float16 \
--qformat int4_awq \
--export_path /code/tensorrt_llm/llms/llama2-7b-4bit-gs128-awq.pt \
--calib_size 32
I built the container with make -C docker release_build CUDA_ARCHS="86-real;90-real".
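For completeness, the full container workflow I used looks roughly like this (a sketch; release_run is my assumption for the corresponding run target, so please check docker/Makefile in the repo):

```bash
# Build the release image for Ampere (86) and Hopper (90), then start it
make -C docker release_build CUDA_ARCHS="86-real;90-real"
make -C docker release_run   # assumed run target; verify against docker/Makefile
```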
Could you try the latest main branch with the new command?
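Something along these lines (a sketch only; the flag names are taken from the examples/quantization README of recent releases and may differ on main, so please check that README first):

```bash
# Quantize the HF checkpoint to INT4-AWQ (sketch; verify flags against main)
python examples/quantization/quantize.py \
    --model_dir /code/tensorrt_llm/llms/models--meta-llama--Llama-2-7b-chat-hf \
    --dtype float16 \
    --qformat int4_awq \
    --awq_block_size 128 \
    --calib_size 32 \
    --output_dir /code/tensorrt_llm/llms/llama2-7b-awq-ckpt

# Then build the engine from the quantized checkpoint
trtllm-build --checkpoint_dir /code/tensorrt_llm/llms/llama2-7b-awq-ckpt \
             --output_dir /code/tensorrt_llm/llms/llama2-7b-awq-engine \
             --gemm_plugin float16
```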
Hi all,
I am trying to follow the instructions for INT8 weight-only + INT8 KV cache for Llama-2-13b.
Following the README, I run the conversion script from inside the container and get the following error.
The same error occurs when trying to do the SmoothQuant optimization.
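For reference, the steps I am following look roughly like this (a sketch based on my reading of the examples/llama README for this release; the paths are placeholders and the exact flag names are assumptions that may differ between versions, so please double-check against the README):

```bash
# 1) Convert the HF checkpoint and calibrate INT8 KV-cache scales (sketch)
python3 hf_llama_convert.py -i /path/to/Llama-2-13b-hf \
                            -o /tmp/llama-13b-bin \
                            --calibrate-kv-cache -t fp16

# 2) Build the engine with INT8 weight-only + INT8 KV cache (sketch)
python3 build.py --ft_model_dir /tmp/llama-13b-bin/1-gpu \
                 --dtype float16 \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16 \
                 --use_weight_only \
                 --int8_kv_cache \
                 --output_dir /tmp/llama-13b-engines/int8-kv-wo
```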
What should I recompile and how?
Thanks