mallorbc opened this issue 9 months ago
SmoothQuant seems broken as well.
Our latest main branch no longer contains build.py under the examples/llama path. Are you using a legacy version of the code base? Please refer to the new workflow documentation for details on our latest code.
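For reference, the new workflow on main replaces build.py with a checkpoint conversion step followed by trtllm-build. A minimal sketch, assuming a Hugging Face LLaMA checkpoint; all paths here are placeholders:

```bash
# New two-step build workflow on the main branch (paths are placeholders)
cd examples/llama

# 1) Convert the Hugging Face checkpoint into TensorRT-LLM checkpoint format
python convert_checkpoint.py --model_dir /path/to/llama-hf \
                             --output_dir /tmp/llama/trt_ckpt \
                             --dtype float16

# 2) Build the engine from the converted checkpoint
trtllm-build --checkpoint_dir /tmp/llama/trt_ckpt \
             --output_dir /tmp/llama/trt_engine \
             --gemm_plugin float16
```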
I am using v0.7.1, the latest tag.
Please try the main branch if possible, since our upcoming release will also use the new build workflow.
I am using this software as well as tensorrtllm_backend.
I forget which project was having issues, but I was unable to build the Docker image at the time.
I will try again with the quantized models. bfloat16 seems to be working fine.
@nv-guomingz correct me if I am wrong, but tensorrtllm_backend is currently only compatible with TensorRT-LLM v0.7.1?
@mallorbc I got TensorRT-LLM v0.7.1 working with tensorrtllm_backend v0.7.2 with this docker run command:
```bash
docker run --rm -it -p 0.0.0.0:8000:8000 --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
    -v $(pwd)/all_models:/all_models \
    -v $(pwd)/scripts:/opt/scripts \
    -v ${HOME}/.cache/huggingface/:/root/.cache/huggingface/ \
    nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3 bash
```
The last part, `24.01`, is important. The reason is that the main version of TensorRT-LLM is not compatible with the backend.
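One way to keep the two sides aligned is to pin matching release tags rather than building from main. A sketch, assuming the v0.7.x wheels are published on NVIDIA's PyPI index:

```bash
# Sketch: pin a matching TensorRT-LLM release instead of main
# (assumes tensorrt_llm==0.7.1 is available on NVIDIA's index)
pip install tensorrt_llm==0.7.1 --extra-index-url https://pypi.nvidia.com

# Check out the backend at the matching release tag
git clone -b v0.7.2 https://github.com/triton-inference-server/tensorrtllm_backend.git
```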
@mallorbc Do you still have the problem? If not, we will close it soon.
System Info
Using 1 A100 GPU with nvidia-docker, and a slightly modified Dockerfile:
Who can help?
@trac
Information
Tasks
An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
Reproduction
Follow the steps in the `examples/llama` folder. For example:
It will fail.
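For context, the v0.7.x examples/llama README builds GPTQ engines from a checkpoint that has already been quantized by an external tool and is passed in via --quant_ckpt_path. A sketch of that shape, with placeholder paths and flags as documented in that README (the exact command the reporter ran is not preserved here):

```bash
# Legacy v0.7.1 GPTQ engine build (paths are placeholders)
# Note: this path expects a checkpoint already quantized by an external
# tool (e.g. GPTQ-for-LLaMa), supplied via --quant_ckpt_path.
python build.py --model_dir /path/to/llama-hf \
                --quant_ckpt_path /path/to/llama-4bit-gs128.safetensors \
                --dtype float16 \
                --use_weight_only \
                --weight_only_precision int4_gptq \
                --per_group \
                --output_dir /tmp/llama/gptq_engine
```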
Expected behavior
I am not sure whether the input is supposed to be an existing GPTQ model or whether the script performs the quantization itself (I think it is the latter).
Either way, it should either emit a clearer warning (in the first case) or produce a GPTQ model engine.
Actual behavior
It errors out.
Additional notes
I tried other quantization modes, such as AWQ, as well; same issue.
If the issue is related to the PyTorch changes in the Docker image, note that I made those changes to work around another issue with TensorRT-LLM.