TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
Here is my build command.
The model is Yi-34B with INT4 weight-only quantization.
The system hangs after running for a while. This only happens when a penalty is set, and only under high concurrency.
Here is my test code.
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:56:00.0 Off |                  Off |
| 30%   41C    P2              69W / 450W |  22526MiB / 24564MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        Off | 00000000:57:00.0 Off |                  Off |
| 30%   39C    P2              85W / 450W |  22524MiB / 24564MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
Once this happens, all requests get stuck.
I also tested the AWQ method and an A30 GPU, and it hangs there too. Sometimes only one GPU is at 100% utilization while the other is idle. Both 0.6.1 and 0.7.1 have the problem.
That is all the information I have. I am not sure whether it is caused by parallelism. I would appreciate any suggestions for debugging this.
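One way to narrow this down might be to replay the same concurrent load with and without the penalty and record which requests never return. Below is a minimal sketch using only the Python standard library. `send_request` is a hypothetical placeholder for whatever client call the test code makes; it is not a TensorRT-LLM API, and the timeout value is an assumption.

```python
import threading
import time


def find_stuck_requests(send_request, payloads, timeout_s=60.0):
    """Call send_request(payload) for all payloads concurrently.

    Returns the indices of payloads whose call had not returned
    within timeout_s seconds (counted from the start of the batch).
    send_request is a hypothetical, user-supplied callable.
    """
    finished = [False] * len(payloads)

    def worker(index, payload):
        try:
            send_request(payload)
        except Exception:
            # An error is still a response; only a missing return counts as a hang.
            pass
        finished[index] = True

    # Daemon threads so that a truly hung request cannot block interpreter exit.
    threads = [
        threading.Thread(target=worker, args=(i, p), daemon=True)
        for i, p in enumerate(payloads)
    ]
    for t in threads:
        t.start()

    # Share one deadline across all joins instead of waiting timeout_s per thread.
    deadline = time.monotonic() + timeout_s
    for t in threads:
        t.join(max(0.0, deadline - time.monotonic()))

    return [i for i, ok in enumerate(finished) if not ok]
```

Running this once with the penalty set and once without, at the same concurrency, would show whether the stuck requests correlate with the penalty or simply with load.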