Can't run converted model

NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.

Apache License 2.0

System Info

tensorrt-llm 0.8

Who can help?

No response

Information

[ ] The official example scripts
[ ] My own modified scripts

Tasks

[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)

Reproduction

python3 ../run.py --input_text "你好，请问你叫什么？" \
    --max_output_len=50 \
    --tokenizer_dir /home/name/.cache/huggingface/hub/models--Qwen--Qwen-7B-Chat-Int4/snapshots/5a9a1ed9203a3d4ccc62f24fbb9b7e3948ce8ec1 \
    --engine_dir=./tmp/Qwen/7B/trt_engines/int4-gptq/1-gpu

Expected behavior

The command runs the engine and returns the model's answer to the prompt.

Actual behavior

(python3_10) name@nvidia-desktop:~/code/TensorRT-LLM/examples/qwen$ python3 ../run.py --input_text "你好，请问你叫什么？" --max_output_len=10 --tokenizer_dir /home/name/.cache/huggingface/hub/models--Qwen--Qwen-7B-Chat-Int4/snapshots/5a9a1ed9203a3d4ccc62f24fbb9b7e3948ce8ec2 --engine_dir=./tmp/Qwen/7B/trt_engines/int4-gptq/1-gpu/
[TensorRT-LLM] TensorRT-LLM version: 0.8.0.dev20240123, commit: b57221b764bc579cbb2490154916a871f620e2c4
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
[nvidia-desktop:839642] Process received signal
[nvidia-desktop:839642] Signal: Segmentation fault (11)
[nvidia-desktop:839642] Signal code: Address not mapped (1)
[nvidia-desktop:839642] Failing at address: 0x440000f1
[nvidia-desktop:839642] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffffa86eb7bc]
[nvidia-desktop:839642] [ 1] /lib/aarch64-linux-gnu/libmpi.so.40(PMPI_Comm_set_errhandler+0x50)[0xffff92b582a0]
[nvidia-desktop:839642] [ 2] /home/wanghaikuan/anaconda3/envs/python3_10/lib/python3.10/site-packages/mpi4py/MPI.cpython-310-aarch64-linux-gnu.so(+0x2cb58)[0xfffe47c7cb58]
[nvidia-desktop:839642] [ 3] python3(PyModule_ExecDef+0x68)[0xaaaadbc5a778]
[nvidia-desktop:839642] [ 4] python3(+0x1380e4)[0xaaaadbce80e4]
[nvidia-desktop:839642] [ 5] python3(+0x219590)[0xaaaadbdc9590]
[nvidia-desktop:839642] [ 6] python3(_PyEval_EvalFrameDefault+0x137c)[0xaaaadbc16b0c]
[nvidia-desktop:839642] [ 7] python3(+0x117460)[0xaaaadbcc7460]
[nvidia-desktop:839642] [ 8] python3(_PyEval_EvalFrameDefault+0x6a30)[0xaaaadbc1c1c0]
[nvidia-desktop:839642] [ 9] python3(+0x117460)[0xaaaadbcc7460]
[nvidia-desktop:839642] [10] python3(_PyEval_EvalFrameDefault+0x6350)[0xaaaadbc1bae0]
[nvidia-desktop:839642] [11] python3(+0x117460)[0xaaaadbcc7460]
[nvidia-desktop:839642] [12] python3(_PyEval_EvalFrameDefault+0x5950)[0xaaaadbc1b0e0]
[nvidia-desktop:839642] [13] python3(+0x117460)[0xaaaadbcc7460]
[nvidia-desktop:839642] [14] python3(_PyEval_EvalFrameDefault+0x5950)[0xaaaadbc1b0e0]
[nvidia-desktop:839642] [15] python3(+0x117460)[0xaaaadbcc7460]
[nvidia-desktop:839642] [16] python3(+0x7cc84)[0xaaaadbc2cc84]
[nvidia-desktop:839642] [17] python3(_PyObject_CallMethodIdObjArgs+0xe0)[0xaaaadbc2d054]
[nvidia-desktop:839642] [18] python3(PyImport_ImportModuleLevelObject+0x650)[0xaaaadbceb7f4]
[nvidia-desktop:839642] [19] python3(+0x244a74)[0xaaaadbdf4a74]
[nvidia-desktop:839642] [20] python3(+0x218edc)[0xaaaadbdc8edc]
[nvidia-desktop:839642] [21] python3(_PyObject_Call+0x68)[0xaaaadbc2c0e8]
[nvidia-desktop:839642] [22] python3(_PyEval_EvalFrameDefault+0x137c)[0xaaaadbc16b0c]
[nvidia-desktop:839642] [23] python3(+0x117460)[0xaaaadbcc7460]
[nvidia-desktop:839642] [24] python3(_PyEval_EvalFrameDefault+0x5950)[0xaaaadbc1b0e0]
[nvidia-desktop:839642] [25] python3(+0x117460)[0xaaaadbcc7460]
[nvidia-desktop:839642] [26] python3(+0x7cc84)[0xaaaadbc2cc84]
[nvidia-desktop:839642] [27] python3(_PyObject_CallMethodIdObjArgs+0xe0)[0xaaaadbc2d054]
[nvidia-desktop:839642] [28] python3(PyImport_ImportModuleLevelObject+0x2c8)[0xaaaadbceb46c]
[nvidia-desktop:839642] [29] python3(_PyEval_EvalFrameDefault+0x1d68)[0xaaaadbc174f8]
[nvidia-desktop:839642] End of error message
Segmentation fault (core dumped)
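The backtrace passes from PyImport_ImportModuleLevelObject and PyModule_ExecDef into mpi4py's MPI extension module and then into libmpi.so.40, which suggests the crash happens while mpi4py.MPI is being imported rather than inside TensorRT-LLM itself. The following is a minimal sketch for checking that in isolation by importing the module in a child interpreter, so a segfault there does not take down the calling process. The helper name import_survives is hypothetical and not part of TensorRT-LLM or mpi4py.

```python
import subprocess
import sys


def import_survives(module_name, timeout=60):
    """Import module_name in a child interpreter.

    Returns (ok, returncode). On POSIX a negative returncode means the
    child was killed by that signal number (for example -11 for SIGSEGV).
    """
    proc = subprocess.run(
        [sys.executable, "-c", f"import {module_name}"],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode == 0, proc.returncode


if __name__ == "__main__":
    ok, code = import_survives("mpi4py.MPI")
    if ok:
        print("mpi4py.MPI imported cleanly in a fresh interpreter")
    elif code < 0:
        print(f"import of mpi4py.MPI was killed by signal {-code}")
    else:
        print(f"import of mpi4py.MPI failed with exit code {code}")
```

If the import alone is killed by signal 11, the problem is more likely in the mpi4py / system MPI installation than in the converted engine.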

Additional notes

None

When I changed runtime_rank = tensorrt_llm.mpi_rank() to runtime_rank = 0, the error changed to:

(python3_10) wanghaikuan@nvidia-desktop:~/code/TensorRT-LLM/examples/qwen$ python ../run.py --input_text "你好，请问你叫什么？" --max_output_len=50 --tokenizer_dir /home/wanghaikuan/.cache/huggingface/hub/models--Qwen--Qwen-7B-Chat-Int4/snapshots/5a9a1ed9203a3d4ccc62f24fbb9b7e3948ce8ec1 --engine_dir=./tmp/Qwen/7B/trt_engines/int4-gptq/1-gpu
[TensorRT-LLM] TensorRT-LLM version: 0.8.0.dev20240123, commit: b57221b764bc579cbb2490154916a871f620e2c4
[TensorRT-LLM][WARNING] Parameter version cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'version' not found
[TensorRT-LLM][WARNING] Parameter mlp_hidden_size cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'mlp_hidden_size' not found
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Parameter gather_context_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_context_logits' not found
[TensorRT-LLM][WARNING] Parameter gather_generation_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_generation_logits' not found
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
[TensorRT-LLM][INFO] Loaded engine size: 5565 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 5755, GPU 16564 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +14, now: CPU 5756, GPU 16578 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +5561, now: CPU 0, GPU 5561 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 5768, GPU 19147 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 5768, GPU 19147 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5561 (MiB)
Traceback (most recent call last):
  File "/home/wanghaikuan/code/TensorRT-LLM/examples/qwen/../run.py", line 498, in <module>
    main(args)
  File "/home/wanghaikuan/code/TensorRT-LLM/examples/qwen/../run.py", line 376, in main
    outputs = runner.generate(
  File "/home/wanghaikuan/anaconda3/envs/python3_10/lib/python3.10/site-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 301, in generate
    batch_input_ids = batch_input_ids.cuda()
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
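The RuntimeError is raised by the PyTorch call batch_input_ids.cuda(), not by the engine. A common cause of "no kernel image is available" is a PyTorch build that was not compiled for the GPU's compute capability. That is an assumption here, since the report does not state which PyTorch build or GPU is in use. A rough check, as a sketch: compare torch.cuda.get_device_capability() against torch.cuda.get_arch_list(). The helper arch_supported below is hypothetical and only a heuristic. It treats an sm_XY entry as usable when it has the same major version and a minor version no higher than the device's, and a compute_XY (PTX) entry as usable when it is not newer than the device.

```python
def arch_supported(capability, arch_list):
    """Heuristic: True if some compiled arch can plausibly run on the device.

    capability: (major, minor) tuple, e.g. from torch.cuda.get_device_capability()
    arch_list:  strings such as "sm_80" or "compute_80",
                e.g. from torch.cuda.get_arch_list()
    """
    major, minor = capability
    for arch in arch_list:
        kind, _, digits = arch.partition("_")
        if not digits.isdigit():
            continue  # skip entries this sketch does not understand
        a_major, a_minor = divmod(int(digits), 10)
        if kind == "sm" and a_major == major and a_minor <= minor:
            return True  # binary code for the same major version
        if kind == "compute" and (a_major, a_minor) <= (major, minor):
            return True  # PTX can be JIT-compiled for a newer device
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        archs = torch.cuda.get_arch_list()
        print("device capability:", cap)
        print("PyTorch compiled for:", archs)
        print("looks supported:", arch_supported(cap, archs))
    else:
        print("PyTorch with CUDA is not available in this environment")
```

If the check reports no match, installing a PyTorch build that targets the device's architecture would be the first thing to try before looking at the engine.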
