NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

[TensorRT-LLm Error][fpA_intB Runner] Failed to run cutlass fpA_intB gemm. Error: Error Internal #456

Closed: thendwk closed this issue 10 months ago

thendwk commented 11 months ago

I encountered an issue when using mpirun. Let me describe how I used it.

First, I used the original command from the example, and it worked successfully:

mpirun -n 2 --allow-run-as-root \
    python run.py --tokenizer_dir /docker_storage/CodeFuse-CodeLlama-34B-4bits \
    --engine_dir /docker_storage/trtModels/a10/2-gpu \
    --max_output_len=512 --input_text "write a quick sort with python"

Then I wanted to deploy it as a server, so I started the program as a web server with Flask. In the main function, I extracted the following code: [screenshot of the main function]

Next, I started the program with the following command:

nohup mpirun -n 2 --allow-run-as-root \
    python api_multi.py --tokenizer_dir /docker_storage/CodeFuse-CodeLlama-34B-4bits \
    --engine_dir /docker_storage/trtModels/a10/2-gpu &

[screenshot of the startup output]

In the generate function, I extracted the following code: [screenshot of the generate function]

Finally, the rank 0 web server receives the user's request and runs the generate function; meanwhile it sends an asynchronous request to the rank 2 web server using threads. Unfortunately, the rank 2 web server throws the following exception:

terminate called after throwing an instance of 'std::runtime_error'
  what():  [TensorRT-LLm Error][fpA_intB Runner] Failed to run cutlass fpA_intB gemm. Error: Error Internal
[iZt4nb2wpoxhchvvv2wdahZ:01651] *** Process received signal ***
[iZt4nb2wpoxhchvvv2wdahZ:01651] Signal: Aborted (6)
[iZt4nb2wpoxhchvvv2wdahZ:01651] Signal code: (-6)
[iZt4nb2wpoxhchvvv2wdahZ:01651] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f5fac31f520]
[iZt4nb2wpoxhchvvv2wdahZ:01651] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f5fac373a7c]
[iZt4nb2wpoxhchvvv2wdahZ:01651] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f5fac31f476]
[iZt4nb2wpoxhchvvv2wdahZ:01651] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f5fac3057f3]
[iZt4nb2wpoxhchvvv2wdahZ:01651] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e)[0x7f5f09898b9e]
[iZt4nb2wpoxhchvvv2wdahZ:01651] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x7f5f098a420c]
[iZt4nb2wpoxhchvvv2wdahZ:01651] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xad1e9)[0x7f5f098a31e9]
[iZt4nb2wpoxhchvvv2wdahZ:01651] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x99)[0x7f5f098a3959]
[iZt4nb2wpoxhchvvv2wdahZ:01651] [ 8] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(+0x16884)[0x7f5fab48b884]
[iZt4nb2wpoxhchvvv2wdahZ:01651] [ 9] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Resume+0x12d)[0x7f5fab48c2dd]
[iZt4nb2wpoxhchvvv2wdahZ:01651] [10] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x104ede)[0x7f5dc3b2eede]
[iZt4nb2wpoxhchvvv2wdahZ:01651] [11] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(_ZN12tensorrt_llm7plugins36WeightOnlyGroupwiseQuantMatmulPlugin7enqueueEPKN8nvinfer116PluginTensorDescES5_PKPKvPKPvSA_P11CUstream_st+0x279)[0x7f5dc3b058f9]
[iZt4nb2wpoxhchvvv2wdahZ:01651] [12] /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10aefc9)[0x7f5ef7739fc9]
[iZt4nb2wpoxhchvvv2wdahZ:01651] [13] /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1071e04)[0x7f5ef76fce04]
[iZt4nb2wpoxhchvvv2wdahZ:01651] [14] /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10739a0)[0x7f5ef76fe9a0]
[iZt4nb2wpoxhchvvv2wdahZ:01651] [15] /usr/local/lib/python3.10/dist-packages/tensorrt/tensorrt.so(+0x9dc60)[0x7f5f0609dc60]
[iZt4nb2wpoxhchvvv2wdahZ:01651] [16] /usr/local/lib/python3.10/dist-packages/tensorrt/tensorrt.so(+0x42ea3)[0x7f5f06042ea3]
[iZt4nb2wpoxhchvvv2wdahZ:01651] [17] python(+0x15fe0e)[0x555ba3746e0e]
[iZt4nb2wpoxhchvvv2wdahZ:01651] [18] python(_PyObject_MakeTpCall+0x25b)[0x555ba373d5eb]
[iZt4nb2wpoxhchvvv2wdahZ:01651] [19] python(+0x16e7bb)[0x555ba37557bb]
[iZt4nb2wpoxhchvvv2wdahZ:01651] [20] python(_PyEval_EvalFrameDefault+0x6152)[0x555ba37358a2]
[iZt4nb2wpoxhchvvv2wdahZ:01651] [21] python(_PyFunction_Vectorcall+0x7c)[0x555ba374770c]
[iZt4nb2wpoxhchvvv2wdahZ:01651] [22] python(_PyEval_EvalFrameDefault+0x802)[0x555ba372ff52]
[iZt4nb2wpoxhchvvv2wdahZ:01651] [23] python(_PyFunction_Vectorcall+0x7c)[0x555ba374770c]
[iZt4nb2wpoxhchvvv2wdahZ:01651] [24] python(_PyEval_EvalFrameDefault+0x802)[0x555ba372ff52]
[iZt4nb2wpoxhchvvv2wdahZ:01651] [25] python(_PyFunction_Vectorcall+0x7c)[0x555ba374770c]
[iZt4nb2wpoxhchvvv2wdahZ:01651] [26] python(_PyEval_EvalFrameDefault+0x802)[0x555ba372ff52]
[iZt4nb2wpoxhchvvv2wdahZ:01651] [27] python(_PyFunction_Vectorcall+0x7c)[0x555ba374770c]
[iZt4nb2wpoxhchvvv2wdahZ:01651] [28] python(PyObject_Call+0x122)[0x555ba3756192]
[iZt4nb2wpoxhchvvv2wdahZ:01651] [29] python(_PyEval_EvalFrameDefault+0x2b71)[0x555ba37322c1]
[iZt4nb2wpoxhchvvv2wdahZ:01651] *** End of error message ***
Exception in thread Thread-1 (make_request):
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 714, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 466, in _make_request

Ha, it's a bit complicated; I'm not sure if I'm using this correctly.
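Since the screenshots don't carry over as text, here is a rough sketch of the structure being described. All names (the api_multi.py internals, ports, helper logic) are hypothetical stand-ins, not the actual code: each MPI rank runs the same script, builds its own tokenizer and decoder, and serves HTTP on its own port, with rank 0 acting as the user-facing server.

# Rough sketch of an api_multi.py-style server (hypothetical names; not the
# code from the screenshots). Every rank exposes /code_generation; rank 0 is
# the entry point and must also make sure the other ranks decode the same
# request, otherwise only one process reaches the engine.
import argparse

import tensorrt_llm
from flask import Flask, request
from transformers import AutoTokenizer

app = Flask(__name__)
runtime_rank = tensorrt_llm.mpi_rank()   # 0 or 1 under `mpirun -n 2`
BASE_PORT = 8000                          # assumed: rank r listens on BASE_PORT + r
tokenizer = None                          # initialized in main()
decoder = None                            # initialized in main()

def generate(raw_prompt: bytes) -> str:
    prompt = raw_prompt.decode()
    # The real version tokenizes `prompt` and calls decoder.decode() on this
    # rank's engine, as in examples/run.py; stubbed here so the sketch runs.
    return f"[rank {runtime_rank}] output for: {prompt!r}"

@app.route("/code_generation", methods=["POST"])
def code_generation():
    return generate(request.data)

def main():
    global tokenizer, decoder
    parser = argparse.ArgumentParser()
    parser.add_argument("--tokenizer_dir")
    parser.add_argument("--engine_dir")
    args = parser.parse_args()
    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_dir)
    # decoder = GenerationSession(...) built from args.engine_dir, per run.py
    app.run(host="0.0.0.0", port=BASE_PORT + runtime_rank)

if __name__ == "__main__":
    main()

One HTTP server per rank keeps the MPI processes independent, but it also means rank 0 must explicitly fan the request out; if the other rank never receives it, only one process enters decode(), which is one common way for a tensor-parallel engine to fail.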

byshiue commented 11 months ago

Can you run your server with CUDA_LAUNCH_BLOCKING=1 again? For example,

CUDA_LAUNCH_BLOCKING=1 nohup mpirun -n 2 --allow-run-as-root \
    python api_multi.py --tokenizer_dir /docker_storage/CodeFuse-CodeLlama-34B-4bits \
    --engine_dir /docker_storage/trtModels/a10/2-gpu

and then send a request to the server.

thendwk commented 10 months ago

Can you run your server with CUDA_LAUNCH_BLOCKING=1 again? For example,

CUDA_LAUNCH_BLOCKING=1 nohup mpirun -n 2 --allow-run-as-root \
    python api_multi.py --tokenizer_dir /docker_storage/CodeFuse-CodeLlama-34B-4bits \
    --engine_dir /docker_storage/trtModels/a10/2-gpu

and then send a request to the server.

Thanks for your reply. I tried the following command:

CUDA_LAUNCH_BLOCKING=1 mpirun -n 2 --allow-run-as-root python api_multi.py --tokenizer_dir /docker_storage/CodeFuse-CodeLlama-34B-4bits --engine_dir /docker_storage/trtModels/a10/2-gpu

The rank 2 web server threw a new exception when processing the generation:

Traceback (most recent call last):
  File "/deploy/examples/llama/api_multi.py", line 46, in code_generation
    result = generate(request.data)
  File "/deploy/examples/llama/api_multi.py", line 97, in generate
    output_gen_ids = decoder.decode(input_ids,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 514, in wrapper
    ret = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 1887, in decode
    return self.decode_regular(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 1659, in decode_regular
    should_stop, next_step_buffer, tasks, context_lengths, host_context_lengths, attention_mask, logits = self.handle_per_step(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 1452, in handle_per_step
    raise RuntimeError('Executing TRT engine failed!')
RuntimeError: Executing TRT engine failed!
[11/24/2023-03:37:09] [TRT] [E] 1: [runner.cpp::executeMyelinGraph::681] Error Code 1: Myelin ([exec_instruction.cpp:exec:574] CUDA error 700 launching __myl_bb1_1_GatCasMulMeaAddSqrDivMulCasMul kernel.)

byshiue commented 10 months ago

I don't see helpful info in the error message. Could you prepare steps to reproduce it? You could fork the repo and push your changes to it to help reproduce the issue.

thendwk commented 10 months ago

@byshiue Based on your suggestion, I added os.environ['CUDA_LAUNCH_BLOCKING'] = '1' to api_multi.py and started the program again with the same command:

nohup mpirun -n 2 --allow-run-as-root \
    python api_multi.py --tokenizer_dir /docker_storage/CodeFuse-CodeLlama-34B-4bits \
    --engine_dir /docker_storage/trtModels/a10/2-gpu &

[screenshot of the startup output]

Then I sent a generation request to the rank 0 server; the rank 0 server processed the generation and sent a generation request to the rank 2 server asynchronously, and the rank 2 server threw the following exception: [screenshot of the exception]
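One caveat worth noting: CUDA_LAUNCH_BLOCKING is typically read when the CUDA context is created, so setting it inside the script only has an effect if it happens before any import that initializes CUDA. A minimal sketch of the safer ordering (assuming it is placed at the very top of api_multi.py):

# Set CUDA_LAUNCH_BLOCKING before importing anything that may initialize CUDA;
# if the context already exists, the variable may be ignored.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import tensorrt_llm  # imported only after the variable is set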

byshiue commented 10 months ago

I still don't see a helpful message, so I think the full reproduction steps are required.

thendwk commented 10 months ago

I still don't see a helpful message, so I think the full reproduction steps are required.

code_generation() is the HTTP handler used for generation. main() is the initialization method that sets up the tokenizer and decoder on each rank. async_request_runtime_rank() is used by rank 0 to send asynchronous requests to the other ranks.

Starting command:

mpirun -n 2 --allow-run-as-root \
    python api_multi.py --tokenizer_dir /docker_storage/CodeFuse-CodeLlama-34B-4bits \
    --engine_dir /docker_storage/trtModels/a10/2-gpu

After starting, send a request to the rank 0 server via the HTTP endpoint '/code_generation'.
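For concreteness, a minimal sketch of what async_request_runtime_rank() could look like under these assumptions (each rank's Flask server listening on localhost at a port derived from its rank; all names and ports are illustrative, not the actual implementation):

import threading

import requests  # plain HTTP client; any async mechanism would do

WORLD_SIZE = 2    # matches `mpirun -n 2`
BASE_PORT = 8000  # assumed: rank r serves on BASE_PORT + r

def async_request_runtime_rank(payload: bytes) -> None:
    """From rank 0, forward the same generation request to every other rank
    so that all ranks call decoder.decode() for the same input."""
    def _post(rank: int) -> None:
        url = f"http://127.0.0.1:{BASE_PORT + rank}/code_generation"
        try:
            requests.post(url, data=payload, timeout=600)
        except requests.RequestException as exc:
            print(f"forwarding to rank {rank} failed: {exc}", flush=True)

    for rank in range(1, WORLD_SIZE):
        threading.Thread(target=_post, args=(rank,), daemon=True).start()

The thread-per-rank approach simply keeps rank 0's HTTP handler from blocking on the forwarded calls; the essential point is only that every rank receives the same request.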

byshiue commented 10 months ago

Could you print some debug messages in generation.py to make sure the two processes enter it and get the correct inputs (on the specific GPU)?
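For example, something along these lines inside decode() in generation.py (illustrative only; input_ids is the tensor already in scope at that point, and exact placement and attribute names may differ between versions):

# Illustrative debug prints to add inside generation.py where input_ids is in scope.
import torch
import tensorrt_llm

rank = tensorrt_llm.mpi_rank()
print(f"[rank {rank}] cuda_device={torch.cuda.current_device()} "
      f"input_ids shape={tuple(input_ids.shape)} "
      f"dtype={input_ids.dtype} device={input_ids.device}",
      flush=True)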

thendwk commented 10 months ago

Could you print some debug messages in generation.py to make sure the two processes enter it and get the correct inputs (on the specific GPU)?

Thanks a lot! After I printed some debug messages, I found the problem, and now it works successfully! Thanks again!

byshiue commented 10 months ago

Great. Thank you for the update. Closing this bug. Feel free to reopen it if needed.

c6du commented 3 months ago

I still don't see a helpful message, so I think the full reproduction steps are required.

code_generation() is the HTTP handler used for generation. main() is the initialization method that sets up the tokenizer and decoder on each rank. async_request_runtime_rank() is used by rank 0 to send asynchronous requests to the other ranks.

Starting command:

mpirun -n 2 --allow-run-as-root \
    python api_multi.py --tokenizer_dir /docker_storage/CodeFuse-CodeLlama-34B-4bits \
    --engine_dir /docker_storage/trtModels/a10/2-gpu

After starting, send a request to the rank 0 server via the HTTP endpoint '/code_generation'.

Hi, I'm trying to do the same thing and I'm stuck at sending requests to the other ranks. Would you mind giving a toy example of how you implemented the async_request_runtime_rank() function? Thanks.