NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Llava multimodal example is giving a segfault #1709


buddhapuneeth commented 1 month ago

System Info

EC2 instance: g5.48xlarge
NVIDIA driver: 535.161.08
CUDA: 12.2
TensorRT-LLM commit: 5d8ca2faf74c494f220c8f71130340b513eea9a9
Torch: 2.3.0

Who can help?

@byshiue I am running into an issue with the https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/multimodal/README.md#llava-and-vila example, specifically with the LLaVA 7B model. I tried fp16 with tp=1 and tp=8, but the error remains the same. With tp=1:

python run.py \
    --max_new_tokens 10 \
    --hf_model_dir /tmp/hf_models/${MODEL_NAME} \
    --visual_engine_dir visual_engines/llava-1 \
    --llm_engine_dir /tmp/trt_engines/${MODEL_NAME}/fp16/8-gpu \
    --input_text "Question: which city is this? Answer:"

With tp=8 (the checkpoint split and engine build were of course also done with tp=8):

mpirun -n 8 --allow-run-as-root python run.py \
    --max_new_tokens 10 \
    --hf_model_dir /tmp/hf_models/${MODEL_NAME} \
    --visual_engine_dir visual_engines/llava-1 \
    --llm_engine_dir /tmp/trt_engines/${MODEL_NAME}/fp16/8-gpu \
    --input_text "Question: which city is this? Answer:"

Error:

[ip-172-31-86-214:58044] *** Process received signal ***
[ip-172-31-86-214:58044] Signal: Segmentation fault (11)
[ip-172-31-86-214:58044] Signal code: Address not mapped (1)
[ip-172-31-86-214:58044] Failing at address: 0x18
[06/01/2024-01:24:15] [TRT-LLM] [W] The paged KV cache in Python runtime is experimental. For performance and correctness, please, use C++ runtime.
[ip-172-31-86-214:58044] [ 0] /lib64/libpthread.so.0(+0x118e0)[0x7f09fc7538e0]
[ip-172-31-86-214:58044] [ 1] /opt/conda/lib/python3.10/site-packages/tensorrt_llm/libs/libth_common.so(_ZN12tensorrt_llm4thop14TorchAllocator6mallocEmb+0x88)[0x7f06f00b1048]
[ip-172-31-86-214:58044] [ 2] /opt/conda/lib/python3.10/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm6layers18DynamicDecodeLayerIfE14allocateBufferEv+0xd4)[0x7f07105abd14]
[ip-172-31-86-214:58044] [ 3] /opt/conda/lib/python3.10/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm6layers18DynamicDecodeLayerIfE10initializeEv+0x128)[0x7f07105aeca8]
[ip-172-31-86-214:58044] [ 4] /opt/conda/lib/python3.10/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm6layers18DynamicDecodeLayerIfEC1ERKNS_7runtime12DecodingModeERKNS0_13DecoderDomainEP11CUstream_stSt10shared_ptrINS_6common10IAllocatorEE+0xb1)[0x7f07105aeeb1]
[ip-172-31-86-214:58044] [ 5] /opt/conda/lib/python3.10/site-packages/tensorrt_llm/libs/libth_common.so(_ZN9torch_ext15FtDynamicDecodeIfEC2Emmmmii+0x270)[0x7f06f00908a0]
[ip-172-31-86-214:58044] [ 6] /opt/conda/lib/python3.10/site-packages/tensorrt_llm/libs/libth_common.so(_ZN9torch_ext15DynamicDecodeOp14createInstanceEv+0x10f)[0x7f06f007413f]
[ip-172-31-86-214:58044] [ 7] /opt/conda/lib/python3.10/site-packages/tensorrt_llm/libs/libth_common.so(_ZN9torch_ext15DynamicDecodeOpC1EllllllN3c1010ScalarTypeE+0x84)[0x7f06f0074204]
[ip-172-31-86-214:58044] [ 8] /opt/conda/lib/python3.10/site-packages/tensorrt_llm/libs/libth_common.so(_ZNSt17_Function_handlerIFvRSt6vectorIN3c106IValueESaIS2_EEEZN5torch6class_IN9torch_ext15DynamicDecodeOpEE12defineMethodIZNSB_3defIJllllllNS1_10ScalarTypeEEEERSB_NS7_6detail5typesIvJDpT_EEESsSt16initializer_listINS7_3argEEEUlNS1_14tagged_capsuleISA_EEllllllSE_E_EEPNS7_3jit8FunctionESsT_SsSN_EUlS5_E_E9_M_invokeERKSt9_Any_dataS5_+0xf8)[0x7f06f0091058]
[ip-172-31-86-214:58044] [ 9] /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so(+0xa0f34e)[0x7f09f2b6a34e]
[ip-172-31-86-214:58044] [10] /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so(+0xa0c8df)[0x7f09f2b678df]
[ip-172-31-86-214:58044] [11] /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so(+0xa0e929)[0x7f09f2b69929]
[ip-172-31-86-214:58044] [12] /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so(+0x47de04)[0x7f09f25d8e04]
[ip-172-31-86-214:58044] [13] python(+0x13fb27)[0x55c68d224b27]
[ip-172-31-86-214:58044] [14] python(_PyObject_MakeTpCall+0x26b)[0x55c68d21e42b]
[ip-172-31-86-214:58044] [15] python(+0x14baa0)[0x55c68d230aa0]
[ip-172-31-86-214:58044] [16] python(+0xa40d6)[0x55c68d1890d6]
[ip-172-31-86-214:58044] [17] python(_PyObject_MakeTpCall+0x26b)[0x55c68d21e42b]
[ip-172-31-86-214:58044] [18] python(_PyEval_EvalFrameDefault+0x5596)[0x55c68d21a386]
[ip-172-31-86-214:58044] [19] python(_PyFunction_Vectorcall+0x6f)[0x55c68d224f8f]
[ip-172-31-86-214:58044] [20] python(_PyObject_FastCallDictTstate+0x185)[0x55c68d21d985]
[ip-172-31-86-214:58044] [21] python(+0x14934b)[0x55c68d22e34b]
[ip-172-31-86-214:58044] [22] python(_PyObject_MakeTpCall+0x2bb)[0x55c68d21e47b]
[ip-172-31-86-214:58044] [23] python(_PyEval_EvalFrameDefault+0x5a5e)[0x55c68d21a84e]
[ip-172-31-86-214:58044] [24] python(+0x14b641)[0x55c68d230641]
[ip-172-31-86-214:58044] [25] python(_PyEval_EvalFrameDefault+0x4d0d)[0x55c68d219afd]
[ip-172-31-86-214:58044] [26] python(+0x14b641)[0x55c68d230641]
[ip-172-31-86-214:58044] [27] python(_PyEval_EvalFrameDefault+0x13d0)[0x55c68d2161c0]
[ip-172-31-86-214:58044] [28] python(_PyFunction_Vectorcall+0x6f)[0x55c68d224f8f]
[ip-172-31-86-214:58044] [29] python(_PyEval_EvalFrameDefault+0x735)[0x55c68d215525]
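
A Python-level traceback can make a segfault like this easier to localize; one option, using only the standard library and no changes to run.py, is to rerun under faulthandler:

# -X faulthandler makes CPython dump the Python stack on fatal signals
# (e.g. SIGSEGV), complementing the native frames printed above.
python -X faulthandler run.py \
    --max_new_tokens 10 \
    --hf_model_dir /tmp/hf_models/${MODEL_NAME} \
    --visual_engine_dir visual_engines/llava-1 \
    --llm_engine_dir /tmp/trt_engines/${MODEL_NAME}/fp16/8-gpu \
    --input_text "Question: which city is this? Answer:"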


Reproduction

Same steps as in https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/multimodal/README.md#llava-and-vila
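
For completeness, the flow in that README is roughly the following; the paths and flag values here are paraphrased assumptions, and the linked README remains the authoritative reference:

export MODEL_NAME="llava-1.5-7b-hf"

# 1. Convert the Hugging Face checkpoint to TensorRT-LLM format
#    (convert_checkpoint.py lives under examples/llama).
python ../llama/convert_checkpoint.py \
    --model_dir /tmp/hf_models/${MODEL_NAME} \
    --output_dir /tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
    --dtype float16

# 2. Build the decoder engine; 576 is LLaVA's visual token count
#    (assumed here, check the README for the exact values).
trtllm-build \
    --checkpoint_dir /tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
    --output_dir /tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu \
    --gemm_plugin float16 \
    --max_batch_size 1 \
    --max_input_len 2048 \
    --max_multimodal_len 576

# 3. Build the visual encoder engine.
python build_visual_engine.py \
    --model_type llava \
    --model_path /tmp/hf_models/${MODEL_NAME}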

Expected behavior

The run should produce an inference result.

Actual behavior

Segmentation fault.

Additional notes

n/a

nv-guomingz commented 1 month ago

Hi @buddhapuneeth, thanks for reporting this issue. We'll first try to reproduce it.

nv-guomingz commented 1 month ago

Hi @buddhapuneeth, I tried to reproduce your issue locally with TensorRT-LLM 0.11.0.dev2024061100 but could not.

My device is an H100, and below is my output:

python3 run.py --max_new_tokens=32 \
               --hf_model_dir ./llava-1.5-7b-hf/ \
               --llm_engine_dir ./engine_outputs \
               --visual_engine_dir visual_engines/llava-1.5-7b-hf \
               --input_text "Question: which city is this? Answer:"

[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024061100
[06/13/2024-02:40:48] [TRT-LLM] [I] Loading engine from visual_engines/llava-1.5-7b-hf/visual_encoder.engine
[06/13/2024-02:40:51] [TRT-LLM] [I] Creating session from engine visual_engines/llava-1.5-7b-hf/visual_encoder.engine
[06/13/2024-02:40:51] [TRT] [I] Loaded engine size: 600 MiB
[06/13/2024-02:40:51] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +49, now: CPU 0, GPU 644 (MiB)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set dtype to float16.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set gemm_plugin to float16.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to None.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set identity_plugin to None.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set layernorm_quantization_plugin to None.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to None.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set nccl_plugin to None.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set lookup_plugin to None.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set lora_plugin to None.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set weight_only_groupwise_quant_matmul_plugin to None.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set weight_only_quant_matmul_plugin to None.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set quantize_per_token_plugin to False.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set quantize_tensor_plugin to False.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set moe_plugin to auto.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set context_fmha to True.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set paged_kv_cache to True.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set remove_input_padding to True.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set reduce_fusion to False.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set multi_block_mode to False.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set enable_xqa to True.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set tokens_per_block to 64.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set multiple_profiles to False.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set paged_state to True.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set streamingllm to False.
[06/13/2024-02:42:03] [TRT] [I] Loaded engine size: 12859 MiB
[06/13/2024-02:42:04] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 13500 (MiB)
[06/13/2024-02:42:04] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 13500 (MiB)
[06/13/2024-02:42:04] [TRT-LLM] [W] The paged KV cache in Python runtime is experimental. For performance and correctness, please, use C++ runtime.
[06/13/2024-02:42:04] [TRT-LLM] [I] Load engine takes: 72.31985688209534 sec
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/usr/local/lib/python3.10/dist-packages/torch/nested/__init__.py:166: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
  return _nested.nested_tensor(
[06/13/2024-02:42:05] [TRT-LLM] [I] ---------------------------------------------------------
[06/13/2024-02:42:05] [TRT-LLM] [I]
[Q] Question: which city is this? Answer:
[06/13/2024-02:42:05] [TRT-LLM] [I]
[A]: ['Singapore']
[06/13/2024-02:42:05] [TRT-LLM] [I] Generated 1 tokens
[06/13/2024-02:42:05] [TRT-LLM] [I] ---------------------------------------------------------

Would you please try again with the latest code base?
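
For reference, one way to pick up that build, assuming the pre-release wheels are still published on NVIDIA's PyPI index, is:

# Install the pinned pre-release wheel, then confirm which runtime is
# actually in use before re-running the example.
pip3 install tensorrt_llm==0.11.0.dev2024061100 --pre --extra-index-url https://pypi.nvidia.com
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"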

github-actions[bot] commented 3 days ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.