Open · buddhapuneeth opened this issue 1 month ago
Hi @buddhapuneeth, thanks for reporting this issue. We'll try to reproduce it first.
Hi @buddhapuneeth, I tried to reproduce your issue locally with TensorRT-LLM dev2024061100, but I could not reproduce it.
My device is an H100, and below is my output:
python3 run.py --max_new_tokens=32 \
--hf_model_dir ./llava-1.5-7b-hf/ \
--llm_engine_dir ./engine_outputs \
--visual_engine_dir visual_engines/llava-1.5-7b-hf \
--input_text "Question: which city is this? Answer:"
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024061100
[06/13/2024-02:40:48] [TRT-LLM] [I] Loading engine from visual_engines/llava-1.5-7b-hf/visual_encoder.engine
[06/13/2024-02:40:51] [TRT-LLM] [I] Creating session from engine visual_engines/llava-1.5-7b-hf/visual_encoder.engine
[06/13/2024-02:40:51] [TRT] [I] Loaded engine size: 600 MiB
[06/13/2024-02:40:51] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +49, now: CPU 0, GPU 644 (MiB)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set dtype to float16.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set gemm_plugin to float16.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to None.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set identity_plugin to None.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set layernorm_quantization_plugin to None.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to None.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set nccl_plugin to None.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set lookup_plugin to None.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set lora_plugin to None.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set weight_only_groupwise_quant_matmul_plugin to None.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set weight_only_quant_matmul_plugin to None.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set quantize_per_token_plugin to False.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set quantize_tensor_plugin to False.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set moe_plugin to auto.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set context_fmha to True.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set paged_kv_cache to True.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set remove_input_padding to True.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set reduce_fusion to False.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set multi_block_mode to False.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set enable_xqa to True.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set tokens_per_block to 64.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set multiple_profiles to False.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set paged_state to True.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set streamingllm to False.
[06/13/2024-02:42:03] [TRT] [I] Loaded engine size: 12859 MiB
[06/13/2024-02:42:04] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 13500 (MiB)
[06/13/2024-02:42:04] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 13500 (MiB)
[06/13/2024-02:42:04] [TRT-LLM] [W] The paged KV cache in Python runtime is experimental. For performance and correctness, please, use C++ runtime.
[06/13/2024-02:42:04] [TRT-LLM] [I] Load engine takes: 72.31985688209534 sec
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/usr/local/lib/python3.10/dist-packages/torch/nested/__init__.py:166: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
return _nested.nested_tensor(
[06/13/2024-02:42:05] [TRT-LLM] [I] ---------------------------------------------------------
[06/13/2024-02:42:05] [TRT-LLM] [I]
[Q] Question: which city is this? Answer:
[06/13/2024-02:42:05] [TRT-LLM] [I]
[A]: ['Singapore']
[06/13/2024-02:42:05] [TRT-LLM] [I] Generated 1 tokens
[06/13/2024-02:42:05] [TRT-LLM] [I] ---------------------------------------------------------
Would you please try again with the latest code base?
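For reference, the usual way to pick up the latest dev wheel (assuming a pip-based install; the index URL below is the one from the installation docs, while a container-based setup would instead pull the latest NGC image or rebuild from source):

# Upgrade to the newest pre-release (dev) TensorRT-LLM wheel.
pip3 install --upgrade --pre tensorrt_llm --extra-index-url https://pypi.nvidia.com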
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
System Info
EC2 instance: G5.48xl
NVIDIA driver: 535.161.08
CUDA: 12.2
Commit: 5d8ca2faf74c494f220c8f71130340b513eea9a9
Torch: 2.3.0
Who can help?
@byshiue I am running into this issue with the https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/multimodal/README.md#llava-and-vila example, specifically the LLaVA 7B model. I tried fp16 with tp=1 and with tp=8, but the error remains the same. With tp=1:
With tp=8 (checkpoint split and engine creation were, of course, also done with tp=8):
Error:
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Same steps as in https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/multimodal/README.md#llava-and-vila
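For completeness, those README steps are roughly the following sketch. Flag names follow the 0.11 dev README and may differ in newer versions; paths and values here (e.g. --max_multimodal_len 576 for LLaVA's image tokens) are illustrative, not copied from my exact run:

# 1. Convert the HF LLaVA checkpoint (the LLM part is LLaMA-based); add --tp_size 8 for the tp=8 case.
python3 ../llama/convert_checkpoint.py --model_dir llava-1.5-7b-hf \
    --output_dir ./ckpt --dtype float16

# 2. Build the LLM engine; --max_multimodal_len reserves room for the image tokens.
trtllm-build --checkpoint_dir ./ckpt --output_dir ./engine_outputs \
    --gemm_plugin float16 --max_batch_size 1 \
    --max_input_len 2048 --max_output_len 512 --max_multimodal_len 576

# 3. Build the vision-encoder engine.
python3 build_visual_engine.py --model_type llava --model_path llava-1.5-7b-hf

# 4. Run inference (as in the output pasted above).
python3 run.py --max_new_tokens=32 \
    --hf_model_dir ./llava-1.5-7b-hf \
    --llm_engine_dir ./engine_outputs \
    --visual_engine_dir visual_engines/llava-1.5-7b-hf \
    --input_text "Question: which city is this? Answer:"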
Expected behavior
The example should return an inference result.
Actual behavior
Seg fault
Additional notes
n/a
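One generic way to localize a seg fault like this is CPython's built-in fault handler, which dumps a Python-level traceback on SIGSEGV (standard Python, nothing TensorRT-LLM specific):

# -X faulthandler prints the Python stack at the moment of the crash,
# showing whether it dies loading the visual engine, loading the LLM
# engine, or during generation. For the native frames, run the same
# command under gdb and take a backtrace (bt) after the crash.
python3 -X faulthandler run.py --max_new_tokens=32 \
    --hf_model_dir ./llava-1.5-7b-hf \
    --llm_engine_dir ./engine_outputs \
    --visual_engine_dir visual_engines/llava-1.5-7b-hf \
    --input_text "Question: which city is this? Answer:"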