NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0
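
For reference, a minimal sketch of the high-level Python API described above, based on the documented LLM-API quickstart (the model name is only a placeholder, and the exact import surface can vary between releases):

from tensorrt_llm import LLM, SamplingParams

# Build (or load a cached) TensorRT engine for the model, then run generation.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # placeholder model id
sampling = SamplingParams(max_tokens=32, temperature=0.8)

for output in llm.generate(["Hello, my name is"], sampling):
    print(output.outputs[0].text)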

Segmentation fault (11) on 1022dev+TRT 10.4.0 #2402

aliencaocao commented 5 days ago

System Info

TensorRT-LLM 0.15.0.dev2024102200 (1022dev) with TensorRT 10.4.0, PyTorch 2.5.1+cu124, running on a V100 GPU.

Who can help?

No response

Reproduction

Ran the following to convert the checkpoint and build the engines; everything went through without issue:

python3 ../opt/convert_checkpoint.py --model_type blip2 \
    --model_dir redacted/icon_caption \
    --output_dir redacted/trt_models/icon_caption \
    --dtype float16

trtllm-build \
    --checkpoint_dir redacted/trt_models/icon_caption \
    --output_dir redacted/trt_engines_v100/icon_caption \
    --gemm_plugin float16 \
    --max_beam_width 5 \
    --max_batch_size 20 \
    --max_seq_len 100 \
    --max_input_len 48 \
    --context_fmha disable \
    --multiple_profiles disable \
    --max_multimodal_len 640 \
    --opt_num_tokens 2000 \
    --workers 8 \
    --log_level verbose

python3 build_visual_engine.py --model_type blip2 \
    --model_path redacted/icon_caption \
    --output_dir redacted/trt_engines_v100/icon_caption/vision_encoder \
    --max_batch_size 20

It then segfaults when I run the test:

python3 run.py \
    --max_new_tokens 58 \
    --input_text "Question: which city is this? Answer:" \
    --hf_model_dir redacted/blip2_processor \
    --visual_engine_dir redacted/trt_engines_v100/icon_caption/vision_encoder \
    --llm_engine_dir redacted/trt_engines_v100/icon_caption \
    --temperature 0 \
    --num_beams 5 \
    --run_profiling \
    --profiling_iterations 50

Expected behavior

No segfault

Actual behavior

Segfault log:

[TensorRT-LLM] TensorRT-LLM version: 0.15.0.dev2024102200
[TensorRT-LLM][INFO] Engine version 0.15.0.dev2024102200 found in the config file, assuming engine(s) built by new builder API.
[11/01/2024-10:00:53] [TRT-LLM] [I] Loading engine from redacted/trt_engines_v100/icon_caption/vision_encoder/model.engine
[11/01/2024-10:00:55] [TRT-LLM] [I] Creating session from engine redacted/trt_engines_v100/icon_caption/vision_encoder/model.engine
[11/01/2024-10:00:55] [TRT] [I] Loaded engine size: 2098 MiB
[11/01/2024-10:00:58] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +163, now: CPU 0, GPU 2247 (MiB)
[11/01/2024-10:00:58] [TRT-LLM] [I] Running LLM with C++ runner
[TensorRT-LLM][INFO] Engine version 0.15.0.dev2024102200 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Engine version 0.15.0.dev2024102200 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 20
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 20
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 5
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 100
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (100) * 32
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 2000
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 48 = max_input_len (in trtllm-build args)
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.
[TensorRT-LLM][INFO] Loaded engine size: 5309 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 238.46 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 7550 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 22.54 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 62.39 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 15.77 GiB, available: 7.73 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 357
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 2
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 6.97 GiB for max tokens in paged KV cache (22848).
[TensorRT-LLM][INFO] Enable MPI KV cache transport.
[11/01/2024-10:01:05] [TRT-LLM] [I] Load engine takes: 6.743446588516235 sec
[11/01/2024-10:01:05] [TRT-LLM] [I] downloading image from url https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png
Expanding inputs for image tokens in BLIP-2 should be done in processing. Please follow instruction here (https://gist.github.com/zucchini-nlp/e9f20b054fa322f84ac9311d9ab67042) to update your BLIP-2 model. Using processors without these attributes in the config is deprecated and will throw an error in v4.47.
[ip-172-31-79-61:04136] *** Process received signal ***
[ip-172-31-79-61:04136] Signal: Segmentation fault (11)
[ip-172-31-79-61:04136] Signal code: Address not mapped (1)
[ip-172-31-79-61:04136] Failing at address: 0xc
[ip-172-31-79-61:04136] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x45320)[0x776bab445320]
[ip-172-31-79-61:04136] [ 1] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x97661)[0x77696ae97661]
[ip-172-31-79-61:04136] [ 2] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0xbe81d)[0x77696aebe81d]
[ip-172-31-79-61:04136] [ 3] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0xc7648)[0x77696aec7648]
[ip-172-31-79-61:04136] [ 4] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(_ZN12tensorrt_llm7plugins18GPTAttentionPlugin7enqueueEPKN8nvinfer116PluginTensorDescES5_PKPKvPKPvSA_P11CUstream_st+0x1a1)[0x77696aeaccc1]
[ip-172-31-79-61:04136] [ 5] /usr/local/lib/python3.12/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x113c92c)[0x776b2af3c92c]
[ip-172-31-79-61:04136] [ 6] /usr/local/lib/python3.12/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x10e6947)[0x776b2aee6947]
[ip-172-31-79-61:04136] [ 7] /usr/local/lib/python3.12/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x10e83e1)[0x776b2aee83e1]
[ip-172-31-79-61:04136] [ 8] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZNK12tensorrt_llm7runtime11TllmRuntime14executeContextEi+0x58)[0x7769e75f1828]
[ip-172-31-79-61:04136] [ 9] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager27TrtGptModelInflightBatching14executeContextEii+0x6b)[0x7769e78b840b]
[ip-172-31-79-61:04136] [10] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager27TrtGptModelInflightBatching11executeStepERKSt6vectorISt10shared_ptrINS0_10LlmRequestEESaIS5_EES9_i+0x41c)[0x7769e78cb15c]
[ip-172-31-79-61:04136] [11] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager27TrtGptModelInflightBatching12executeBatchERKNS0_17ScheduledRequestsE+0xde)[0x7769e78cb56e]
[ip-172-31-79-61:04136] [12] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager27TrtGptModelInflightBatching12forwardAsyncERKSt4listISt10shared_ptrINS0_10LlmRequestEESaIS5_EE+0x535)[0x7769e78cbb05]
[ip-172-31-79-61:04136] [13] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8Executor4Impl12forwardAsyncERSt4listISt10shared_ptrINS_13batch_manager10LlmRequestEESaIS7_EE+0x195)[0x7769e78fdd65]
[ip-172-31-79-61:04136] [14] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8Executor4Impl13executionLoopEv+0x5c3)[0x7769e79034f3]
[ip-172-31-79-61:04136] [15] /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch.so(+0x145c0)[0x776ba78565c0]
[ip-172-31-79-61:04136] [16] /lib/x86_64-linux-gnu/libc.so.6(+0x9ca94)[0x776bab49ca94]
[ip-172-31-79-61:04136] [17] /lib/x86_64-linux-gnu/libc.so.6(+0x129c3c)[0x776bab529c3c]
[ip-172-31-79-61:04136] *** End of error message ***
Segmentation fault (core dumped)
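
(An aside for triage, not from the original report: the KV-cache numbers in the log are internally consistent, so the engine configuration itself appears to have loaded sanely. A quick arithmetic check against the log lines above:)

import math

tokens_per_block = 64   # "Number of tokens per block: 64."
primary_blocks = 357    # "Number of blocks in KV cache primary pool: 357"
max_seq_len = 100       # "TRTGptModel maxSequenceLen: 100"

# 357 blocks * 64 tokens/block = 22848, matching "max tokens in paged KV cache (22848)"
assert primary_blocks * tokens_per_block == 22848
# ceil(100 / 64) = 2, matching "Max KV cache pages per sequence: 2"
assert math.ceil(max_seq_len / tokens_per_block) == 2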

Additional notes

The same model works fine when run with HF Transformers.
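
One debugging sketch (editorial, not verified on this setup): frame [4] of the native backtrace above already points at tensorrt_llm::plugins::GPTAttentionPlugin::enqueue. To see what the Python threads were doing at the same moment, faulthandler can be enabled at the top of run.py so every Python stack is dumped when SIGSEGV arrives:

import faulthandler, sys

# Dump the Python traceback of every thread on a fatal signal (e.g. SIGSEGV),
# complementing the native backtrace printed by the Open MPI signal handler.
faulthandler.enable(file=sys.stderr, all_threads=True)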

hello-11 commented 2 days ago

@aliencaocao Thanks for your interest in TensorRT-LLM, but unfortunately we no longer support V100.

aliencaocao commented 2 days ago

The version I am using officially supports V100, and it is also the latest stable release. Which is the last version that is supposed to work on V100?

hello-11 commented 2 days ago

@aliencaocao May I ask what you changed for the build_wheel script?

aliencaocao commented 2 days ago

My build commands are in the post above; I did not change the script itself.
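
(Editorial aside: V100 is compute capability 7.0, i.e. sm_70, which can be confirmed on the machine in question; whether a given TensorRT-LLM release still ships kernels for sm_70 is the crux of the support question above.)

import torch

# V100 reports (7, 0); Ampere and newer report (8, 0), (9, 0), etc.
print(torch.cuda.get_device_capability(0))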