Open · buddhapuneeth opened this issue 1 month ago
Hi @buddhapuneeth, thanks for reporting this issue. We'll try to reproduce it first.
Hi @buddhapuneeth, I tried to reproduce your issue locally with TensorRT-LLM dev2024061100, but I could not reproduce it.
My device is an H100, and below is my output:
python3 run.py --max_new_tokens=32 \
--hf_model_dir ./llava-1.5-7b-hf/ \
--llm_engine_dir ./engine_outputs \
--visual_engine_dir visual_engines/llava-1.5-7b-hf \
--input_text "Question: which city is this? Answer:"
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024061100
[06/13/2024-02:40:48] [TRT-LLM] [I] Loading engine from visual_engines/llava-1.5-7b-hf/visual_encoder.engine
[06/13/2024-02:40:51] [TRT-LLM] [I] Creating session from engine visual_engines/llava-1.5-7b-hf/visual_encoder.engine
[06/13/2024-02:40:51] [TRT] [I] Loaded engine size: 600 MiB
[06/13/2024-02:40:51] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +49, now: CPU 0, GPU 644 (MiB)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set dtype to float16.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set gemm_plugin to float16.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to None.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set identity_plugin to None.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set layernorm_quantization_plugin to None.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to None.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set nccl_plugin to None.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set lookup_plugin to None.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set lora_plugin to None.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set weight_only_groupwise_quant_matmul_plugin to None.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set weight_only_quant_matmul_plugin to None.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set quantize_per_token_plugin to False.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set quantize_tensor_plugin to False.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set moe_plugin to auto.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set context_fmha to True.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set paged_kv_cache to True.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set remove_input_padding to True.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set reduce_fusion to False.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set multi_block_mode to False.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set enable_xqa to True.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set tokens_per_block to 64.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set multiple_profiles to False.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set paged_state to True.
[06/13/2024-02:42:03] [TRT-LLM] [I] Set streamingllm to False.
[06/13/2024-02:42:03] [TRT] [I] Loaded engine size: 12859 MiB
[06/13/2024-02:42:04] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 13500 (MiB)
[06/13/2024-02:42:04] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 13500 (MiB)
[06/13/2024-02:42:04] [TRT-LLM] [W] The paged KV cache in Python runtime is experimental. For performance and correctness, please, use C++ runtime.
[06/13/2024-02:42:04] [TRT-LLM] [I] Load engine takes: 72.31985688209534 sec
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/usr/local/lib/python3.10/dist-packages/torch/nested/__init__.py:166: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
return _nested.nested_tensor(
[06/13/2024-02:42:05] [TRT-LLM] [I] ---------------------------------------------------------
[06/13/2024-02:42:05] [TRT-LLM] [I]
[Q] Question: which city is this? Answer:
[06/13/2024-02:42:05] [TRT-LLM] [I]
[A]: ['Singapore']
[06/13/2024-02:42:05] [TRT-LLM] [I] Generated 1 tokens
[06/13/2024-02:42:05] [TRT-LLM] [I] ---------------------------------------------------------
Would you please try again with the latest code base?
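For reference, the usual way to pick up the latest dev wheel (assuming a pip-based install; the index URL below is the one from the installation docs, while a container-based setup would instead pull the latest NGC image or rebuild from source):

# Upgrade to the newest pre-release (dev) TensorRT-LLM wheel.
pip3 install --upgrade --pre tensorrt_llm --extra-index-url https://pypi.nvidia.com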
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
System Info
EC2 instance: G5.48xl
NVIDIA driver: 535.161.08
CUDA: 12.2
Commit: 5d8ca2faf74c494f220c8f71130340b513eea9a9
Torch: 2.3.0
Who can help?
@byshiue I am running into this issue with the https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/multimodal/README.md#llava-and-vila example, specifically the LLaVA 7B model. I tried fp16 with tp=1 and with tp=8, but the error remains the same. With tp=1:
With tp=8 (checkpoint split and engine creation were, of course, also done with tp=8):
Error:
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Same steps as in https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/multimodal/README.md#llava-and-vila
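For completeness, those README steps are roughly the following sketch. Flag names follow the 0.11 dev README and may differ in newer versions; paths and values here (e.g. --max_multimodal_len 576 for LLaVA's image tokens) are illustrative, not copied from my exact run:

# 1. Convert the HF LLaVA checkpoint (the LLM part is LLaMA-based); add --tp_size 8 for the tp=8 case.
python3 ../llama/convert_checkpoint.py --model_dir llava-1.5-7b-hf \
    --output_dir ./ckpt --dtype float16

# 2. Build the LLM engine; --max_multimodal_len reserves room for the image tokens.
trtllm-build --checkpoint_dir ./ckpt --output_dir ./engine_outputs \
    --gemm_plugin float16 --max_batch_size 1 \
    --max_input_len 2048 --max_output_len 512 --max_multimodal_len 576

# 3. Build the vision-encoder engine.
python3 build_visual_engine.py --model_type llava --model_path llava-1.5-7b-hf

# 4. Run inference (as in the output pasted above).
python3 run.py --max_new_tokens=32 \
    --hf_model_dir ./llava-1.5-7b-hf \
    --llm_engine_dir ./engine_outputs \
    --visual_engine_dir visual_engines/llava-1.5-7b-hf \
    --input_text "Question: which city is this? Answer:"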
Expected behavior
The example should return an inference result.
Actual behavior
Seg fault
Additional notes
n/a
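One generic way to localize a seg fault like this is CPython's built-in fault handler, which dumps a Python-level traceback on SIGSEGV (standard Python, nothing TensorRT-LLM specific):

# -X faulthandler prints the Python stack at the moment of the crash,
# showing whether it dies loading the visual engine, loading the LLM
# engine, or during generation. For the native frames, run the same
# command under gdb and take a backtrace (bt) after the crash.
python3 -X faulthandler run.py --max_new_tokens=32 \
    --hf_model_dir ./llava-1.5-7b-hf \
    --llm_engine_dir ./engine_outputs \
    --visual_engine_dir visual_engines/llava-1.5-7b-hf \
    --input_text "Question: which city is this? Answer:"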