TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
However, I only have A30 GPUs available for serving. A single A30 does not have enough memory to hold BLIP2-T5 XXL, so I have to use two A30 GPUs and serve the model with tensor parallelism. When I use the same script (https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/multimodal/run.py) to run the model, it hangs. Any ideas on how to resolve this?
```bash
mpirun -n 2 python run.py \
    --blip2_encoder \
    --max_new_tokens 30 \
    --input_text "Question: which city is this? Answer:" \
    --hf_model_dir tmp/hf_models/${MODEL_NAME} \
    --visual_engine_dir visual_engines/${MODEL_NAME} \
    --llm_engine_dir trt_engines/${MODEL_NAME}/2-gpu/bfloat16/tp2
```
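One common cause of hangs under `mpirun` is every rank ending up on the same GPU. A minimal sketch to check the rank-to-GPU mapping, assuming Open MPI (which exports `OMPI_COMM_WORLD_RANK` to each rank); the helper names `mpi_rank` and `device_for_rank` are illustrative, not part of TensorRT-LLM:

```python
import os

def mpi_rank():
    """Rank as exported by Open MPI's mpirun (0 when not launched via mpirun)."""
    return int(os.environ.get("OMPI_COMM_WORLD_RANK", "0"))

def device_for_rank(rank, gpus_per_node=2):
    """Round-robin mapping of MPI ranks onto local GPU indices."""
    return rank % gpus_per_node

if __name__ == "__main__":
    rank = mpi_rank()
    # With -n 2 on a 2-GPU node, rank 0 should report cuda:0 and rank 1 cuda:1.
    print(f"rank {rank} -> cuda:{device_for_rank(rank)}")
```

If both ranks print the same device index, pinning each rank to its own GPU before loading the engines (for example via `CUDA_VISIBLE_DEVICES` per rank) would be the first thing to try.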
Who can help?
No response
Information
[x] The official example scripts
[ ] My own modified scripts
Tasks
[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)
Reproduction
Hangs when running:

```bash
mpirun --allow-run-as-root -np 2 python run.py \
    --blip2_encoder \
    --max_new_tokens 30 \
    --input_text "Question: which city is this? Answer:" \
    --hf_model_dir tmp/hf_models/${MODEL_NAME} \
    --visual_engine_dir visual_engines/${MODEL_NAME} \
    --llm_engine_dir trt_engines/${MODEL_NAME}/2-gpu/bfloat16/tp2
```

but works when running:

```bash
python run.py \
    --blip2_encoder \
    --max_new_tokens 30 \
    --input_text "Question: which city is this? Answer:" \
    --hf_model_dir tmp/hf_models/${MODEL_NAME} \
    --visual_engine_dir visual_engines/${MODEL_NAME} \
    --llm_engine_dir trt_engines/${MODEL_NAME}/1-gpu/bfloat16/tp1
```
Expected behavior
Finishes, as in the single-GPU run:

```bash
python run.py \
    --blip2_encoder \
    --max_new_tokens 30 \
    --input_text "Question: which city is this? Answer:" \
    --hf_model_dir tmp/hf_models/${MODEL_NAME} \
    --visual_engine_dir visual_engines/${MODEL_NAME} \
    --llm_engine_dir trt_engines/${MODEL_NAME}/1-gpu/bfloat16/tp1
```
Actual behavior
Hangs when running:

```bash
mpirun --allow-run-as-root -np 2 python run.py \
    --blip2_encoder \
    --max_new_tokens 30 \
    --input_text "Question: which city is this? Answer:" \
    --hf_model_dir tmp/hf_models/${MODEL_NAME} \
    --visual_engine_dir visual_engines/${MODEL_NAME} \
    --llm_engine_dir trt_engines/${MODEL_NAME}/2-gpu/bfloat16/tp2
```
Additional notes
Hangs when running:

```bash
mpirun --allow-run-as-root -np 2 python run.py \
    --blip2_encoder \
    --max_new_tokens 30 \
    --input_text "Question: which city is this? Answer:" \
    --hf_model_dir tmp/hf_models/${MODEL_NAME} \
    --visual_engine_dir visual_engines/${MODEL_NAME} \
    --llm_engine_dir trt_engines/${MODEL_NAME}/2-gpu/bfloat16/tp2
```

but works when running:

```bash
python run.py \
    --blip2_encoder \
    --max_new_tokens 30 \
    --input_text "Question: which city is this? Answer:" \
    --hf_model_dir tmp/hf_models/${MODEL_NAME} \
    --visual_engine_dir visual_engines/${MODEL_NAME} \
    --llm_engine_dir trt_engines/${MODEL_NAME}/1-gpu/bfloat16/tp1
```