NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Incorrect GPU Assignment in MPI Inter-Node Processing with Single GPU Nodes #1494

Closed littlefatfat closed 1 week ago

littlefatfat commented 4 months ago

System Info

Who can help?

No response

Information

Tasks

Reproduction

  1. cd examples/llama
  2. Convert the checkpoint and build the TensorRT-LLM engines (tp=2).
  3. Set up password-less SSH access between the containers and create hostfile.txt listing both nodes (a sketch follows this list).
  4. mpirun -n 2 --hostfile hostfile.txt --allow-run-as-root python3 ../run.py --max_output_len=160 --tokenizer_dir /host/huggingface/Llama-2-7b-chat-hf/ --engine_dir /tmp/kunlun/cache/models/trt_engines/llama/fp16/tp2-2gpu/ --input_text "In python, write a function for binary searching an element in an integer array."
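
A minimal hostfile.txt sketch for this two-node, one-GPU-per-node setup (the hostnames are placeholders; slots=1 pins one rank per node):

  node01 slots=1
  node02 slots=1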

Expected behavior

The GPU index should be assigned as rank % actual_number_of_GPUs.
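
For example, with two ranks and a single GPU per node, rank 1 should map to device 1 % 1 = 0 rather than device 1.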

actual behavior

Rank 1 is incorrectly assigned to GPU 1 because 8 GPUs per node are assumed. Error message:

  RuntimeError: [TensorRT-LLM][ERROR] CUDA runtime error in cudaSetDevice(device): invalid device ordinal (/home/jenkins/agent/workspace/LLM/release-0.9/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/runtime/utils/sessionUtils.cpp:34)

additional notes

Description of the issue: When running MPI inter-node inference of llama2-7b across two machines with one T4 GPU each (two ranks, tp=2), rank 1 attempts to access GPU 1, which does not exist on a single-GPU node. Code inspection suggests that both the C++ and Python implementations assume 8 GPUs per node by default (gpus_per_node set to 8). Instead, the GPU should be selected from the number of GPUs actually available on the node, using rank % num_gpus.
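
A rough sketch of the proposed selection logic (not the actual TensorRT-LLM code; it assumes mpi4py and PyTorch are available and uses illustrative names):

  from mpi4py import MPI
  import torch

  rank = MPI.COMM_WORLD.Get_rank()
  num_gpus = torch.cuda.device_count()  # GPUs actually visible on this node
  device_id = rank % num_gpus           # instead of assuming 8 GPUs per node
  torch.cuda.set_device(device_id)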

littlefatfat commented 4 months ago

I will submit a suggested fix soon. After the change, the Python implementation runs inference successfully. The modified C++ version gets past the initial error but then hits other errors, including an NCCL error that can be debugged with NCCL_DEBUG=INFO for more details:

  Failed, NCCL error /code/tensorrt_llm/cpp/tensorrt_llm/plugins/common/plugin.cpp:86 'unhandled system error (run with NCCL_DEBUG=INFO for details)'
  sinian-t4-devel:4530:4530 [0] misc/socket.cc:50 NCCL WARN socketProgress: Connection closed by remote peer tensorrt_llm_1gpu-devel-ruiyanm.test_network<33936>
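
For reference, with Open MPI the NCCL debug variable can be exported to all ranks on the launch command itself (same run.py arguments as in the reproduction steps; -x exports an environment variable to the remote ranks):

  mpirun -n 2 --hostfile hostfile.txt --allow-run-as-root -x NCCL_DEBUG=INFO python3 ../run.py ...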

MartinMarciniszyn commented 3 months ago

@Funatiq, could you please take a look at the PR?

Funatiq commented 3 months ago

gpus_per_node can be defined when building the engine. It is also stored in the config.json file. Can you please try setting the desired value in the config file or building the engine with the corresponding parameter?
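
For example (a sketch only; it assumes the build CLI of this release exposes the option under the name gpus_per_node, so please check trtllm-build --help, and the bracketed paths are placeholders):

  trtllm-build --checkpoint_dir <converted_checkpoint_dir> \
               --gpus_per_node 1 \
               --output_dir <engine_dir>
  grep gpus_per_node <engine_dir>/config.json   # confirm the value stored in the engine config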