NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Incorrect GPU Assignment in MPI Inter-Node Processing with Single GPU Nodes #1494

Closed littlefatfat closed 1 week ago

littlefatfat commented 4 months ago

System Info

Who can help?

No response

Information

Tasks

Reproduction

  1. cd examples/llama
  2. Convert the checkpoint and build the TensorRT-LLM engines (tp=2).
  3. Set up password-less SSH access between the containers and create hostfile.txt listing both nodes (a sketch follows this list).
  4. mpirun -n 2 --hostfile hostfile.txt --allow-run-as-root python3 ../run.py --max_output_len=160 --tokenizer_dir /host/huggingface/Llama-2-7b-chat-hf/ --engine_dir /tmp/kunlun/cache/models/trt_engines/llama/fp16/tp2-2gpu/ --input_text "In python, write a function for binary searching an element in an integer array."
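
A minimal hostfile.txt sketch for this two-node, one-GPU-per-node setup (the hostnames are placeholders; slots=1 pins one rank per node):

  node01 slots=1
  node02 slots=1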

Expected behavior

The GPU index should be assigned as rank % actual_number_of_GPUs.
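
For example, with two ranks and a single GPU per node, rank 1 should map to device 1 % 1 = 0 rather than device 1.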

actual behavior

Rank 1 is incorrectly assigned to GPU 1 because 8 GPUs per node are assumed. Error message:

  RuntimeError: [TensorRT-LLM][ERROR] CUDA runtime error in cudaSetDevice(device): invalid device ordinal (/home/jenkins/agent/workspace/LLM/release-0.9/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/runtime/utils/sessionUtils.cpp:34)

additional notes

Description of the issue: When running MPI inter-node inference of llama2-7b across two machines with one T4 GPU each (two ranks, tp=2), rank 1 attempts to access GPU 1, which does not exist on a single-GPU node. Code inspection suggests that both the C++ and Python implementations assume 8 GPUs per node by default (gpus_per_node set to 8). Instead, the GPU should be selected from the number of GPUs actually available on the node, using rank % num_gpus.
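
A rough sketch of the proposed selection logic (not the actual TensorRT-LLM code; it assumes mpi4py and PyTorch are available and uses illustrative names):

  from mpi4py import MPI
  import torch

  rank = MPI.COMM_WORLD.Get_rank()
  num_gpus = torch.cuda.device_count()  # GPUs actually visible on this node
  device_id = rank % num_gpus           # instead of assuming 8 GPUs per node
  torch.cuda.set_device(device_id)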

littlefatfat commented 4 months ago

I will submit a suggested fix soon. After the change, the Python implementation runs inference successfully. The modified C++ version gets past the initial error but then hits other errors, including an NCCL error that can be debugged with NCCL_DEBUG=INFO for more details:

  Failed, NCCL error /code/tensorrt_llm/cpp/tensorrt_llm/plugins/common/plugin.cpp:86 'unhandled system error (run with NCCL_DEBUG=INFO for details)'
  sinian-t4-devel:4530:4530 [0] misc/socket.cc:50 NCCL WARN socketProgress: Connection closed by remote peer tensorrt_llm_1gpu-devel-ruiyanm.test_network<33936>
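
For reference, with Open MPI the NCCL debug variable can be exported to all ranks on the launch command itself (same run.py arguments as in the reproduction steps; -x exports an environment variable to the remote ranks):

  mpirun -n 2 --hostfile hostfile.txt --allow-run-as-root -x NCCL_DEBUG=INFO python3 ../run.py ...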

MartinMarciniszyn commented 3 months ago

@Funatiq, could you please take a look at the PR?

Funatiq commented 3 months ago

gpus_per_node can be defined when building the engine. It is also stored in the config.json file. Can you please try setting the desired value in the config file or building the engine with the corresponding parameter?
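
For example (a sketch only; it assumes the build CLI of this release exposes the option under the name gpus_per_node, so please check trtllm-build --help, and the bracketed paths are placeholders):

  trtllm-build --checkpoint_dir <converted_checkpoint_dir> \
               --gpus_per_node 1 \
               --output_dir <engine_dir>
  grep gpus_per_node <engine_dir>/config.json   # confirm the value stored in the engine config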