I will submit a suggested fix soon. After the change, the Python implementation works and inference succeeds. The C++ version gets past the initial error after the modification, but then triggers other errors, including an NCCL error that can be debugged with NCCL_DEBUG=INFO for more details:

Failed, NCCL error /code/tensorrt_llm/cpp/tensorrt_llm/plugins/common/plugin.cpp:86 'unhandled system error (run with NCCL_DEBUG=INFO for details)'
sinian-t4-devel:4530:4530 [0] misc/socket.cc:50 NCCL WARN socketProgress: Connection closed by remote peer tensorrt_llm_1gpu-devel-ruiyanm.test_network<33936>
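For anyone hitting the same error, a minimal sketch of enabling the NCCL debug output from Python before any communicator is created. The NCCL_SOCKET_IFNAME line is an assumption about a common cause ("Connection closed by remote peer" during bootstrap often means NCCL picked the wrong network interface), and "eth0" is a hypothetical interface name for your cluster:

```python
import os

# Sketch: enable verbose NCCL logging; must be set before the first
# NCCL communicator is initialized.
os.environ["NCCL_DEBUG"] = "INFO"

# Assumption: if the log shows the bootstrap connection being dropped,
# pinning the network interface explicitly can help. "eth0" is hypothetical.
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")
```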
@Funatiq, could you please take a look at the PR?
gpus_per_node can be defined when building the engine. It is also stored in the config.json file. Could you please try setting the desired value in the config file, or building the engine with the corresponding parameter?
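For illustration, a hedged sketch of overriding the stored value directly in an existing engine's config.json. The exact key path varies across TensorRT-LLM versions, so treat the "build_config" location and the engine directory name below as assumptions and check your own config.json first:

```python
import json

# Sketch: force gpus_per_node to match the actual hardware (one T4 per node).
engine_dir = "llama2-7b-engine"  # hypothetical engine directory

with open(f"{engine_dir}/config.json") as f:
    cfg = json.load(f)

# Assumption: the key lives under "build_config" in this version's config.json;
# the location may differ in other TensorRT-LLM releases.
cfg["build_config"]["gpus_per_node"] = 1

with open(f"{engine_dir}/config.json", "w") as f:
    json.dump(cfg, f, indent=2)
```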
System Info
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Expected behavior
The GPU index should be assigned as rank % actual_number_of_GPUs, i.e. the rank modulo the number of GPUs actually present on the node.
Actual behavior
Rank 1 is incorrectly assigned to GPU 1 under the assumption of 8 GPUs per node, even though the node only has a single GPU (index 0). Error message:

RuntimeError: [TensorRT-LLM][ERROR] CUDA runtime error in cudaSetDevice(device): invalid device ordinal (/home/jenkins/agent/workspace/LLM/release-0.9/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/runtime/utils/sessionUtils.cpp:34)
Additional notes
Description of the issue: when running inter-node inference over MPI with llama2-7b on two machines, each equipped with a single T4 GPU, there are two ranks in total (tp=2) and rank 1 attempts to access GPU 1, which does not exist. Code inspection shows that both the C++ and Python implementations assume 8 GPUs per node by default (gpus_per_node is set to 8). Instead, the GPU should be chosen based on the number actually available on the node, using the formula rank % num_gpus, as sketched below.
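A minimal sketch of the proposed behavior, assuming mpi4py and PyTorch purely for illustration; the actual fix would live in TensorRT-LLM's C++ (e.g. sessionUtils.cpp) and Python runtimes rather than in user code:

```python
from mpi4py import MPI
import torch

def select_device() -> int:
    """Pick a CUDA device from the MPI rank and the node's real GPU count."""
    rank = MPI.COMM_WORLD.Get_rank()
    num_gpus = torch.cuda.device_count()  # GPUs actually visible on this node
    device = rank % num_gpus              # instead of rank % hard-coded 8
    torch.cuda.set_device(device)
    return device

# On two single-GPU nodes with world_size=2:
# rank 0 -> 0 % 1 = 0 on node A, rank 1 -> 1 % 1 = 0 on node B,
# so rank 1 no longer requests the nonexistent device ordinal 1.
```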