NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

RuntimeError: NCCL error: internal error - please report this issue to the NCCL developers #1388

Open · emmanuelrajapandian opened this issue 1 month ago

emmanuelrajapandian commented 1 month ago

Hi,

I am using an AWS SageMaker ml.g5.48xlarge instance, which has 8 NVIDIA A10 GPUs. I have 4 scripts, each accessing 2 GPUs. I am using vLLM to load a Mixtral LLM onto the respective GPUs as follows:

import os
import torch
from vllm import LLM

# Pin device ordering to PCI bus IDs and expose only GPUs 6 and 7 to this script.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "6,7"

model = LLM(model='casperhansen/mixtral-instruct-awq',
            tensor_parallel_size=2,
            gpu_memory_utilization=0.7,
            trust_remote_code=True,
            dtype=torch.float16,
            seed=100,
            max_model_len=4096,
            quantization='awq')

However, I get the following error:

File ~/anaconda3/envs/tensorflow2_p310/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py:223, in NCCLLibrary.NCCL_CHECK(self, result)
    221 if result != 0:
    222     error_str = self.ncclGetErrorString(result)
--> 223     raise RuntimeError(f"NCCL error: {error_str}")

RuntimeError: NCCL error: internal error - please report this issue to the NCCL developers

The above code works fine for GPU pairs 0,1 and 2,3 but fails for pairs 4,5 and 6,7. How can I work around this error?

kiskra-nvidia commented 1 month ago

Weird. Could be an issue in the topology discovery code, I suppose. What NCCL version is that? Can you run with NCCL_DEBUG=INFO and send us the generated output (which could be large)? Are you supplying a NCCL topo file for that AWS instance by any chance?
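
If it helps, one way to capture that log (a minimal sketch; it assumes the variables are set before any CUDA/NCCL initialization, e.g. at the very top of the script or notebook) is:

import os

# Turn on verbose NCCL logging; must be set before NCCL is initialized.
os.environ["NCCL_DEBUG"] = "INFO"
# Optionally narrow the output to the init and topology/graph subsystems
# so the log stays manageable.
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,GRAPH"

Exporting the same variables in the shell before launching Python works just as well.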

emmanuelrajapandian commented 1 month ago

> Weird. Could be an issue in the topology discovery code, I suppose. What NCCL version is that? Can you run with NCCL_DEBUG=INFO and send us the generated output (which could be large)? Are you supplying a NCCL topo file for that AWS instance by any chance?

NCCL version is 2.20.5, and I am not supplying any NCCL topo file for the AWS instance. However, I found a workaround, which is as follows:

import os

os.environ['NCCL_NVLS_ENABLE'] = '0'
# Note: the second NCCL_SOCKET_IFNAME assignment below overrides the first,
# so only 'eth1' actually takes effect.
os.environ['NCCL_SOCKET_IFNAME'] = 'eth0'
os.environ['NCCL_SOCKET_IFNAME'] = 'eth1'
os.environ['NCCL_IB_DISABLE'] = '1'

Adding the above lines at the start of the script/Jupyter notebook makes the error go away. I have read in other issue threads that mismatched versions of CUDA, torch, and NCCL can cause this problem due to version incompatibility. I am still not sure whether the above workaround will hold up when I push my code to the prod pipeline; I am still testing it out.
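
For reference, a quick way to check which versions are in play (a minimal sketch; assumes a recent PyTorch where torch.cuda.nccl.version() returns a version tuple):

import torch

# Versions reported by the local PyTorch build.
print("torch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)
print("NCCL:", ".".join(str(v) for v in torch.cuda.nccl.version()))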

I will keep you posted.

kiskra-nvidia commented 1 month ago

This is single-node, correct? In that case, I wouldn't think that NCCL_IB_DISABLE would make much of a difference, and I would expect NCCL_SOCKET_IFNAME to be of marginal importance at best (NCCL will use sockets during bootstrap but the bulk of traffic should be traveling via the P2P transport using NVLinks). NCCL_NVLS_ENABLE shouldn't matter because NVLS is not supported on the Ampere architecture (you need Hopper).

So my recommendation is to try these options one-by-one and/or in different combinations to figure out which ones actually make a difference.
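
For example, something along these lines could drive that search (a rough sketch; load_model.py is a hypothetical stand-in for the failing script, and each combination runs in a fresh process because NCCL only reads these variables at initialization time):

import itertools
import os
import subprocess

# Candidate workaround variables and the values to try.
FLAGS = {
    "NCCL_NVLS_ENABLE": "0",
    "NCCL_SOCKET_IFNAME": "eth1",
    "NCCL_IB_DISABLE": "1",
}

for r in range(len(FLAGS) + 1):
    for combo in itertools.combinations(FLAGS.items(), r):
        env = dict(os.environ, **dict(combo))
        names = ",".join(name for name, _ in combo) or "(none)"
        # Re-run the model-loading script with just this subset of variables set.
        result = subprocess.run(["python", "load_model.py"], env=env)
        print(f"{names}: {'OK' if result.returncode == 0 else 'FAILED'}")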

emmanuelrajapandian commented 1 month ago

Will try them one-by-one and in different combinations. Thanks, @kiskra-nvidia, for explaining how each of these NCCL variables works.