ROCm / rccl

ROCm Communication Collectives Library (RCCL)
https://rocmdocs.amd.com/projects/rccl/en/latest/

[Issue]: tried to use nn.DataParallel, but it crashed #1421

Open jdgh000 opened 1 day ago

jdgh000 commented 1 day ago

Problem Description

Ran the following example, https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html, with little modification, but it failed during the run. The crash only occurs if I apply nn.DataParallel to the model (model = nn.DataParallel(model)); without it, the script works.

code:

import sys
sys.path.append('..')
from classes import *

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

class RandomDataset(Dataset):
    DEBUG = 0
    DEBUGL2 = 0

    def __init__(self, size, length):

        if self.DEBUG:
            print("GG: RandomDataset.__init__(size=", size, "length: ", length, ")")

        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):

        if self.DEBUGL2:
            print("GG: RandomDataset.__getitem__(index=", index, ")")

        return self.data[index]

    def __len__(self):

        if self.DEBUG:
            print("GG: RandomDataset.__len__() returning self.len: ", self.len)

        return len(self.data)

# Parameters and DataLoaders
input_size = 1000
output_size = 10

batch_size = 1000
data_size = 60000

if not torch.cuda.is_available():
    print("GPU is not detected.")
    quit(1)

device = torch.device("cuda:0")

# Create random data set: input size = 1k, data_size = 60k, batch_size: 1k.

rand_loader = DataLoader(dataset=RandomDataset(input_size, data_size),
                         batch_size=batch_size, shuffle=True)

model = Model(input_size, output_size)

if torch.cuda.device_count() > 1:
  print("Let's use", torch.cuda.device_count(), "GPUs!")
  # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
  model = nn.DataParallel(model)

model.to(device)

for data in rand_loader:
    input = data.to(device)
    output = model(input)
    print("Outside: input size", input.size(), "output_size", output.size())
Run output:

[root@u488 dataparallellism]$ sudo python3 ex1.py
Let's use 8 GPUs!
Traceback (most recent call last):
  File "/root/pytorch/dataparallellism/ex1.py", line 41, in <module>
    output = model(input)
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 192, in forward
    replicas = self.replicate(self.module, self.device_ids[: len(inputs)])
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 199, in replicate
    return replicate(module, device_ids, not torch.is_grad_enabled())
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/replicate.py", line 134, in replicate
    param_copies = _broadcast_coalesced_reshape(params, devices, detach)
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/replicate.py", line 103, in _broadcast_coalesced_reshape
    tensor_copies = Broadcast.apply(devices, *tensors)
  File "/usr/local/lib64/python3.9/site-packages/torch/autograd/function.py", line 575, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/_functions.py", line 22, in forward
    outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/comm.py", line 67, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
[root@u488 dataparallellism]$ nano -w "/usr/local/lib64/python3.9/site-packages/torch/nn/modules/module.py"
[root@u488 dataparallellism]$ cat /opt/rocm/.info/version
6.2.0-66

Operating System

rhel9

CPU

9500hx ryzen

GPU

mi250

ROCm Version

ROCm 6.2.0

ROCm Component

rccl

Steps to Reproduce

Run the example code with nn.DataParallel (actual code pasted in the problem description):

https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

harkgill-amd commented 1 day ago

Hi @jdgh000, I was able to reproduce your issue and have opened an internal ticket for further investigation.

zichguan-amd commented 1 day ago

Hi @jdgh000, it looks like you are running on a laptop with integrated graphics; you can check whether rocminfo shows two graphics devices. Since integrated graphics are not supported, you can bypass this by setting the environment variable HIP_VISIBLE_DEVICES so that only the discrete GPU is used, as documented here: https://rocmdocs.amd.com/projects/HIP/en/develop/how-to/debugging.html#making-device-visible
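
For illustration, a minimal sketch of that workaround (the device index 0 is an assumption for whichever device rocminfo reports as the discrete GPU; the variable must be set before torch initializes the GPU runtime):

import os
# Restrict the ROCm runtime to the first enumerated GPU before torch starts.
os.environ["HIP_VISIBLE_DEVICES"] = "0"

import torch

# Only the selected device should now be visible.
print("visible devices:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))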

jdgh000 commented 1 day ago

> Hi @jdgh000, I was able to reproduce your issue and have opened an internal ticket for further investigation.

thx, let me know,

harkgill-amd commented 1 day ago

As @zichguan-amd mentioned, this has to do with the example being run on your APU rather than a dedicated graphics card. Correct me if I'm wrong, but I believe you're running on a 5900HX. Could you try running directly on your dGPU by adding this line at the top of your Python script?


import os
os.environ['HIP_VISIBLE_DEVICES'] = '0'

jdgh000 commented 1 day ago

This is not an APU for sure; the CPU model I put in is wrong. The GPU is an MI250. Since the CPU model is not that important, I just typed the suggested value.

jdgh000 commented 1 day ago

Name: AMD EPYC 7763 64-Core Processor
Name: AMD EPYC 7763 64-Core Processor
Name: gfx90a
Name: gfx90a
Name: gfx90a
Name: gfx90a
Name: gfx90a
Name: gfx90a
Name: gfx90a
Name: gfx90a

zichguan-amd commented 19 hours ago

In that case, can you run with NCCL_DEBUG=INFO or NCCL_DEBUG=TRACE for details, as suggested by the error message?

jdgh000 commented 14 hours ago

I saw the prompt and tried a few times, but it does not seem to output anything more than without it, with either TRACE or INFO:

sudo mkdir log ; NCCL_DEBUG=INFO sudo python3 ex1.py 2>&1 | sudo tee log/ex1-NCCL_DEBUG.INFO.log
mkdir: cannot create directory ‘log’: File exists
Let's use 8 GPUs!
Traceback (most recent call last):
  File "/root/pytorch/dataparallellism/1-dataparallellism/ex1.py", line 41, in <module>
    output = model(input)
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 192, in forward
    replicas = self.replicate(self.module, self.device_ids[: len(inputs)])
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 199, in replicate
    return replicate(module, device_ids, not torch.is_grad_enabled())
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/replicate.py", line 134, in replicate
    param_copies = _broadcast_coalesced_reshape(params, devices, detach)
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/replicate.py", line 103, in _broadcast_coalesced_reshape
    tensor_copies = Broadcast.apply(devices, *tensors)
  File "/usr/local/lib64/python3.9/site-packages/torch/autograd/function.py", line 575, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/_functions.py", line 22, in forward
    outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/comm.py", line 67, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details)

jdgh000 commented 14 hours ago

Seems to be failing in one of these:

/usr/local/lib64/python3.9/site-packages/torch/_C/__init__.pyi:10823: def _broadcast_coalesced(
/usr/local/lib64/python3.9/site-packages/torch/_C/_distributed_c10d.pyi:619: def _broadcast_coalesced(

but I can only see the function prototypes, not the bodies, so I cannot see what is going on in these calls.
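
If it helps to narrow things down, the same call path can probably be exercised without nn.DataParallel; a sketch, assuming torch.cuda.comm.broadcast_coalesced is the public wrapper around the torch._C._broadcast_coalesced call shown in the traceback:

import torch
import torch.cuda.comm as comm

# Broadcast one small tensor from GPU 0 to every visible GPU; this hits the
# same RCCL broadcast path that DataParallel's replicate() uses.
t = torch.randn(10, 10, device="cuda:0")
devices = list(range(torch.cuda.device_count()))
copies = comm.broadcast_coalesced([t], devices)
print("broadcast succeeded on", len(copies), "devices")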

zichguan-amd commented 11 hours ago

With sudo you need to use -E to preserve the environment variables. Also, can you upgrade to the latest ROCm 6.2.4 and PyTorch 2.5.1 and see if that fixes it?
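
Alternatively, the variable can be set inside the script itself, which sidesteps sudo dropping the environment. A sketch; RCCL reads NCCL_DEBUG when the first communicator is created, which here happens at the first DataParallel forward pass, so setting it at the top of ex1.py is early enough:

import os

# Set before any collective runs so the RCCL init code sees it.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_FILE"] = "/tmp/rccl-debug.%h.%p.log"  # optional: write the log to a file

import torch  # the rest of ex1.py follows unchanged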

jdgh000 commented 9 hours ago

It is already torch 2.5.1 and ROCm 6.2.4:

torch        2.5.1+rocm6.2
torchaudio   2.5.1+rocm6.2
torchvision  0.20.1+rocm6.2