jdgh000 opened this issue 1 day ago
Hi @jdgh000, it looks like you are running on a laptop with integrated graphics; you can check whether rocminfo shows two graphics devices. Since integrated graphics are not supported, you can bypass the iGPU by setting the environment variable HIP_VISIBLE_DEVICES so that only the discrete GPU is used, as documented here: https://rocmdocs.amd.com/projects/HIP/en/develop/how-to/debugging.html#making-device-visible
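For example, assuming the discrete GPU is enumerated as device 0 (please confirm the ordering from the rocminfo output first):
rocminfo | grep -E "Name|Device Type"    # list the visible agents and their types
HIP_VISIBLE_DEVICES=0 python3 ex1.py     # run the example with only the dGPU visible to HIP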
Hi @jdgh000, I was able to reproduce your issue and have opened an internal ticket for further investigation.
thx, let me know,
As @zichguan-amd mentioned, this has to do with the example being run on your APU rather than a dedicated graphics card. Correct me if I'm wrong, but I believe you're running on a 5900HX. Could you try running directly on your dGPU by adding this at the top of your Python script, before torch is imported?
import os
os.environ['HIP_VISIBLE_DEVICES'] = '0'
This is not an APU; the CPU model I put in the report is wrong. It is an MI250 system. Since the CPU model is not that important, I just typed the suggested value.
Name: AMD EPYC 7763 64-Core Processor
Name: AMD EPYC 7763 64-Core Processor
Name: gfx90a
Name: gfx90a
Name: gfx90a
Name: gfx90a
Name: gfx90a
Name: gfx90a
Name: gfx90a
Name: gfx90a
In that case can you run with NCCL_DEBUG=INFO or NCCL_DEBUG=TRACE for details, as suggested by the error message?
I saw the prompt and tried it a few times, but it does not seem to output much more than without it, with either TRACE or INFO:
sudo mkdir log ; NCCL_DEBUG=INFO sudo python3 ex1.py 2>&1 | sudo tee log/ex1-NCCL_DEBUG.INFO.log
mkdir: cannot create directory ‘log’: File exists
Let's use 8 GPUs!
Traceback (most recent call last):
File "/root/pytorch/dataparallellism/1-dataparallellism/ex1.py", line 41, in <module>
output = model(input)
File "/usr/local/lib64/python3.9/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib64/python3.9/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 192, in forward
replicas = self.replicate(self.module, self.device_ids[: len(inputs)])
File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 199, in replicate
return replicate(module, device_ids, not torch.is_grad_enabled())
File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/replicate.py", line 134, in replicate
param_copies = _broadcast_coalesced_reshape(params, devices, detach)
File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/replicate.py", line 103, in _broadcast_coalesced_reshape
tensor_copies = Broadcast.apply(devices, *tensors)
File "/usr/local/lib64/python3.9/site-packages/torch/autograd/function.py", line 575, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/_functions.py", line 22, in forward
outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/comm.py", line 67, in broadcast_coalesced
return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
It seems to be failing in one of these:
/usr/local/lib64/python3.9/site-packages/torch/_C/__init__.pyi:10823:def _broadcast_coalesced(
/usr/local/lib64/python3.9/site-packages/torch/_C/_distributed_c10d.pyi:619:def _broadcast_coalesced(
but I can only see the function prototypes, not the bodies, so I cannot tell what is going on in these calls.
With sudo you need to use -E to preserve the environment variables. Also, can you upgrade to the latest ROCm 6.2.4 and PyTorch 2.5.1 and see if that fixes it?
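For example, mirroring your earlier command:
NCCL_DEBUG=INFO sudo -E python3 ex1.py 2>&1 | sudo tee log/ex1-NCCL_DEBUG.INFO.log
(-E asks sudo to keep the calling user's environment so NCCL_DEBUG actually reaches the Python process; depending on your sudoers policy you may instead need to pass the variable after sudo.)

If it helps to isolate the problem, a small script along these lines exercises the same broadcast_coalesced path that nn.DataParallel uses during replicate, without the rest of the example (the tensor size and device list here are just for illustration):

import os
os.environ.setdefault('NCCL_DEBUG', 'INFO')   # can also be set in the shell instead

import torch
from torch.nn.parallel import comm

# Broadcast one small tensor from GPU 0 to every visible GPU --
# this is the call that fails inside nn.DataParallel's replicate step.
t = torch.randn(1024, device='cuda:0')
devices = list(range(torch.cuda.device_count()))
out = comm.broadcast_coalesced([t], devices)
print('broadcast_coalesced succeeded on devices', devices)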
It is already torch 2.5.1 and ROCm 6.2.4:
torch        2.5.1+rocm6.2
torchaudio   2.5.1+rocm6.2
torchvision  0.20.1+rocm6.2
Problem Description
Ran the following example: https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html with a small modification, but it failed during the run. The failure occurs if I apply nn.DataParallel to the model (model = nn.DataParallel(model)); without that line it works.
code:
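The script is essentially the tutorial code with the nn.DataParallel line added (minimal sketch; the parameter sizes are the tutorial's defaults and the exact ex1.py may differ slightly):

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# Parameters from the tutorial
input_size = 5
output_size = 2
batch_size = 30
data_size = 100

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

rand_loader = DataLoader(dataset=RandomDataset(input_size, data_size),
                         batch_size=batch_size, shuffle=True)

class Model(nn.Module):
    def __init__(self, input_size, output_size):
        super().__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, input):
        output = self.fc(input)
        print("\tIn Model: input size", input.size(),
              "output size", output.size())
        return output

model = Model(input_size, output_size)
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # This is the modification that triggers the error; without it the script runs.
    model = nn.DataParallel(model)
model.to(device)

for data in rand_loader:
    input = data.to(device)
    output = model(input)   # fails here, in replicate/broadcast_coalesced
    print("Outside: input size", input.size(),
          "output_size", output.size())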
Operating System
rhel9
CPU
9500hx ryzen
GPU
mi250
ROCm Version
ROCm 6.2.0
ROCm Component
rccl
Steps to Reproduce
Run the example code with nn.DataParallel (actual code pasted in problem description):
https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response