ROCm / rccl

ROCm Communication Collectives Library (RCCL)
https://rocmdocs.amd.com/projects/rccl/en/latest/

[Issue]: nccl nn.parallel error, needs a more experienced engineer to look #1442

Open jdgh000 opened 1 day ago

jdgh000 commented 1 day ago

Problem Description

I filed https://github.com/ROCm/rccl/issues/1421 but was led on a wild goose chase there; I need a more seasoned, experienced engineer to look into it. After it was filed:

1. It was said the issue could be reproduced.
2. Another participant, hackrill, claimed it is specific to IG, which is wrong because I had already said it is an MI250. He drew an improper conclusion only because I was initially lazy about providing the CPU info (which is irrelevant), and I later corrected that.
3. He then changed the story to say it is reproducible on IG only, but failed to provide a log, let alone a log of the supposedly successful run on a discrete GPU (i.e. MI250).
4. He keeps asking for more information that I have already provided.

This is a shambles. I cannot follow how this is being debugged, because it is being done in such a blindly random way; a more seasoned engineer needs to look at it seriously. This is a basic, common nn.parallel model that does not run on MI250: https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html On NVIDIA it just runs fine.
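For readers who do not want to open the tutorial link, a minimal sketch of the nn.DataParallel pattern it teaches is below. The model and tensor sizes are illustrative (not taken from the issue), and the snippet falls back to CPU when no GPU is visible, so it only exercises RCCL when run on a multi-GPU ROCm box.

```python
# Minimal sketch of the PyTorch nn.DataParallel tutorial pattern.
# Model name and sizes are illustrative assumptions, not from the issue.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class ToyModel(nn.Module):
    def __init__(self, in_features=5, out_features=2):
        super().__init__()
        self.fc = nn.Linear(in_features, out_features)

    def forward(self, x):
        return self.fc(x)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = ToyModel()
if torch.cuda.device_count() > 1:
    # nn.DataParallel splits each batch across all visible GPUs;
    # on ROCm this multi-GPU path is what involves RCCL.
    model = nn.DataParallel(model)
model.to(device)

data = TensorDataset(torch.randn(30, 5))
for (batch,) in DataLoader(data, batch_size=10):
    out = model(batch.to(device))
    print("in:", batch.size(), "out:", out.size())
```

On a single-GPU or CPU machine this runs the plain module, which is one way to check whether the failure is specific to the multi-GPU (RCCL) path.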

Operating System

rhel9

CPU

epyc

GPU

mi250

ROCm Version

ROCm 6.2.0

ROCm Component

rccl

Steps to Reproduce

see https://github.com/ROCm/rccl/issues/1421 for details.
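As a hedged suggestion not present in the original report: RCCL honors the standard NCCL debug environment variables, so capturing a log along these lines would let whoever picks this up compare MI250 runs directly. The script name is a placeholder; substitute the actual repro from #1421.

```shell
# Capture RCCL/NCCL debug output while running the repro script
# (script name is illustrative; use the actual repro from #1421).
export NCCL_DEBUG=INFO              # print init, topology, and collective info
export NCCL_DEBUG_SUBSYS=INIT,COLL  # limit output to init and collective subsystems
python data_parallel_tutorial.py 2>&1 | tee rccl_debug.log
```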

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

nileshnegi commented 1 day ago

Able to run on 1 and 8 gfx90a GPUs. See attached logs: stdout_ngpu1.log stdout_ngpu8.log

How are you building ROCm PyTorch?
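Not part of the thread, but for reference: if the answer is "prebuilt wheels" rather than a source build, one common route (an assumption here; verify the index URL for your ROCm version on pytorch.org) is the ROCm wheel index:

```shell
# Illustrative: install a prebuilt ROCm 6.2 PyTorch wheel.
# Check pytorch.org for the index matching your installed ROCm version.
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.2
```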