Problem Description
I filed https://github.com/ROCm/rccl/issues/1421, but I am being led on a wild goose chase there and need a more seasoned, experienced engineer to look into it. Since it was filed:
1) The issue says you were able to reproduce it.
2) Another person, hackrill, says it is specific to integrated GPUs (IG). That is wrong, because I had already stated that the GPU is an MI250. He drew that conclusion because I initially left out the CPU info (which is irrelevant here); I later corrected that.
3) Later he changed his story to say it is reproducible only on IG, but he has not provided a log for that, let alone a log of the supposedly successful run on a discrete GPU (i.e. MI250).
4) He keeps asking for more information that I have already provided.
This is a shambles. I cannot follow how the issue is being debugged, because it is being done in such a random way. A more seasoned engineer needs to look at it seriously, because this is a basic, common nn.parallel model that does not run on MI250:
https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html
On NVIDIA, it just runs fine.
Operating System
rhel9
CPU
epyc
GPU
mi250
ROCm Version
ROCm 6.2.0
ROCm Component
rccl
Steps to Reproduce
see https://github.com/ROCm/rccl/issues/1421 for details.
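For readers who do not want to open the linked issue first, here is a minimal sketch of the kind of script involved. It follows the general shape of the PyTorch data parallel tutorial linked above and is not necessarily the exact script attached to the issue. The names `TinyModel` and `run_once` and the tensor sizes are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class TinyModel(nn.Module):
    """Single linear layer, similar in spirit to the tutorial's toy model."""

    def __init__(self, input_size: int = 5, output_size: int = 2) -> None:
        super().__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(x)


def run_once(batch_size: int = 30, input_size: int = 5) -> torch.Tensor:
    """Run one forward pass, wrapping the model in nn.DataParallel
    when more than one GPU is visible."""
    # ROCm builds of PyTorch expose AMD GPUs through the torch.cuda API.
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    model = TinyModel(input_size=input_size)
    if torch.cuda.device_count() > 1:
        # Replicates the module across GPUs and splits the batch between them.
        model = nn.DataParallel(model)
    model.to(device)
    batch = torch.randn(batch_size, input_size, device=device)
    with torch.no_grad():
        return model(batch)


if __name__ == "__main__":
    print("visible GPUs:", torch.cuda.device_count())
    print("output shape:", tuple(run_once().shape))
```

On a machine with a single GPU or none, the `nn.DataParallel` branch is skipped, so the multi-GPU path is only exercised when at least two devices are visible. Running the script with the `NCCL_DEBUG=INFO` environment variable set, which RCCL is understood to honor, may produce a log worth attaching to the issue.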
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response