dongmin-ra opened this issue 3 weeks ago
Hi @dongmin-ra. An internal ticket has been created to investigate your issue. Thanks!
Hi @dongmin-ra, thanks for providing the test script. I can run it successfully on our MI300 machines. For the failure, could you provide the full log and the output of rocminfo and rocm-smi? Thanks.
🐛 Describe the bug
On the images rocm/pytorch:rocm6.2_ubuntu20.04_py3.9_pytorch_release_2.3.0 and rocm/pytorch:rocm6.2_ubuntu22.04_py3.9_pytorch_release_2.2.1, RCCL allreduce fails.
Docker run command
Test script
Error message
Versions