ROCm / pytorch

RCCL AllReduce fails on docker container #1683

Open · dongmin-ra opened this issue 3 weeks ago

dongmin-ra commented 3 weeks ago

🐛 Describe the bug

On the images rocm/pytorch:rocm6.2_ubuntu20.04_py3.9_pytorch_release_2.3.0 and rocm/pytorch:rocm6.2_ubuntu22.04_py3.9_pytorch_release_2.2.1, RCCL AllReduce fails.

docker run command

docker run -it \
   --network=host \
   --group-add=video \
   --ipc=host \
   --cap-add=SYS_PTRACE \
   --security-opt seccomp=unconfined \
   --device /dev/kfd  --device /dev/dri \
   --rm \
   rocm/pytorch:rocm6.2_ubuntu20.04_py3.9_pytorch_release_2.3.0 \
   bash

test script

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def setup(rank, world_size):
    # rendezvous over localhost for a single-node run
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '23355'

    # initialize the process group ("nccl" maps to RCCL on ROCm builds)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

def test(rank, world_size):
    setup(rank, world_size)
    # bind this process to its own GPU before creating tensors
    torch.cuda.set_device(rank)

    # sum a tensor of ones across both ranks; rank 0 prints the result
    t = torch.ones(10, 10).cuda()
    dist.all_reduce(t)
    if rank == 0:
        print(t)
    cleanup()

def run_demo(demo_fn, world_size):
    mp.spawn(demo_fn,
             args=(world_size,),
             nprocs=world_size,
             join=True)

if __name__ == "__main__":
    run_demo(test, 2)
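
For reference, a typical way to launch the script inside the container (the filename test_allreduce.py is an assumed placeholder, not from the original report):

python test_allreduce.py

Running it on the affected images produces the error below.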

error message

:0:rocdevice.cpp            :2984: 185480406807 us: [pid:1757  tid:0x7f4fa11ff700] Callback: Queue 0x7f4f98c00000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016
:0:rocdevice.cpp            :2984: 185480408190 us: [pid:1756  tid:0x7f5517bff700] Callback: Queue 0x7f530ce00000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016
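
The HSA_STATUS_ERROR_EXCEPTION indicates the device queue aborted after a hardware exception in a kernel. One way to gather more context, assuming the standard HIP and RCCL debug switches (AMD_LOG_LEVEL and NCCL_DEBUG are existing environment variables; the script filename is the placeholder used above), is to re-run with logging enabled:

AMD_LOG_LEVEL=3 NCCL_DEBUG=INFO python test_allreduce.py 2>&1 | tee allreduce_debug.log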

Versions

Collecting environment information...
PyTorch version: 2.3.0a0+git96dd291
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 6.2.41133-dd7f95766

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: 18.0.0git (https://github.com/RadeonOpenCompute/llvm-project roc-6.2.0 24292 26466ce804ac523b398608f17388eb6d605a3f09)
CMake version: version 3.26.4
Libc version: glibc-2.31

Python version: 3.9.19 (main, May  6 2024, 19:43:03)  [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.13.0-35-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: AMD Instinct MI250X/MI250 (gfx90a:sramecc+:xnack-)
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 6.2.41133
MIOpen runtime version: 3.2.0
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   48 bits physical, 48 bits virtual
CPU(s):                          96
On-line CPU(s) list:             0-95
Thread(s) per core:              2
Core(s) per socket:              24
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       AuthenticAMD
CPU family:                      25
Model:                           1
Model name:                      AMD EPYC 7413 24-Core Processor
Stepping:                        1
Frequency boost:                 enabled
CPU MHz:                         1500.000
CPU max MHz:                     3630.8101
CPU min MHz:                     1500.0000
BogoMIPS:                        5289.81
Virtualization:                  AMD-V
L1d cache:                       1.5 MiB
L1i cache:                       1.5 MiB
L2 cache:                        24 MiB
L3 cache:                        256 MiB
NUMA node0 CPU(s):               0-23,48-71
NUMA node1 CPU(s):               24-47,72-95
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; LFENCE, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca

Versions of relevant libraries:
[pip3] mypy==1.8.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.20.3
[pip3] optree==0.9.1
[pip3] torch==2.3.0a0+git96dd291
[pip3] torchvision==0.18.0a0+68ba7ec
[pip3] triton==2.3.0
[conda] No relevant packages
ppanchad-amd commented 1 week ago

Hi @dongmin-ra. An internal ticket has been created to investigate your issue. Thanks!

huanrwan-amd commented 4 days ago

Hi @dongmin-ra, thanks for providing the test script; I was able to run it successfully on our MI300 machines. For the failure, can you provide a full log, along with the output of rocminfo and rocm-smi? Thanks.
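
For anyone collecting the requested diagnostics inside the container, a minimal sketch (output filenames are placeholders; the debug variables from the note above can be added for a more verbose log):

rocminfo > rocminfo.txt
rocm-smi > rocm-smi.txt
python test_allreduce.py > full_log.txt 2>&1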