bytedance / flux

A fast communication-overlapping library for tensor parallelism on GPUs.
Apache License 2.0

[BUG] Illegal memory with multi-node #40

Open YJHMITWEB opened 3 weeks ago

YJHMITWEB commented 3 weeks ago

Describe the bug: Running GemmRS on two nodes, each with 4 A100 80G GPUs connected via NVLink and a single NIC to IB HDR200, fails with an illegal memory access (full log below).
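For context, GemmRS presumably fuses a GEMM with the reduce-scatter of its output across the tensor-parallel group; the unfused PyTorch sketch below shows that computation. The function name, shapes, and the exact semantics of test_gemm_rs.py are assumptions for illustration, not taken from the flux API.

import torch
import torch.distributed as dist

def gemm_reduce_scatter_reference(x, w, tp_group):
    # Unfused reference for a row-parallel linear: each rank holds a K-shard
    # of x ([M, K/world]) and w ([K/world, N]), computes a partial GEMM, then
    # the partials are summed and row-scattered across the TP group.
    partial = x @ w                                   # [M, N] partial sum on this rank
    world = dist.get_world_size(tp_group)
    out = torch.empty(partial.shape[0] // world, partial.shape[1],
                      dtype=partial.dtype, device=partial.device)
    dist.reduce_scatter_tensor(out, partial, op=dist.ReduceOp.SUM, group=tp_group)
    return out                                        # [M/world, N] per rank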

W0907 22:34:09.000000 22438061766464 torch/distributed/run.py:779] 
W0907 22:34:09.000000 22438061766464 torch/distributed/run.py:779] *****************************************
W0907 22:34:09.000000 22438061766464 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0907 22:34:09.000000 22438061766464 torch/distributed/run.py:779] *****************************************
before flux shm initialization
before flux shm initialization
before flux shm initialization
before flux shm initialization
after flux shm initialization
after flux shm initialization
after flux shm initialization
after flux shm initialization
[rank1]:[E907 22:34:22.723784108 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1724789115564/work/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x145b3b587f86 in flux/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x145b3b536d10 in flux/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x145b3b663ee8 in flux/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x145b3c86fa36 in flux/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x145b3c874c50 in flux/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x145b3c87b90a in flux/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x145b3c87dd4c in flux/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdbbf4 (0x145b9ef0bbf4 in flux/lib/python3.10/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #8: <unknown function> + 0x81ca (0x145bb4c711ca in /lib64/libpthread.so.0)
frame #9: clone + 0x43 (0x145bb4153e73 in /lib64/libc.so.6)

[rank2]:[E907 22:34:22.821294480 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 2] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1724789115564/work/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x14e20c855f86 in flux/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x14e20c804d10 in flux/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x14e20c931ee8 in flux/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x14e20db3da36 in flux/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x14e20db42c50 in flux/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x14e20db4990a in flux/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x14e20db4bd4c in flux/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdbbf4 (0x14e2701d9bf4 in flux/lib/python3.10/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #8: <unknown function> + 0x81ca (0x14e285f3f1ca in /lib64/libpthread.so.0)
frame #9: clone + 0x43 (0x14e285421e73 in /lib64/libc.so.6)

[rank3]:[E907 22:34:22.821407422 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 3] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1724789115564/work/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x1529560c1f86 in flux/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x152956070d10 in flux/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x15295619dee8 in flux/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x1529573a9a36 in flux/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x1529573aec50 in flux/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x1529573b590a in flux/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x1529573b7d4c in flux/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdbbf4 (0x1529b9a45bf4 in flux/lib/python3.10/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #8: <unknown function> + 0x81ca (0x1529cf7ab1ca in /lib64/libpthread.so.0)
frame #9: clone + 0x43 (0x1529cec8de73 in /lib64/libc.so.6)

[rank0]:[E907 22:34:22.823452114 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1724789115564/work/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x153118a11f86 in flux/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x1531189c0d10 in flux/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x153118aedee8 in flux/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x153119cf9a36 in flux/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x153119cfec50 in flux/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x153119d0590a in flux/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x153119d07d4c in flux/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdbbf4 (0x15317c395bf4 in flux/lib/python3.10/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #8: <unknown function> + 0x81ca (0x1531920fb1ca in /lib64/libpthread.so.0)
frame #9: clone + 0x43 (0x1531915dde73 in /lib64/libc.so.6)

W0907 22:34:23.038000 22438061766464 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1834997 closing signal SIGTERM
W0907 22:34:23.038000 22438061766464 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1834999 closing signal SIGTERM
W0907 22:34:23.038000 22438061766464 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1835000 closing signal SIGTERM
E0907 22:34:23.482000 22438061766464 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 1 (pid: 1834998) of binary: flux/bin/python
Traceback (most recent call last):
  File "flux/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.4.1', 'console_scripts', 'torchrun')())
  File "flux/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "flux/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "flux/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "flux/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "flux/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
========================================================
test/test_gemm_rs.py FAILED
--------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
--------------------------------------------------------

To Reproduce: Launch test/test_gemm_rs.py with torchrun across the two nodes (4 GPUs per node), as sketched below.
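The exact command line is not in the report; this is a hedged sketch of a two-node torchrun launch. The rendezvous host, port, and node rank are placeholders, and test_gemm_rs.py may take additional arguments not shown in the log.

$ torchrun --nnodes=2 --nproc_per_node=4 --node_rank=<0 or 1> \
    --rdzv_backend=c10d --rdzv_endpoint=<node0_host>:<port> \
    test/test_gemm_rs.py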

Expected behavior: The GemmRS test completes on both nodes without CUDA errors.

Stack trace/logs: See the watchdog stack traces and torchrun traceback above.
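As the log itself notes, CUDA errors are reported asynchronously, so the watchdog stack traces may not point at the faulting kernel. A hedged way to get a more precise trace (at the cost of performance) is to rerun the same launch with synchronous kernel error reporting:

$ CUDA_LAUNCH_BLOCKING=1 torchrun --nnodes=2 --nproc_per_node=4 ... test/test_gemm_rs.py

(TORCH_USE_CUDA_DSA, also mentioned in the log, requires a PyTorch build with device-side assertions enabled.)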

Environment: Each node has 4 A100 80G GPUs connected via NVLink.

$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV4     NV4     NV4     SYS     24-34,36-46     1               N/A
GPU1    NV4      X      NV4     NV4     SYS     0-10,12-22      0               N/A
GPU2    NV4     NV4      X      NV4     NODE    72-82,84-94     3               N/A
GPU3    NV4     NV4     NV4      X      SYS     48-58,60-70     2               N/A
NIC0    SYS     SYS     NODE    SYS      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0

The inter-node interconnect is IB HDR200:

$ ibstat
CA 'mlx5_0'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.36.1010
        Hardware version: 0
        Node GUID: 0xe8ebd30300401a34
        System image GUID: 0xe8ebd30300401a34
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 200
                Base lid: 766
                LMC: 0
                SM lid: 1864
                Capability mask: 0xa651e848
                Port GUID: 0xe8ebd30300401a34
                Link layer: InfiniBand
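Since each node exposes a single HCA (mlx5_0), one possible diagnostic step (not suggested in the thread, just a common NCCL sanity check) is to pin NCCL to that device and enable its logging before rerunning:

$ export NCCL_IB_HCA=mlx5_0
$ export NCCL_DEBUG=INFO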


wenlei-bao commented 2 weeks ago

Thanks for your interest! @YJHMITWEB How do you run the test for multi-node: via the launch script we provided, or just torchrun? If you check launch.sh under the script folder, you can see that we haven't released (or at least not fully released) multi-node support yet, IIRC. cc @zheng-ningxin