Open YJHMITWEB opened 3 weeks ago
Thanks for your interest, @YJHMITWEB! How did you run the multi-node test: via the launch script we provided, or directly with torchrun? If you check launch.sh under the script folder, you can see that we haven't released (or at least haven't fully released) multi-node support yet, IIRC. cc @zheng-ningxin
Describe the bug The issue occurs when running GemmRS on two nodes, each with 4 A100 80G GPUs connected via NVLink. Each node has 1 NIC connected to IB HDR200.
Environment Each node has 4 A100 80G GPUs, connected via NVLink.
The inter-node interconnect is IB HDR200: