-
Hi, after I run the command `./rdma-tutorial lsy 1234`, I get a `segmentation fault` error. Any help?
-
**Describe the bug**
I ran the BytePS benchmarks shown in `step-by-step-tutorial.md` on an AWS EC2 p3dn.24xlarge instance. This instance has a 100 Gbps network and 8 V100 GPUs connected by NVLink, which …
-
Dear authors,
I'm very interested in the Grasper paper and would like to try it myself. I noticed that grasper-conf.ini has an option USE_RDMA = true. Does this mean that Gr…
-
**Describe the bug**
Check failed: mr happens on the scheduler when RDMA is enabled
**To Reproduce**
We have 2 GPU nodes and 2 CPU nodes and can run BytePS over TCP/IP successfully, but when trying t…
yma11 updated 3 years ago
-
Hello,
I was following the step-by-step tutorial and tried to build from source.
Single-machine training with DMLC_NUM_WORKER=1 and multiple GPUs runs fine (up to 8 GPUs), but whe…
-
Hello, I noticed the experiment description "We conduct the experiments mostly on a platform consisting of two servers ..." in the paper.
I would like to ask how to configure or modify the code and script …
-
We have identified that deploying multiple modules for data transmission on each server within the cluster leads to second-level latency tails in RDMA cluster data transfers, as detailed in [issue 997…
-
**Describe the bug**
When I run BytePS with RDMA on 2 nodes, node 2 can't bind to node 1's scheduler.
**To Reproduce**
Steps to reproduce the behavior:
1. Build the PyTorch Docker file: docker buil…
-
With Mike's help, I found that the model I submitted to Brain-Score generates the following errors after submission.
Mike told me that this error happens because, when doing engineering scoring, …