-
I'm trying to build PyTorch from source for my CPU cluster with the Gloo backend.
After installing PyTorch, I got this information from the install summary:
```
-- USE_DISTRIBUTED : True
-- …
```
-
What steps will reproduce the problem?
1. Run "dnet intf show" on a system with an IB interface
What is the expected output?
Something similar to output from "ip addr show".
3: ib0: mtu 2044 qd…
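For comparison with the expected output, here is a minimal sketch of pulling the interface index, name, and MTU out of `ip addr show`-style header lines. It assumes each interface block starts with a line of the form `N: name: ... mtu M ...`. The function name `parse_interfaces` is hypothetical and is not part of dnet or iproute2.

```python
import re

# Header line of an interface block, e.g. "3: ib0: <flags> mtu 2044 ..."
# (assumed format; the flags section may be missing).
_HEADER = re.compile(r"^(\d+):\s+([^:\s]+):\s*(.*)$")
_MTU = re.compile(r"\bmtu\s+(\d+)")


def parse_interfaces(text):
    """Return a list of (index, name, mtu) tuples; mtu is None if not shown."""
    interfaces = []
    for line in text.splitlines():
        match = _HEADER.match(line)
        if not match:
            continue  # indented address lines do not start with "N: name:"
        index, name, rest = match.groups()
        mtu = _MTU.search(rest)
        interfaces.append((int(index), name, int(mtu.group(1)) if mtu else None))
    return interfaces
```

Running this over the output of both commands would show whether the IB interface is listed by one and missing from the other.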
-
Related to #12202 but without CUDA. On our shared-memory system (2x EPYC), MPI_TYPE_INDEXED is as fast as expected, but as soon as our 40 Gbit InfiniBand gets involved, performance breaks down by a factor…
chhu updated
7 months ago
-
I have two machines and want to test multi-node, multi-GPU training. I chose large-chinese, and a problem now occurs during training:
```
192.168.83.245: 595d69b310a0:48344:48344 [0] NCCL INFO Launch mode Parallel
192.168.83.245: 595d69b310a0:48345:48345 [1] NCCL INFO Broadcast:…
```
-
Currently, job exporter listens on only one network interface, chosen from the [configuration](https://github.com/microsoft/pai/blob/master/src/job-exporter/config/job-exporter.yaml#L20), and cannot listen to …
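As a rough illustration of listening on more than one address in a single process, here is a minimal sketch using only the Python standard library. It is not the job exporter's code: `open_listeners` and its parameters are hypothetical, and it assumes plain IPv4 TCP sockets, one per address.

```python
import selectors
import socket


def open_listeners(addresses, port=0):
    """Bind one listening TCP socket per address and register it with a selector."""
    selector = selectors.DefaultSelector()
    listeners = []
    for address in addresses:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind((address, port))  # port 0 lets the OS pick a free port
        sock.listen()
        sock.setblocking(False)
        selector.register(sock, selectors.EVENT_READ)
        listeners.append(sock)
    return selector, listeners
```

A serving loop would then call `selector.select()` and `accept()` on whichever listener is reported ready, so a request arriving on any of the configured addresses is handled.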
-
CX3-Pro cards are not supported in newer Mellanox OFED versions, and these cards are supported through Mellanox OFED LTS version (4.9-0.1.7.0). For more information, see [Linux Drivers](https://www.m…
-
Hi there, I'm running a multi-node training task on a SLURM cluster with a Dragonfly network topology. Some of the nodes have dual InfiniBand while others have single InfiniBand, and my nodes are…
-
The script [99-mellanox.sh](https://github.com/NVIDIA/enroot/blob/master/conf/hooks/99-mellanox.sh) breaks on hosts with a newer linux-rdma package. It looks like [this commit](https://patchwork.kernel…
-
**Describe the bug**
When I run BytePS with RDMA on 2 nodes, node 2 can't bind to node 1's scheduler.
**To Reproduce**
Steps to reproduce the behavior:
1. Build the PyTorch Docker file: docker buil…
-
I believe using collective instances results in a startup freeze on Slingshot-11. I have one commit of S3D that uses them (https://gitlab.com/legion_s3d/legion_s3d/-/commit/e797d71367683580933166a0080…