infiniband Search Results

1000+ results
for infiniband


pytorch/pytorch #21015

How to use Infiniband for cpu-cluster with backend gloo?

I'm trying to build pytorch from source for my cpu-cluster with backend gloo. After installing pytorch, I got this information from the install summary: ``` -- USE_DISTRIBUTED : True -- …

sth1997 updated 5 years ago
7
wtfbbqhax/libdnet #26

"dnet intf show" fails when Infiniband interface present, af…

``` What steps will reproduce the problem? 1. run "dnet intf show" on system with IB interface What is the expected output? Something similar to output from "ip addr show". 3: ib0: mtu 2044 qd…

GoogleCodeExporter updated 9 years ago
1
open-mpi/ompi #12209

MPI_TYPE_INDEXED + MPI_SEND/RECV slow with older infiniband …

Related to #12202 but without CUDA. On our shared-memory system (2xEPYC) MPI_TYPE_INDEXED works fast as expected, but as soon as our 40GBit Infiniband gets involved performance breaks down by a factor…

chhu updated 7 months ago
4
THUDM/GLM #173

Can glm-large be trained without InfiniBand?

I have two machines and want to test multi-node, multi-GPU training. I chose large-chinese, and a problem came up during training: ``` 192.168.83.245: 595d69b310a0:48344:48344 [0] NCCL INFO Launch mode Parallel 192.168.83.245: 595d69b310a0:48345:48345 [1] NCCL INFO Broadcast:…

allendred updated 5 months ago
3
microsoft/pai #2738

[Job Exporter] Cannot listen to InfiniBand or multiple netwo…

Currently, job exporter only listens to one network interface, chosen from the [configuration](https://github.com/microsoft/pai/blob/master/src/job-exporter/config/job-exporter.yaml#L20), cannot listen to …

abuccts updated 4 years ago
2
Azure/azhpc-images #33

MOFED support on CX3-Pro cards

CX3-Pro cards are not supported in newer Mellanox OFED versions, and these cards are supported through Mellanox OFED LTS version (4.9-0.1.7.0). For more information, see [Linux Drivers](https://www.m…

jithinjosepkl updated 3 months ago
3
NVIDIA/nccl #883

Training speed anomalies in multi-node task on Networking Dr…

Hi there, I'm running a multi-node training task on a SLURM cluster with a Networking Dragonfly Topology. Some of the nodes have double Infiniband while others have single Infiniband, and my nodes are…

SHshenhao updated 1 year ago
12
NVIDIA/enroot #118

99-mellanox.sh script broken when there is no "/sys/class/in…

The script [99-mellanox.sh](https://github.com/NVIDIA/enroot/blob/master/conf/hooks/99-mellanox.sh) breaks on hosts with a newer linux-rdma package. It looks like [this commit](https://patchwork.kernel…

jasonguy updated 2 years ago
5
bytedance/byteps #377

RDMA_CM_EVENT_ADDR_ERROR

**Describe the bug** When I run byteps with RDMA on 2 nodes, node 2 can't bind to node 1's scheduler. **To Reproduce** Steps to reproduce the behavior: 1. build pytorch docker file: docker buil…

Ruinhuang updated 3 years ago
2
StanfordLegion/legion #1729

Legion: collective instance freeze on slingshot-11

I believe using collective instances results in a startup freeze on slingshot-11. I have one commit of S3D that uses them (https://gitlab.com/legion_s3d/legion_s3d/-/commit/e797d71367683580933166a0080…

syamajala updated 2 months ago
7
