-
Hi there, my MPI program hits a segfault when running on a single InfiniBand-enabled node. I'm trying to understand whether it's related to this issue: https://github.com/open-mpi/ompi/issues/6666.
…
-
### Host operating system: output of `uname -a`
Linux server01 3.10.0-957.1.3.el7.x86_64 #1 SMP Thu Nov 29 14:49:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
### node_exporter version: output of `no…
-
### Describe the bug
I am not sure if this is a bug related to UCX, but I would like to understand more about it. I built OpenMPI 5 with UCX from [HPC-X](https://developer.nvidia.com/networking/hpc-x…
-
Hello!
It seems that Linux 4.9 changed the way it represents bonded Mellanox interfaces, which breaks VMA offloading for teamed interfaces.
[r…
-
I am using UCX 1.5.1 with MLNX HDR-200. After I enabled SMT on our AMD EPYC 7742 nodes, my GROMACS job crashes right after startup:
% module load openmpi/intel19/4.0.1
% mpirun --bind-to none -…
-
I found that WITH_OFED only configures some of the userspace diagnostics, which are very helpful for configuration, but doesn't enable the actual subnet manager (without which an InfiniBand network can…
-
On 2 nodes (16 H20 GPUs), allreduce performance is 343 GBps (with NVL SHARP), but theoretically it should be able to reach 460 GBps.
```
1048576 262144 float sum -1 121.8 8.61 16.1…
-
**Describe the bug**
After upgrading to DeepSpeed 0.14.3, training does not progress because all gradients and gradient norms are zero. Based on git bisect, I think the cause is this PR:
https://git…
-
I want to run multiple broadcasts concurrently. They will all send data into one host with multiple NICs. I do not want one single NIC to be the bottleneck. So I prefer to let these broadcasts use dif…
-
In Section 4.3, where one-sided operations are discussed, we see that there are two problems in supporting one-sided operations. The first is that the local FFR does not know the corresponding s-mem on the othe…