-
**Environment:**
1. Framework: (TensorFlow, Keras, PyTorch, MXNet)
2. Framework version: PyTorch
3. Horovod version: 0.28.0
4. MPI version: 4.0.7
5. CUDA version: 11.4
6. NCCL version: 2.11.4
7. Pyt…
-
**Environment:**
Kubernetes Version: 1.21.9
Cloud Provider: Azure (AKS)
**Bug report:**
I am using the official Helm Chart provided by Horovod at this link (https://github.com/horovod/horovod/tr…
-
### 🚀 The feature, motivation and pitch
A good profiling tool appears to be lacking for both DDP and FSDP.
### Alternatives
None.
### Additional context
Something like Horovod Timeline but bette…
-
**Environment:**
1. Framework: PyTorch
2. Framework version:
3. Horovod version:0.26.1
4. MPI version:
5. CUDA version: 11.1
6. NCCL version: nccl-local-repo-rhel7-2.8.4-cuda11.1-1.0-1.x86_64.rp…
-
* Open MPI
* Horovod
* NCCL
* Gloo
* MVAPICH2
-
See the [overview of distributed TensorFlow in general (independent of RETURNN)](https://github.com/rwth-i6/returnn/wiki/Distributed-TensorFlow) for some background.
This issue here is about the sp…
-
The following is the mpi-operator configuration file I am trying to deploy on our Kubernetes cluster.
```
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: tensorflow-mnist
spec:
…
-
The environment requirements:
```shell
(base) ray@ip-172-31-36-78:~/horovod-gpu/ray_lightning/ray_lightning/examples$ pip list | grep lightning
lightning-bolts 0.4.0
pytor…
-
I was running SSD on 4 nodes, each with 8 GPUs.
Here is the command I ran: (I can't include the mpirun or the formatting will break)
` mpirun
-np 32 \
-hostfile hosts4 \
-bind-to none …
-
**Environment:**
1. Framework: (TensorFlow, Keras)
2. Framework version:
3. Horovod version: 0.18
4. MPI version: 4.0.0
5. CUDA version:
6. NCCL version:
7. Python version: 3.6
8. OS and vers…