-
I have the same hardware environment and the same network, but I could not get the same result as you; I get almost half of it. Are there any best practices or experience you can share? Thanks very much! For BytePS with 1 instance and 8 GPUs, I ha…
-
I have applied the following MPIJob YAML. I observe that when I run the workers with only the GPU specified in the resources section, the TF2 job proceeds very fast, at `3s` per epoch. The TF2 job is…
-
When running [IDRIS-Hackathon](https://github.com/DifferentiableUniverseInitiative/IDRIS-hackathon) `fft_benchmark.job`, I get the following message:
```
W /horovod/horovod/common/stall_inspector.cc…
```
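This warning is emitted when the ranks diverge: some ranks have submitted a tensor for collective reduction while others never do, so the waiting ranks stall. As a hedged illustration (not the hackathon code), a minimal sketch that reproduces such a stall with Horovod's TensorFlow API:

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Every rank except rank 0 submits the tensor; rank 0 never does.
# After the stall check interval (HOROVOD_STALL_CHECK_TIME_SECONDS,
# 60s by default) the coordinator logs the stall_inspector warning
# naming the missing rank and the stalled tensor.
if hvd.rank() != 0:
    hvd.allreduce(tf.constant(1.0), name="stalled_tensor")
```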
-
The jobs were extended to enable multi-GPU usage for the torch backend (see #444 and #445). The `horovod_num_processes` variable name is now incorrect. This change needs to be made carefully, since this…
-
## 🐛 Bug
`Trainer.accumulation_scheduler` does not exist, which makes the strategy [code](https://github.com/Lightning-AI/lightning-Horovod/blob/main/src/lightning_horovod/strategy.py#L150) fail.
…
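For context, newer Lightning releases removed `Trainer.accumulation_scheduler` and expose only the resolved `Trainer.accumulate_grad_batches` value. A minimal backward-compatible accessor the strategy could use is sketched below; `_accumulation_factor` and this wiring are hypothetical, not the lightning-Horovod code:

```python
def _accumulation_factor(trainer) -> int:
    # Older Lightning: a GradientAccumulationScheduler callback with a
    # `scheduling` dict mapping epoch -> accumulation factor.
    scheduler = getattr(trainer, "accumulation_scheduler", None)
    if scheduler is not None and hasattr(scheduler, "scheduling"):
        epochs = [e for e in scheduler.scheduling if e <= trainer.current_epoch]
        if epochs:
            return scheduler.scheduling[max(epochs)]
    # Newer Lightning: the Trainer exposes the resolved value directly.
    return trainer.accumulate_grad_batches
```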
-
Test integration
```
pytest test/integration/sagemaker/test_horovod.py --docker-base-name sm-tf-horovod-integration --tag latest --framework-version 1.15.0 --processor gpu
```
Error stacktrace:
…
-
# Data Parallelism
Data parallelism replicates the model on every device to generate gradients independently and then communicates those gradients at each iteration to keep model replicas consiste…
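A minimal sketch of this pattern using Horovod's PyTorch API; the model and data are placeholders, while the Horovod calls are the standard API:

```python
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())            # one GPU per process

model = torch.nn.Linear(10, 1).cuda()              # replica on every device
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer: gradients are allreduce-averaged across replicas
# at each step, keeping the replicas consistent.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters()
)

# Start all replicas from identical parameters and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for x, y in [(torch.randn(32, 10).cuda(), torch.randn(32, 1).cuda())]:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()                                # local gradients
    optimizer.step()                               # averaged gradients applied
```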
-
I found that the TensorFlow process gets stuck after training runs for almost 30 hours, and this problem can be reproduced every time. The usage of all GPUs is 100% when all processes hang.
I have rai…
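For a hang like this, a hedged first diagnostic step (not a fix) is to turn on Horovod's stall and timeline diagnostics to see which rank and tensor are blocking; the environment variables below are real Horovod settings, but they must take effect before Horovod initializes, and the timeline path is a placeholder:

```python
import os

os.environ["HOROVOD_LOG_LEVEL"] = "debug"
os.environ["HOROVOD_STALL_CHECK_TIME_SECONDS"] = "30"    # warn sooner than the 60s default
os.environ["HOROVOD_TIMELINE"] = "/tmp/hvd_timeline.json"  # inspect in chrome://tracing

import horovod.tensorflow as hvd
hvd.init()
```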
-
This issue is to track the development work needed to finalize and validate the Mesh TensorFlow implementation that relies on Horovod as the backend. This overarching goal will encapsulate several smaller i…
-
**Environment:**
1. Framework: MXNet
2. Framework version: 1.6.x
3. Horovod version: 0.18.2
4. MPI version: MPICH 3.x
5. CUDA version: 10.1
6. NCCL version: 2.x
7. Python version: 3.6
8. Spar…