-
I have the same hardware environment and the same network, but I could not get the same result as you; I get almost half of it. Are there any best practices or experience you can share? Thanks very much! For BytePS with 1 instance and 8 GPUs, I ha…
-
I have applied the following MPIJob YAML. I observe that when I run the workers with only the GPU specified in the resources section, the TF2 job proceeds very fast, at `3s` per epoch. The TF2 job is…
-
When running [IDRIS-Hackathon](https://github.com/DifferentiableUniverseInitiative/IDRIS-hackathon) `fft_benchmark.job`, I get the following message:
```
W /horovod/horovod/common/stall_inspector.cc…
```
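This warning is emitted when the ranks diverge: some ranks have submitted a tensor for collective reduction while others never do, so the waiting ranks stall. As a hedged illustration (not the hackathon code), a minimal sketch that reproduces such a stall with Horovod's TensorFlow API:

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Every rank except rank 0 submits the tensor; rank 0 never does.
# After the stall check interval (HOROVOD_STALL_CHECK_TIME_SECONDS,
# 60s by default) the coordinator logs the stall_inspector warning
# naming the missing rank and the stalled tensor.
if hvd.rank() != 0:
    hvd.allreduce(tf.constant(1.0), name="stalled_tensor")
```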
-
The jobs were extended to enable multi-GPU usage for the torch backend (see #444 and #445). The `horovod_num_processes` variable name is now incorrect. This change needs to be made carefully, since this…
-
## 🐛 Bug
`Trainer.accumulation_scheduler` does not exist, which makes the strategy [code](https://github.com/Lightning-AI/lightning-Horovod/blob/main/src/lightning_horovod/strategy.py#L150) fail.
…
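For context, newer Lightning releases removed `Trainer.accumulation_scheduler` and expose only the resolved `Trainer.accumulate_grad_batches` value. A minimal backward-compatible accessor the strategy could use is sketched below; `_accumulation_factor` and this wiring are hypothetical, not the lightning-Horovod code:

```python
def _accumulation_factor(trainer) -> int:
    # Older Lightning: a GradientAccumulationScheduler callback with a
    # `scheduling` dict mapping epoch -> accumulation factor.
    scheduler = getattr(trainer, "accumulation_scheduler", None)
    if scheduler is not None and hasattr(scheduler, "scheduling"):
        epochs = [e for e in scheduler.scheduling if e <= trainer.current_epoch]
        if epochs:
            return scheduler.scheduling[max(epochs)]
    # Newer Lightning: the Trainer exposes the resolved value directly.
    return trainer.accumulate_grad_batches
```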
-
Test integration
```
pytest test/integration/sagemaker/test_horovod.py --docker-base-name sm-tf-horovod-integration --tag latest --framework-version 1.15.0 --processor gpu
```
Error stacktrace:
…
-
# Data Parallelism
Data parallelism replicates the model on every device to generate gradients independently and then communicates those gradients at each iteration to keep model replicas consiste…
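A minimal sketch of this pattern using Horovod's PyTorch API; the model and data are placeholders, while the Horovod calls are the standard API:

```python
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())            # one GPU per process

model = torch.nn.Linear(10, 1).cuda()              # replica on every device
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer: gradients are allreduce-averaged across replicas
# at each step, keeping the replicas consistent.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters()
)

# Start all replicas from identical parameters and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for x, y in [(torch.randn(32, 10).cuda(), torch.randn(32, 1).cuda())]:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()                                # local gradients
    optimizer.step()                               # averaged gradients applied
```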
-
I found that the TensorFlow process gets stuck after training runs for almost 30 hours, and this problem can be reproduced every time. The usage of all GPUs is 100% when all processes hang.
I have rai…
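For a hang like this, a hedged first diagnostic step (not a fix) is to turn on Horovod's stall and timeline diagnostics to see which rank and tensor are blocking; the environment variables below are real Horovod settings, but they must take effect before Horovod initializes, and the timeline path is a placeholder:

```python
import os

os.environ["HOROVOD_LOG_LEVEL"] = "debug"
os.environ["HOROVOD_STALL_CHECK_TIME_SECONDS"] = "30"    # warn sooner than the 60s default
os.environ["HOROVOD_TIMELINE"] = "/tmp/hvd_timeline.json"  # inspect in chrome://tracing

import horovod.tensorflow as hvd
hvd.init()
```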
-
This issue is to track the development work needed to finalize and validate the Mesh TensorFlow implementation that relies on Horovod as the backend. This overarching goal will encapsulate several smaller i…
-
**Environment:**
1. Framework: MXNet
2. Framework version: 1.6.x
3. Horovod version: 0.18.2
4. MPI version: MPICH 3.x
5. CUDA version: 10.1
6. NCCL version: 2.x
7. Python version: 3.6
8. Spar…