-
### Summary
no
### DeePMD-kit Version
x
### Backend and its version
x
### Python Version, CUDA Version, GCC Version, LAMMPS Version, etc
_No response_
### Details
I tried to install deepmd-ki…
-
Can we have distributed training with Horovod? We want to speed up LM training on a cluster of GPU machines.
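For context on the request above: Horovod speeds up training by allreduce-averaging each worker's gradients every step. A toy plain-Python sketch of that averaging (no Horovod or GPUs required; the worker count and gradient values below are made up for illustration):

```python
# Toy sketch: data-parallel training keeps workers in sync by averaging
# their gradients elementwise each step, which is what Horovod's
# allreduce does under the hood. No real framework is used here.

def allreduce_average(per_worker_grads):
    """Average a list of per-worker gradient vectors elementwise."""
    n_workers = len(per_worker_grads)
    return [sum(vals) / n_workers for vals in zip(*per_worker_grads)]

# Gradients computed independently on 4 hypothetical workers:
grads = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
    [7.0, 8.0],
]
print(allreduce_average(grads))  # → [4.0, 5.0]
```

Each worker then applies the same averaged gradient, so all replicas stay identical while the effective batch size scales with the number of GPUs.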
-
**Describe the bug**
We've been doing some benchmarking of Horovod vs. BytePS on AWS. We were hoping to see some performance improvements from using BytePS for 64-GPU jobs. We've noticed that BytePS…
-
**Install Environment:**
VMware15.0, Ubuntu 18.04, python3.6.9
**Error:**
Running setup.py install for horovod ... error
ERROR: Command errored out with exit status 1:
command: /h…
-
Hi, is there any chance support for DistributedDataParallel or Horovod could be added for efficient distributed training?
I have a dataset of approx. 10M images worth of video frames that I'd like …
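On the question above: both DistributedDataParallel and Horovod split a large dataset so each worker processes a disjoint shard. A minimal plain-Python sketch of that sharding idea (names and numbers are illustrative, not any library's API):

```python
# Hypothetical sketch of dataset sharding for distributed training:
# each of `world_size` workers owns a strided, disjoint slice of the
# sample indices, so e.g. ~10M frames get split evenly across ranks.

def shard_indices(num_samples, world_size, rank):
    """Return the strided subset of sample indices owned by `rank`."""
    return list(range(rank, num_samples, world_size))

# Example: 10 samples spread over 4 workers
for rank in range(4):
    print(rank, shard_indices(10, 4, rank))
# rank 0 → [0, 4, 8], rank 1 → [1, 5, 9], rank 2 → [2, 6], rank 3 → [3, 7]
```

In PyTorch this role is played by `torch.utils.data.distributed.DistributedSampler`, which does the equivalent strided split per rank.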
-
The script "trainer.py" contains "import horovod.torch as hvd", but I can't find any info about Horovod. Why?
I want to speed up training by using distributed training on GPUs; maybe it ca…
-
**Environment:**
1. Framework: (TensorFlow, Keras, PyTorch, MXNet) PyTorch
2. Framework version: 2.0.1+cu117
3. Horovod version: 0.28.1
4. MPI version: 4.0.3
5. CUDA version:
6. NCCL version:
7…
-
In my training I saw the messages below and am not sure of their impact. Can anyone help explain?
1469.worker.1 | [2019-06-04T16:49:20Z] [2019-06-04 16:49:19.737850: W horovod/common/operations.cc:588] One or mo…
-
**Environment:**
1. Framework: TensorFlow
2. Framework version: 2.16
3. Horovod version: 0.28.1
4. MPI version:
5. CUDA version: 12.2
6. NCCL version:
7. Python version: 3.11.8
8. Spark / …
-