-
Hi, I have a few questions about training PyramidNet:
1. How long did it take to fully train PyramidNet with Horovod (1 node, 4 GPUs)?
2. Which versions of PyTorch and Horovod did you use?
3. Any tricks wh…
-
When launching multiple processes with Horovod, downloading the data may introduce a race condition:
```
File "run_pretraining_hvd.py", line 255, in
start_step=args.start_step)
File "/home/ec2…
-
**The asynchronous collective communication layer also avoids having an expensive central coordinator, as used for invoking synchronous collective communication operations in existing systems, such as …
-
C++ version issue
```
Building: Step 14/24 : RUN ldconfig /usr/local/cuda-9.0/targets/x86_64-linux/lib/stubs && HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 HOROVO…
-
I got this error after about 100k iterations:
Fri Oct 23 17:16:00 2020[0]:[2020-10-23 17:16:00.707119: W horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, g…
-
## Description
Undefined symbol error when importing Horovod with the stable release of MXNet on PyPI.
Related to https://github.com/apache/incubator-mxnet/issues/16193
### Error Message
```
python…
-
TF2Estimator fails with the Horovod backend when data_creator returns a tf.data.Dataset built from a generator. The error is shown below:
2021-07-21 04:17:10,169 WARNING worker.py:1107 -- A worker died or was killed whil…
-
- [x] #159
- [x] Create a README under `docs/` explaining how to manage documentation
- [x] Organize RST files in subfolders reflecting the order of the navigation
- [x] Add Python script for tutor…
-
It would be great to improve the documentation for first-time users. The current documentation assumes users know the Horovod APIs, but users who are just getting into distributed training do not necessarily…
-
Hello, what is your Horovod version?
It failed to work with Horovod 0.15.2.
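
For version questions like this one, it can help to paste the exact installed versions into the report. Below is a minimal sketch that uses only the standard library (`importlib.metadata`, Python 3.8+). The helper name `installed_versions` is hypothetical, and the distribution names `horovod`, `torch`, `tensorflow`, and `mxnet` are assumptions about how the packages were installed. A variant such as `tensorflow-gpu` would need to be added to the list.

```python
from importlib import metadata


def installed_versions(packages=("horovod", "torch", "tensorflow", "mxnet")):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            # Reads the version from the installed distribution's metadata,
            # without importing the package itself.
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Because the sketch reads package metadata rather than importing the frameworks, it should still report versions in an environment where `import horovod` itself fails.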