Open · tqdo opened this issue 4 years ago
Just wondering whether GluonTS supports distributed training, particularly on SLURM. I have access to multiple GPUs on my university's cluster and would like to utilize them if possible. Training with one GPU is too slow given the size of my dataset.

GluonTS is built on Apache MXNet, so if SLURM supports distributed MXNet training, then in theory GluonTS should as well. Note, however, that many RNN-based models involve sequential training steps that MXNet needs to execute; Amdahl's law suggests you may not see significant scaling across multiple GPUs.
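As a starting point, here is a minimal sketch (not an official GluonTS distributed-training recipe) of picking a GPU context for the GluonTS `Trainer` inside a SLURM job. The dataset, estimator choice, and hyperparameters are placeholders, and import paths and `Trainer` arguments differ between GluonTS versions, so treat the exact names as assumptions to check against your installed version.

```python
# Sketch: single-GPU GluonTS training that falls back to CPU if no GPU is
# visible in the SLURM allocation. This does NOT set up multi-GPU data
# parallelism; it only selects one device for the Trainer.
import mxnet as mx
from gluonts.dataset.common import ListDataset
from gluonts.model.deepar import DeepAREstimator
from gluonts.mx.trainer import Trainer  # older versions: from gluonts.trainer import Trainer

# Use the first GPU MXNet can see (e.g. the one SLURM assigned), else CPU.
ctx = mx.gpu(0) if mx.context.num_gpus() > 0 else mx.cpu()

# Toy hourly dataset; replace with your own data. "start" and "target" are
# the standard ListDataset fields.
train_ds = ListDataset(
    [{"start": "2021-01-01 00:00:00",
      "target": [float(i % 24) for i in range(500)]}],
    freq="H",
)

estimator = DeepAREstimator(
    freq="H",
    prediction_length=24,
    trainer=Trainer(ctx=ctx, epochs=5),  # single-device training
)

predictor = estimator.train(training_data=train_ds)
```

For scaling beyond one device, the parallelism would have to come from MXNet itself (for example its kvstore-based data parallelism or Horovod integration) rather than from the stock GluonTS `Trainer`, and, per the comment above, RNN-based models may not benefit much from additional GPUs anyway.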