awslabs / gluonts

Probabilistic time series modeling in Python
https://ts.gluon.ai
Apache License 2.0

Distributed training in SLURM #565

Open · tqdo opened this issue 4 years ago

tqdo commented 4 years ago

Just wondering whether GluonTS supports distributed training, particularly on SLURM. I have access to multiple GPUs on my university's cluster and would like to use them if possible. Training on a single GPU is too slow given the size of my dataset.

gjmulder commented 4 years ago

GluonTS is built on Apache MXNet. If SLURM supports distributed MXNet training, then in theory GluonTS should as well.
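For context, GluonTS itself only lets you choose which MXNet device a training run uses; spreading a single job across several GPUs would have to come from MXNet-level tooling (e.g. Horovod) or from launching one independent job per GPU under SLURM. Below is a minimal sketch of pinning training to one GPU via the MXNet context. It assumes the MXNet backend; the exact import paths and the `ctx`/`epochs` arguments may differ across GluonTS versions.

```python
# Minimal sketch (version-dependent imports): train DeepAR on one GPU.
import mxnet as mx

from gluonts.dataset.repository.datasets import get_dataset
from gluonts.model.deepar import DeepAREstimator
from gluonts.mx.trainer import Trainer

dataset = get_dataset("m4_hourly")

estimator = DeepAREstimator(
    freq=dataset.metadata.freq,
    prediction_length=dataset.metadata.prediction_length,
    # ctx selects the MXNet device; mx.gpu(0) uses the first GPU SLURM assigned.
    trainer=Trainer(ctx=mx.gpu(0), epochs=10),
)

predictor = estimator.train(dataset.train)
```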

Note, however, that many RNN-based models involve inherently sequential training steps that MXNet must execute one after another. By Amdahl's law, that sequential fraction limits the achievable speedup, so you may not see significant scaling from adding more GPUs.
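To make the Amdahl's law point concrete, here is a rough back-of-the-envelope calculation. The 50% sequential fraction is purely illustrative, not a measurement of any GluonTS model:

```python
# Amdahl's law: speedup = 1 / ((1 - p) + p / n), where p is the fraction of
# the work that parallelizes and n is the number of GPUs.
def amdahl_speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

# Illustrative only: if half of each training step is sequential (p = 0.5),
# 4 GPUs give at most ~1.6x and 8 GPUs only ~1.78x.
print(amdahl_speedup(0.5, 4))  # ~1.6
print(amdahl_speedup(0.5, 8))  # ~1.78
```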