AIStream-Peelout / flow-forecast

Deep learning PyTorch library for time series forecasting, classification, and anomaly detection (originally for flood forecasting).
https://flow-forecast.atlassian.net/wiki/spaces/FF/overview
GNU General Public License v3.0

Support multi-GPU training #676

Open srogatch opened 1 year ago

srogatch commented 1 year ago

So far I haven't found a way to train on multiple GPUs within the same computer. If there is one, please describe how to do it.

isaacmg commented 1 year ago

Hello, sorry for the delay. We do currently have Docker containers which you can use with Wandb to perform a distributed hyper-parameter sweep. IMO multi-GPU training for a single model isn't much benefit: it is very hard to saturate even a single GPU unless you have huge batch sizes, and the bottleneck generally comes from other things.
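For reference, a minimal sketch of driving such a sweep with Wandb's Python API; the config keys, metric name, and project name here are illustrative placeholders, not FF's actual settings:

```python
# Hypothetical W&B sweep driver; each Docker container / GPU runs an agent.
import wandb

sweep_config = {
    "method": "random",
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "batch_size": {"values": [32, 64, 128]},
        "lr": {"min": 1e-4, "max": 1e-2},
    },
}

def train():
    # One sweep trial: wandb injects the sampled hyper-parameters into run.config.
    with wandb.init() as run:
        cfg = run.config
        # ... build and train one model using cfg.batch_size / cfg.lr ...
        run.log({"val_loss": 0.0})  # placeholder metric

sweep_id = wandb.sweep(sweep_config, project="flow-forecast-sweeps")
wandb.agent(sweep_id, function=train)  # pulls trials until the sweep is done
```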

srogatch commented 1 year ago

I have a batch size of 64, history length 1440, lookahead 480, and 2 million points in the time series, each consisting of 4 values. A single GPU is currently at 97-100% utilization, and judging from power consumption it is indeed fully saturated, so I could benefit from multiple GPUs.

isaacmg commented 1 year ago

Interesting, I've never really run into that problem before. Let me look into it. FF is built on top of PyTorch, of course, so hopefully it is something I could add reasonably quickly. Out of the box, as of now, we don't support it, as we mainly use model.to().

srogatch commented 1 year ago

Yes, we need to wrap the model in a DistributedDataParallel object, add a multi-process launch, get the local rank of each process, and use it as the device parameter in model.to(). I had planned to add this myself, but unfortunately I then had to postpone this project because some higher priorities came up.
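For reference, roughly what those steps look like in plain PyTorch; this is a minimal sketch with a dummy model and dataset (not FF's actual training loop), assuming one process per GPU launched with `torchrun --nproc_per_node=<num_gpus> train_ddp.py`:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler


def main():
    # torchrun sets LOCAL_RANK (and RANK/WORLD_SIZE) for each spawned process.
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    # Placeholder data/model standing in for the real time-series dataset and model.
    dataset = TensorDataset(torch.randn(2048, 4), torch.randn(2048, 1))
    sampler = DistributedSampler(dataset)          # shards the data across ranks
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    model = nn.Linear(4, 1).to(local_rank)         # model.to(device) with the local rank
    model = DDP(model, device_ids=[local_rank])    # wrap for gradient synchronization

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)                   # reshuffle the shards each epoch
        for x, y in loader:
            x, y = x.to(local_rank), y.to(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()                        # gradients are all-reduced across GPUs here
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```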