StatCan / aaw

Documentation for the Advanced Analytics Workspace Platform
https://statcan.github.io/aaw/

Support Construction Starts project by investigating running fastai/pytorch jobs in distributed multi-GPU setup #347

Closed: ca-scribner closed this issue 3 years ago

ca-scribner commented 3 years ago

Goals:

ca-scribner commented 3 years ago

For general distributed pytorch GPU job status, see #360. For construction-starts-specific details, read below.

ca-scribner commented 3 years ago

Summary

I was able to get a fastai v1.58.0/pytorch v1.4 (the versions used by construction starts) job to run in Distributed Data Parallel mode (DDP, the recommended mode for distributing training across GPUs on the same or separate nodes) across two GPUs (two nodes of 1 GPU each) on AAW. There were a few hurdles along the way, as discussed below.

In the short term this does work, but it takes some effort and is hard without some extra dev privileges (mainly being able to build your own custom images). In the longer term, a little extra dev work on our side would make this very usable: a python/shell launch utility could handle the yaml syntax so people don't need to know it (see the sketch below), maybe some tooling to help with data transfer, and some procedures for iterating on code without needing new images to be built.
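As a rough illustration of what that launch utility might look like (this is a sketch, not an existing AAW tool; the manifest layout is based on the Kubeflow PyTorchJob CRD, and the job/image names are placeholders):

```python
# sketch_launcher.py -- hypothetical helper that renders a PyTorchJob manifest
# so users don't have to hand-write the yaml. Not an existing AAW utility.
import yaml


def render_pytorchjob(name, image, command, n_workers=1, gpus_per_replica=1):
    """Build a Kubeflow PyTorchJob manifest as a dict and return it as yaml."""

    def replica(n):
        return {
            "replicas": n,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {
                    "containers": [{
                        "name": "pytorch",  # the PyTorch operator expects this container name
                        "image": image,
                        "command": command,
                        "resources": {"limits": {"nvidia.com/gpu": gpus_per_replica}},
                    }]
                }
            },
        }

    job = {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica(1),          # rank 0
                "Worker": replica(n_workers),  # remaining ranks
            }
        },
    }
    return yaml.safe_dump(job)


if __name__ == "__main__":
    # e.g. `python sketch_launcher.py > job.yaml && kubectl apply -f job.yaml`
    print(render_pytorchjob("construction-starts-ddp",
                            "myregistry/fastai-ddp:latest",
                            ["python", "train.py"]))
```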

Whether distributed training actually reduces training times (i.e., whether jobs scale with more GPUs) is another question and very problem specific :)

A generic walkthrough of distributed training is under construction in our docs (note: this link points to a branch because it hasn't merged into master yet. If you get a file not found, check the master branch at /docs/en/1-Experiments/Distributed-Training.md).

How to do Distributed Data Parallel

This link describes things in more detail, and I've also uploaded some examples to the Construction Starts Modelling repo:
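The repo examples themselves aren't reproduced here, but a minimal sketch of the kind of DDP entrypoint involved (fastai v1 / pytorch 1.4 style, assuming one process per pod and that the PyTorch job operator injects MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE into the environment; the dataset is a placeholder, not the construction starts data) looks roughly like this:

```python
# ddp_sketch.py -- minimal DDP entrypoint sketch (fastai v1 / pytorch 1.4 era APIs).
# Assumes one process per pod; MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE come from the
# job operator via the environment (read by init_method="env://").
import torch
import torch.distributed as dist
from fastai.vision import *          # fastai v1 imports (untar_data, ImageDataBunch, ...)
from fastai.distributed import *     # provides Learner.to_distributed in fastai v1


def main():
    # join the process group; env:// reads the operator-provided environment variables
    dist.init_process_group(backend="nccl", init_method="env://")

    # one GPU per pod, so the local device is always cuda:0
    cuda_id = 0
    torch.cuda.set_device(cuda_id)

    # placeholder data/model, not the construction starts pipeline
    path = untar_data(URLs.MNIST_SAMPLE)
    data = ImageDataBunch.from_folder(path, bs=64)
    learn = cnn_learner(data, models.resnet18, metrics=accuracy)

    # wraps the model in DistributedDataParallel and shards batches across ranks
    learn = learn.to_distributed(cuda_id)
    learn.fit_one_cycle(1)


if __name__ == "__main__":
    main()
```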

Hurdles for Regular Use

The big roadblocks for using this regularly are:

Results with Multiple GPUs

I definitely saw speedup in some cases, but it was situational. For toy problems I saw near-linear speedup, but for the construction starts project it was closer to 0.25-0.5x linear (e.g., 2 GPUs were 1.25-1.5x as fast as 1 GPU). This probably could have been improved with some learning rate tweaking (GPU scaling like this effectively scales the batch size, which would let you use a larger learning rate). I played with it a little, but not a ton.
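For reference, the batch size / learning rate interaction mentioned above boils down to the usual linear scaling heuristic (a starting point, not something tuned for construction starts):

```python
# each DDP process sees the same per-GPU batch size, so the effective batch size grows
# with the number of processes; a common heuristic is to scale the learning rate linearly.
import torch.distributed as dist

per_gpu_batch_size = 64
base_lr = 1e-3  # learning rate tuned for a single GPU

world_size = dist.get_world_size() if dist.is_initialized() else 1
effective_batch_size = per_gpu_batch_size * world_size
scaled_lr = base_lr * world_size  # linear scaling rule; still worth validating empirically

print(f"effective batch size: {effective_batch_size}, scaled lr: {scaled_lr}")
# e.g. with 2 GPUs: effective batch size 128, scaled lr 0.002
```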

ca-scribner commented 3 years ago

cc @chritter

ca-scribner commented 3 years ago

Training from S3 Storage

Along the way, while trying to find a better way to transfer training data to each worker, I also set up a trainer that trains directly from S3 storage rather than from local files. The objective was to make it so you never need to care where your files are.

The effort was a partial success, but with more work I think it could be fully successful. I trained directly from S3 (see this example); the only problem is that the current implementation cannot be parallelized. These data loaders usually get their speed through parallelization (the typical setup has multiple data loader workers loading images to feed the GPU), but the way they parallelize breaks the way I passed an S3 client to the loader*. So the current demo trains slowly, although that should be overcome by fixing the parallel data loader problem. We could also add things like local caching of images, etc., if it became a real problem.

* The data object pickles itself and passes that pickle to each parallel loader, but a logged-in boto3 client cannot be pickled! So if we set num_workers>1 at the moment, it'll break with a funny error. I think the way around it is to pass credentials instead of a logged-in boto3 client, and have each loader log itself in when it starts fetching data. Should be an easy fix (I think there are some fixtures in the data loaders that help with this), but I didn't have time to do it.

** I think this is easier in newer versions of fast.ai/pytorch; the older APIs made this all a bit harder.
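A rough sketch of that credential-passing fix (class, bucket, and credential names are hypothetical; transforms and labels are omitted): the dataset stores only plain credentials, which pickle fine, and each DataLoader worker builds its own boto3 client lazily.

```python
import io

import boto3
from PIL import Image
from torch.utils.data import Dataset


class S3ImageDataset(Dataset):
    """Loads images straight from S3; safe to use with num_workers > 1."""

    def __init__(self, bucket, keys, endpoint_url, access_key, secret_key):
        self.bucket = bucket
        self.keys = keys
        # store plain credentials (picklable) instead of a logged-in client
        self._creds = dict(endpoint_url=endpoint_url,
                           aws_access_key_id=access_key,
                           aws_secret_access_key=secret_key)
        self._client = None  # created lazily inside each worker process

    def __getstate__(self):
        state = self.__dict__.copy()
        state["_client"] = None  # never try to pickle a live boto3 client
        return state

    def _client_or_create(self):
        if self._client is None:
            self._client = boto3.client("s3", **self._creds)
        return self._client

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, idx):
        obj = self._client_or_create().get_object(Bucket=self.bucket, Key=self.keys[idx])
        img = Image.open(io.BytesIO(obj["Body"].read())).convert("RGB")
        return img  # apply transforms / return labels here in a real loader
```

With this, each worker process recreates its own client after the pickle round-trip, so wrapping the dataset in a DataLoader with num_workers > 1 should no longer hit the pickling error.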