Closed ca-scribner closed 3 years ago
For general distributed pytorch GPU job status, see #360. For construction-starts-specific details, read below.
I was able to get a fastai v1.58.0/pytorch v1.4 (the versions used by construction starts) job to run in Distributed Data Parallel mode (DDP, the recommended mode for distributing training across GPUs, whether on the same node or on separate nodes) across two GPUs (two nodes of 1 GPU each) on AAW. There were a few hurdles along the way, as discussed below.
In the short term, this does work but takes some effort and is hard without some extra dev privileges (mainly being able to build your own custom images). In the longer term, a little extra dev work on our side would make this very usable (a python/shell launch utility could handle the yaml syntax so people don't need to know it, maybe some tooling to help with data transfer, and some procedures around iterating on code without needing new images to be built).
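To picture what that launch utility could look like: below is a minimal sketch that renders a PyTorchJob spec from a few arguments, so users never touch yaml directly. Everything here (`make_pytorchjob`, the image name, the trimmed-down field set) is hypothetical, just enough to show the shape of the idea; a real version would also handle volumes, resources, env vars, etc.

```python
import json

def make_pytorchjob(name, image, n_replicas, command):
    """Build a minimal Kubeflow PyTorchJob spec as a dict (hypothetical
    helper; a real one would also template volumes, GPU resources, etc.)."""
    def replica_spec(replicas):
        return {
            "replicas": replicas,
            "restartPolicy": "OnFailure",
            "template": {"spec": {"containers": [
                {"name": "pytorch", "image": image, "command": command}
            ]}},
        }
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {"pytorchReplicaSpecs": {
            "Master": replica_spec(1),
            "Worker": replica_spec(n_replicas - 1),  # master counts as one replica
        }},
    }

job = make_pytorchjob("demo-ddp", "myregistry/trainer:latest", 2,
                      ["python", "main.py"])
# dump to json (valid yaml) and hand it to kubectl, no yaml authoring needed
print(json.dumps(job, indent=2))
```

A wrapper like this could then shell out to `kubectl create -f -` with the rendered spec on stdin.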
Whether distributed training actually reduces training times (eg: whether jobs scale with more GPUs) is another question and very problem specific :)
A generic walkthrough of distributed training is under construction in our docs (note: this link points to a branch because the walkthrough hasn't merged into master yet; if you get a file-not-found error, check the master branch at /docs/en/1-Experiments/Distributed-Training.md).
This link describes things more, but I've also uploaded some examples to the Construction Starts Modelling repo: a `.py` file for training across multiple nodes. It requires the code to be built into a docker image (there could be some workarounds - let me know if you actually want to use it), and we `kubectl create` the yaml file to build the PyTorchJob that's managed by Kubernetes.

The big roadblocks for using this regularly are:
- **Data transfer:** data needs to get to each worker, either in the training script itself (eg: a `data_download()` that grabs it from MinIO) or as some `mc` calls in a `.sh` launch script. You could also pre-load a PVC with the data, then attach it and access it through the file system (that is what the above demo shows), but because we use ReadWriteOnce (RWO) storage (eg: a volume can only mount to a single node at a time) this doesn't scale past two nodes total. Each node needs a unique PVC due to RWO, and there's no easy way to tell workers to get their own PVCs (note in the above yaml how we specify the same PVC for all workers).
- **Code iteration:** the training code must be baked into the docker image. One workaround is to have the image's entrypoint pull the code at runtime, eg:

  ```sh
  git clone REPO_FROM_ARGS /tmp/downloaded_repo
  cd /tmp/downloaded_repo
  pip install -r requirements.txt
  python main.py ALL_OTHER_ARGS
  ```

  This way you can iterate on the code in the repo without having to change the Docker image much.
- **Privileges:** launching these jobs requires `kubectl`. Same with monitoring them (describing pods/PyTorchJob objects, checking logs, etc).

I definitely saw speedup for some cases, but it was situational. For toy problems I saw near-linear speedup, but for the construction starts project it was closer to 0.25-0.5x linear (eg: 2 GPUs is 1.25-1.5x as fast as 1 GPU). This probably could have been improved by some learning rate tweaking (GPU scaling like this effectively scales the batch_size, which would let you use a larger learning rate). I played with it a little but not a ton.
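For reference, the usual heuristic behind that learning rate tweak is the linear scaling rule of thumb (it's a heuristic, not a guarantee for every model): N DDP workers multiply the effective batch size by N, so you scale the base learning rate by N as well. A tiny sketch:

```python
def scaled_hyperparams(base_lr, per_gpu_batch_size, n_gpus):
    """Linear scaling rule of thumb: DDP multiplies the effective batch
    size by the number of workers, so scale the learning rate to match."""
    return {
        "effective_batch_size": per_gpu_batch_size * n_gpus,
        "lr": base_lr * n_gpus,
    }

# 2 GPUs doubles the effective batch size, so double the base lr
print(scaled_hyperparams(1e-3, 64, 2))
```

Whether the scaled lr is actually stable is problem-specific, which is probably part of why the construction starts speedup was sub-linear without more tuning.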
cc @chritter
Along the way, while trying to find a better way to handle transferring training data to each worker, I also set up a trainer that trains directly from S3 storage rather than from local files. The objective there was to make it so you never need to care where your files are.
The effort was a partial success, but with more work I think it could be fully successful. I successfully trained directly from S3 (see this example). The only problem is that the current implementation cannot be parallelized. Usually these data loaders get their speed through parallelization (I see your typical setup has data loader workers loading your images to feed the GPU), but the way they parallelize breaks the way I passed an S3 client to the loader*. So the current demo trains slowly, but that should be fixable by solving the parallel data loader problem. We could also add things like local caching of images if it became a real problem.
* the data object pickles itself and passes that pickle to each parallel loader. But a logged-in `boto3` client cannot be pickled! So if we set `num_workers>1` atm, it'll break with a funny error. I think the way around it is to pass credentials rather than a logged-in `boto3` client, and have the loader log itself in when it starts fetching data. Should be an easy fix (I think there are some fixtures in the data loaders that help with this) but I didn't have time to do it.
** I think this is easier in newer versions of fast.ai/pytorch. The older APIs made this all a bit harder
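The credentials-not-client fix sketched above boils down to a lazy-login pattern like the one below. `S3ImageSource`, its fields, and the stub `fake_client` factory are all hypothetical (the stub stands in for `boto3.client("s3", ...)` so the sketch stays self-contained); the point is only that credentials pickle fine while the live client is rebuilt per worker.

```python
import pickle

class S3ImageSource:
    """Picklable data source: stores credentials, not a live client.
    Each DataLoader worker unpickles this and logs itself in lazily."""
    def __init__(self, bucket, credentials, client_factory):
        self.bucket = bucket
        self.credentials = credentials
        self._client_factory = client_factory
        self._client = None  # created lazily, never pickled

    @property
    def client(self):
        if self._client is None:  # first access in this process: log in
            self._client = self._client_factory(**self.credentials)
        return self._client

    def __getstate__(self):
        state = self.__dict__.copy()
        state["_client"] = None  # drop the unpicklable live client
        return state

def fake_client(**creds):
    """Stub standing in for a real boto3.client('s3', ...) call."""
    return object()

src = S3ImageSource("my-bucket", {"key": "k", "secret": "s"}, fake_client)
_ = src.client                            # main process logs in
clone = pickle.loads(pickle.dumps(src))   # survives the trip to a worker
assert clone._client is None              # worker starts logged out...
assert clone.client is not None           # ...and logs itself in on first use
```

With this shape, `num_workers>1` should stop tripping over the pickle step, since each worker re-creates its own client from the credentials.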
Goals: