Closed ca-scribner closed 3 years ago
For general distributed pytorch GPU job status, see #360. For construction-starts-specific details, read below.
I was able to get a fastai v1.58.0/pytorch v1.4 (the versions used by construction starts) job to run in Distributed Data Parallel mode (DDP, the recommended mode for distributing training across GPUs, whether on the same node or on separate nodes) across two GPUs (two nodes of 1 GPU each) on AAW. There were a few hurdles along the way, as discussed below.
In the short term, this does work but takes some effort and is hard without some extra dev privileges (mainly being able to build your own custom images). In the longer term, a little extra dev work on our side would make this very usable (a python/shell launch utility could handle the yaml syntax so people don't need to know it, maybe some tooling to help with data transfer, and some procedures around iterating on code without needing new images to be built).
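To picture what that launch utility could look like: below is a minimal sketch that renders a PyTorchJob spec from a few arguments, so users never touch yaml directly. Everything here (`make_pytorchjob`, the image name, the trimmed-down field set) is hypothetical, just enough to show the shape of the idea; a real version would also handle volumes, resources, env vars, etc.

```python
import json

def make_pytorchjob(name, image, n_replicas, command):
    """Build a minimal Kubeflow PyTorchJob spec as a dict (hypothetical
    helper; a real one would also template volumes, GPU resources, etc.)."""
    def replica_spec(replicas):
        return {
            "replicas": replicas,
            "restartPolicy": "OnFailure",
            "template": {"spec": {"containers": [
                {"name": "pytorch", "image": image, "command": command}
            ]}},
        }
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {"pytorchReplicaSpecs": {
            "Master": replica_spec(1),
            "Worker": replica_spec(n_replicas - 1),  # master counts as one replica
        }},
    }

job = make_pytorchjob("demo-ddp", "myregistry/trainer:latest", 2,
                      ["python", "main.py"])
# dump to json (valid yaml) and hand it to kubectl, no yaml authoring needed
print(json.dumps(job, indent=2))
```

A wrapper like this could then shell out to `kubectl create -f -` with the rendered spec on stdin.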
Whether distributed training actually reduces training times (eg: whether jobs scale with more GPUs) is another question and very problem specific :)
A generic walkthrough of distributed training is under construction in our docs (note: this link points to a branch because the walkthrough hasn't merged into master yet; if you get a file-not-found error, check the master branch at /docs/en/1-Experiments/Distributed-Training.md).
This link describes things more, but I've also uploaded some examples to the Construction Starts Modelling repo: a `.py` file for training across multiple nodes. It requires the code to be built into a docker image (there could be some workarounds - let me know if you actually want to use it), and we `kubectl create` the yaml file to build the PyTorchJob that's managed by Kubernetes.

The big roadblocks for using this regularly are:
- **Data transfer:** data needs to get to each worker, either in the training script itself (eg: a `data_download()` that grabs it from MinIO) or as some `mc` calls in a `.sh` launch script. You could also pre-load a PVC with the data, then attach it and access it through the file system (that is what the above demo shows), but because we use ReadWriteOnce (RWO) storage (eg: a volume can only mount to a single node at a time) this doesn't scale past two nodes total. Each node needs a unique PVC due to RWO, and there's no easy way to tell workers to get their own PVCs (note in the above yaml how we specify the same PVC for all workers).
- **Code iteration:** the training code must be baked into the docker image. One workaround is to have the image's entrypoint pull the code at runtime, eg:

  ```sh
  git clone REPO_FROM_ARGS /tmp/downloaded_repo
  cd /tmp/downloaded_repo
  pip install -r requirements.txt
  python main.py ALL_OTHER_ARGS
  ```

  This way you can iterate on the code in the repo without having to change the Docker image much.
- **Privileges:** launching these jobs requires `kubectl`. Same with monitoring them (describing pods/PyTorchJob objects, checking logs, etc).

I definitely saw speedup for some cases, but it was situational. For toy problems I saw near-linear speedup, but for the construction starts project it was closer to 0.25-0.5x linear (eg: 2 GPUs is 1.25-1.5x as fast as 1 GPU). This probably could have been improved by some learning rate tweaking (GPU scaling like this effectively scales the batch_size, which would let you use a larger learning rate). I played with it a little but not a ton.
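For reference, the usual heuristic behind that learning rate tweak is the linear scaling rule of thumb (it's a heuristic, not a guarantee for every model): N DDP workers multiply the effective batch size by N, so you scale the base learning rate by N as well. A tiny sketch:

```python
def scaled_hyperparams(base_lr, per_gpu_batch_size, n_gpus):
    """Linear scaling rule of thumb: DDP multiplies the effective batch
    size by the number of workers, so scale the learning rate to match."""
    return {
        "effective_batch_size": per_gpu_batch_size * n_gpus,
        "lr": base_lr * n_gpus,
    }

# 2 GPUs doubles the effective batch size, so double the base lr
print(scaled_hyperparams(1e-3, 64, 2))
```

Whether the scaled lr is actually stable is problem-specific, which is probably part of why the construction starts speedup was sub-linear without more tuning.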
cc @chritter
Along the way, while trying to find a better way to handle transferring training data to each worker, I also set up a trainer that trains directly from S3 storage rather than from local files. The objective there was to make it so you never need to care where your files are.
The effort was a partial success, but with more work I think it could be fully successful. I successfully trained directly from S3 (see this example). The only problem is that the current implementation cannot be parallelized. Usually these data loaders get their speed through parallelization (I see your typical setup has data loader workers loading your images to feed the GPU), but the way they parallelize breaks the way I passed an S3 client to the loader*. So the current demo trains slowly, but that should be fixable by solving the parallel data loader problem. We could also add things like local caching of images if it became a real problem.
* the data object pickles itself and passes that pickle to each parallel loader. But a logged-in `boto3` client cannot be pickled! So if we set `num_workers>1` atm, it'll break with a funny error. I think the way around it is to pass credentials rather than a logged-in `boto3` client, and have the loader log itself in when it starts fetching data. Should be an easy fix (I think there are some fixtures in the data loaders that help with this) but I didn't have time to do it.
** I think this is easier in newer versions of fast.ai/pytorch. The older APIs made this all a bit harder
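The credentials-not-client fix sketched above boils down to a lazy-login pattern like the one below. `S3ImageSource`, its fields, and the stub `fake_client` factory are all hypothetical (the stub stands in for `boto3.client("s3", ...)` so the sketch stays self-contained); the point is only that credentials pickle fine while the live client is rebuilt per worker.

```python
import pickle

class S3ImageSource:
    """Picklable data source: stores credentials, not a live client.
    Each DataLoader worker unpickles this and logs itself in lazily."""
    def __init__(self, bucket, credentials, client_factory):
        self.bucket = bucket
        self.credentials = credentials
        self._client_factory = client_factory
        self._client = None  # created lazily, never pickled

    @property
    def client(self):
        if self._client is None:  # first access in this process: log in
            self._client = self._client_factory(**self.credentials)
        return self._client

    def __getstate__(self):
        state = self.__dict__.copy()
        state["_client"] = None  # drop the unpicklable live client
        return state

def fake_client(**creds):
    """Stub standing in for a real boto3.client('s3', ...) call."""
    return object()

src = S3ImageSource("my-bucket", {"key": "k", "secret": "s"}, fake_client)
_ = src.client                            # main process logs in
clone = pickle.loads(pickle.dumps(src))   # survives the trip to a worker
assert clone._client is None              # worker starts logged out...
assert clone.client is not None           # ...and logs itself in on first use
```

With this shape, `num_workers>1` should stop tripping over the pickle step, since each worker re-creates its own client from the credentials.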
Goals: