I'm coming to the point where I want to do lots of training runs in parallel. I've got a lil bit of vast.ai orchestration code already that I'm pretty happy with, but it's a long way short of 'easy to use'.
I could go find a big ol map-reduce framework, but it seems a bit much for my pretty trivial use case of 'separate job on every GPU, pull the output folders back to the master'. So! Roll-my-own time:
Principles
- Use SSH and rsync.
- Keep all the state in files and all the processes short-running, so it's innately robust to the user's machine crashing.
- Use JSON files so it's all transparent and easy to hack on.
- No-gos:
  - Machine setup
  - Experiment organisation
  - Jobs that span more than one machine
Submission
- The user runs a submission script.
- The user passes a command to run, along with an optional directory path and an optional hardware-requirements dict.
- At submission time, the directory path gets zipped up into an archive; this'll be dropped into the working dir at runtime.
  - Obey .gitignore, maybe modulo some user-passed includes/excludes?
  - By swapping out files in the dir each time they submit, the user can set up various experiments.
- The passed parameters get appended to a state.json file, along with a note saying it hasn't been dispatched yet. (Both script and schema are sketched below.)
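To make that concrete, here's a rough sketch of the submission script. Everything in it - the `submit()` signature, the state.json schema, the `archives/` directory - is made up for illustration, and the .gitignore handling leans on `git ls-files`, which only works if the dir is a git repo:

```python
import json
import subprocess
import uuid
import zipfile
from pathlib import Path

STATE = Path('state.json')

def archive(dirpath):
    # Obey .gitignore by asking git which files it'd track; --others picks
    # up untracked-but-not-ignored files too. Assumes dirpath is a git repo.
    files = subprocess.check_output(
        ['git', 'ls-files', '--cached', '--others', '--exclude-standard'],
        cwd=dirpath, text=True).splitlines()
    path = Path('archives') / f'{uuid.uuid4().hex}.zip'
    path.parent.mkdir(exist_ok=True)
    with zipfile.ZipFile(path, 'w') as zf:
        for f in files:
            zf.write(Path(dirpath) / f, arcname=f)
    return str(path)

def submit(command, dirpath=None, resources=None):
    # Append the job to state.json with a note it hasn't been dispatched yet.
    jobs = json.loads(STATE.read_text()) if STATE.exists() else []
    jobs.append({
        'id': uuid.uuid4().hex,
        'command': command,
        'archive': archive(dirpath) if dirpath else None,
        'resources': resources or {},   # eg {'gpus': 1, 'memory': 16}
        'status': 'fresh'})             # fresh -> active -> dead
    STATE.write_text(json.dumps(jobs, indent=2))

submit('python train.py --lr 1e-3', dirpath='.', resources={'gpus': 1})
```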
Dispatch
- The user runs a manager script.
- Manager starts up, reads a machines.json config file giving it machine locations and credentials, and queries them for their available resources.
  - Up to the user to write the machines.json file out (a possible format is sketched after this list).
  - Might need a dummy entry for the local machine.
- Manager checks the state.json file to see what jobs might already be running, and checks the state of those remote processes. If they're still running, deduct their reqs from the available resources on that machine.
  - If not, mark the submission completed.
- Manager checks the state.json file to see if there are any jobs that could be run on these machines.
- If there are, the manager uses rsync to copy the archive over and SSH to unzip it and run the command (see the launch sketch below).
  - Stdout/stderr are piped to files.
    - Maybe this should be left up to the user? It's intrusive to take control of this, but risky not to.
- Manager adds a note of the machine ID and process ID for the job to the state.json file, so it can find it in future, along with a note that the job's active.
- When all the resources are consumed or all the jobs are submitted - whichever happens sooner - the manager shuts down. Should only take a few seconds end-to-end.
  - Add some sugar for running it in a loop.
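Here's one possible shape for that machines.json file. The field names are all made up, but it's roughly the minimum the manager needs: how to SSH in, and what's on offer.

```json
{
    "local": {"connection": null, "resources": {"gpus": 2, "memory": 64}},
    "vast-1": {
        "connection": {"host": "ssh4.vast.ai", "port": 10738, "user": "root"},
        "resources": {"gpus": 8, "memory": 128}
    }
}
```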
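And a sketch of the manager's two core operations: checking whether a previously-dispatched job is still alive, and launching a fresh one. The `ssh_args` glue, the `jobs/` layout on the remote, and the state.json fields are all assumptions carried over from the sketches above.

```python
import subprocess

def ssh_args(machine):
    c = machine['connection']
    return ['ssh', '-p', str(c['port']), f"{c['user']}@{c['host']}"]

def is_alive(machine, pid):
    # kill -0 tests for existence without actually signalling the process.
    # Caveat: an unreachable machine looks the same as a dead job here.
    return subprocess.run(ssh_args(machine) + [f'kill -0 {pid}'],
                          capture_output=True).returncode == 0

def launch(name, machine, job):
    c = machine['connection']
    # Make the job's dir and rsync the archive over.
    subprocess.run(ssh_args(machine) + [f"mkdir -p jobs/{job['id']}"], check=True)
    subprocess.run(['rsync', '-e', f"ssh -p {c['port']}", job['archive'],
                    f"{c['user']}@{c['host']}:jobs/{job['id']}/archive.zip"],
                   check=True)
    # Unzip, launch under nohup with stdout/stderr piped to files, and echo
    # the PID back so it can be recorded in state.json.
    script = (f"cd jobs/{job['id']} && unzip -oq archive.zip\n"
              f"nohup {job['command']} >out.log 2>err.log </dev/null &\n"
              "echo $!")
    pid = subprocess.check_output(ssh_args(machine) + [script], text=True)
    job.update(status='active', machine=name, pid=int(pid))
```

The "sugar for running it in a loop" can probably just be `watch -n 60 python manager.py` until it deserves better.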
Monitoring
- The user runs a monitoring script.
- Monitor loops through the active jobs based on state.json, rsyncs back the stdout + displays it in the terminal (sketched below).
  - Maybe should let the user specify log file/output locations when they submit the job?
- Monitor checks what PIDs should be active, and alerts if one's disappeared.
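A sketch of the monitor's loop, under the same made-up state.json schema, and assuming the `ssh_args`/`is_alive` helpers from the dispatch sketch are in scope:

```python
import subprocess
from pathlib import Path

def poll(jobs, machines):
    # jobs is the parsed state.json; machines is the parsed machines.json.
    for job in [j for j in jobs if j['status'] == 'active']:
        machine = machines[job['machine']]
        c = machine['connection']
        local = Path('logs') / job['id']
        local.mkdir(parents=True, exist_ok=True)
        # Pull the job's stdout back...
        subprocess.run(['rsync', '-e', f"ssh -p {c['port']}",
                        f"{c['user']}@{c['host']}:jobs/{job['id']}/out.log",
                        str(local)], check=True)
        # ...show the tail of it in the terminal...
        print(f"=== {job['id']} ===")
        print((local / 'out.log').read_text()[-2000:])
        # ...and alert if the PID's disappeared.
        if not is_alive(machine, job['pid']):
            print(f"!!! {job['id']}'s process has vanished")
```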