The first and simpler approach for HPC integration is to use ssh access and ssh keys so our app user can log in to the cluster as individual users and start the slurm job as them.

Note that CAS integration (included in #10) is a prerequisite for this.

## implementation details

- [ ] request access to ssh without duo from test vm to hpc machine
- [ ] ensure ssh access from vm to hpc (may require PUL .lib domain firewall change)
- [ ] add a vaulted ssh key to deploy and write instructions for adding it to authorized keys on the hpc machine
- [ ] write remote equivalents of the gpu celery tasks to kick off training jobs: export the needed data/model, use scp/rsync to transfer files, ssh in as the current user, and start the slurm job (see the first sketch after this list)
- [ ] modify escriptorium to call our remote version of the task instead of running locally (think about how to make this configurable, but this version doesn't have to be elegant)
- [ ] implement a method to check the status of the remote slurm job (second sketch below)
- [ ] modify escriptorium task monitoring to handle the remote slurm job (third sketch below)
- [ ] when the job completes, update the refined model back in escriptorium and report on status (fourth sketch below)
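A minimal sketch of the kickoff step, assuming the hostname, key path, remote directory, and `train.slurm` batch script are all placeholders to be filled in once access is sorted out; it shells out to `rsync`/`ssh` rather than committing to a particular ssh library:

```python
import subprocess

HPC_HOST = "hpc.example.edu"  # placeholder for the actual cluster hostname
SSH_KEY = "/app/.ssh/id_escriptorium"  # the vaulted deploy key

def submit_training_job(username: str, export_dir: str, remote_dir: str) -> str:
    """Copy exported data/model to the cluster and submit a slurm job
    as the given user; returns the slurm job id."""
    dest = f"{username}@{HPC_HOST}"
    # transfer the exported training data and base model to the cluster
    subprocess.run(
        ["rsync", "-az", "-e", f"ssh -i {SSH_KEY}",
         f"{export_dir}/", f"{dest}:{remote_dir}/"],
        check=True,
    )
    # submit the batch job; --parsable makes sbatch print only the job id
    result = subprocess.run(
        ["ssh", "-i", SSH_KEY, dest,
         "sbatch", "--parsable", f"{remote_dir}/train.slurm"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()
```

Returning the job id is the important design point: it is what the status check and task monitoring below would key off.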
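For the status check, one possible shape, reusing the `HPC_HOST`/`SSH_KEY` placeholders from the sketch above. `sacct` is used rather than `squeue` because it also reports jobs that have already finished, though it does assume slurm accounting is enabled on the cluster:

```python
import subprocess

def slurm_job_state(username: str, job_id: str) -> str:
    """Query the state of a remote slurm job over ssh."""
    result = subprocess.run(
        ["ssh", "-i", SSH_KEY, f"{username}@{HPC_HOST}",
         "sacct", "-j", job_id, "-X",  # -X: report the allocation, not each step
         "--format=State", "--noheader", "--parsable2"],
        check=True, capture_output=True, text=True,
    )
    out = result.stdout.strip()
    return out.splitlines()[0] if out else "UNKNOWN"
```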
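For task monitoring, a sketch of a self-rescheduling celery task that polls the state function above until the job reaches a terminal state; `handle_job_completion` is a hypothetical follow-up task (one possible version is sketched next), and the polling interval is arbitrary:

```python
from celery import shared_task

TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT", "OUT_OF_MEMORY"}

@shared_task
def monitor_remote_job(username: str, job_id: str):
    """Poll the remote slurm job and reschedule until it finishes."""
    # sacct reports e.g. "CANCELLED by 1234"; keep only the state word
    state = slurm_job_state(username, job_id).split()[0]
    if state in TERMINAL_STATES:
        handle_job_completion.delay(username, job_id, state)
    else:
        # still pending or running: check again in a minute
        monitor_remote_job.apply_async(args=[username, job_id], countdown=60)
```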
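Finally, a sketch of the completion step under the same assumptions; the remote output path is a placeholder that would be agreed with the slurm batch script, and the model-record update is elided because it depends on eScriptorium internals:

```python
import subprocess

from celery import shared_task

@shared_task
def handle_job_completion(username: str, job_id: str, state: str):
    """Pull the refined model back and record the outcome in escriptorium."""
    if state == "COMPLETED":
        # placeholder remote path for the refined model produced by the job
        remote_path = f"scratch/escriptorium/{job_id}/model.mlmodel"
        subprocess.run(
            ["rsync", "-az", "-e", f"ssh -i {SSH_KEY}",
             f"{username}@{HPC_HOST}:{remote_path}", "/app/media/models/"],
            check=True,
        )
        # update the corresponding model record in escriptorium here
    # in either case, report the final state back through task monitoring
```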