Collinbrown95 opened this issue 2 years ago
OpenMPI support is important for enabling large/complex workflows through OpenM++ (the modelling software used by a few teams, replacing Modgen). While it may be used in development, users are generally unlikely to submit jobs at the command line. They will most likely use the OpenM++ web UI to run model simulations and gather results.
There is an example of using OpenMPI as a non-root user that can likely serve as inspiration for adapting the out-of-the-box scripts triggered to launch jobs from the web service.
Candidate PRs for adding compute-optimized node pool (short-term solution):
PR to terraform-azure-statcan-aaw-environment
- add compute optimized node pools
PR to terraform-advanced-analytics-workspace-infrastructure
- instantiate node pool in dev cluster (could make a similar PR to instantiate in prod)
PR to aaw-toleration-injector
- update the toleration injector to give OncoSim pods a toleration for the `node.statcan.gc.ca/use=cpu-72` taint
Edit: the toleration injector should match on a name prefix plus the namespace the pod is scheduled to, e.g. look for the prefix `cpu-big` in the pod name and then add the appropriate toleration.
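The edited matching rule above can be sketched as follows. The real injector would be a mutating admission webhook; this only models the decision logic. The toleration field names follow the Kubernetes API, the `cpu-big` prefix and `node.statcan.gc.ca/use=cpu-72` taint come from this thread, and the `NoSchedule` effect is an assumption (the thread does not state the taint's effect):

```python
# Sketch of the proposed toleration-injector rule: pods whose name starts
# with "cpu-big" get a toleration for the node.statcan.gc.ca/use=cpu-72 taint.

CPU_BIG_PREFIX = "cpu-big"

CPU_72_TOLERATION = {
    "key": "node.statcan.gc.ca/use",
    "operator": "Equal",
    "value": "cpu-72",
    "effect": "NoSchedule",  # assumed effect; not stated in the issue
}

def tolerations_to_inject(pod_name: str) -> list[dict]:
    """Return the tolerations the injector should add for this pod name."""
    if pod_name.startswith(CPU_BIG_PREFIX):
        return [CPU_72_TOLERATION]
    return []
```

Keying on a name prefix (rather than hard-coding OncoSim) keeps the injector generic for any workload that opts into the compute-optimized pool.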
CC: @chuckbelisle @YannCoderre
@brendangadd can you approve the use of VM size "Standard_F72s_v2"
@Collinbrown95 Yep, that VM size is fine. Re. implementation, I'll add some comments to the injector PR.
Work for high-CPU node pool moved to https://github.com/StatCan/daaas/issues/1193
OpenM++ now supports elastic resource management to start and stop cloud servers/clusters/nodes/etc. on demand. It invokes a shell script, batch file, or executable of your choice to start a resource (server/cluster/node/etc.) when a user wants to run the model, and to stop the resource when the model run queue is empty.
I am not sure, but it may be possible to resolve this issue with a couple of shell scripts. This is already deployed in the cloud for OncoSimX / HPVMM customers; please take a look if you are interested.
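As a concrete illustration of the start/stop hooks described above, here is a minimal sketch. The `az aks nodepool scale` Azure CLI command is real, but the resource group, cluster, and pool names are hypothetical, and the sketch only builds the command line instead of executing it so it stays runnable anywhere:

```python
# Hypothetical start/stop hooks that OpenM++ elastic resource management could
# invoke: scale an AKS node pool up when a model run is queued, and back down
# to zero when the run queue is empty.

def scale_pool_cmd(resource_group: str, cluster: str, pool: str, count: int) -> list[str]:
    """Build the Azure CLI command to scale a node pool to `count` nodes."""
    return [
        "az", "aks", "nodepool", "scale",
        "--resource-group", resource_group,
        "--cluster-name", cluster,
        "--name", pool,
        "--node-count", str(count),
    ]

def on_run_queued() -> list[str]:
    # "start" hook: bring up one compute-optimized node (names are made up)
    return scale_pool_cmd("aaw-dev", "aaw-dev-cluster", "cpu72", 1)

def on_queue_empty() -> list[str]:
    # "stop" hook: scale the pool back down to zero
    return scale_pool_cmd("aaw-dev", "aaw-dev-cluster", "cpu72", 0)
```

In a real deployment the hooks would run the command (e.g. via `subprocess.run`) and OpenM++ would call them from its run-queue logic.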
Is your feature request related to a problem? Please link issue ticket
The problem is that OpenM++ models take an unacceptable amount of time to run with the current solution.
Describe the solution you'd like
Want users to be able to schedule OpenMPI jobs natively on Kubernetes. This effectively enables users to scale jobs with OpenM++.
Proposed Solution
This repo (https://github.com/everpeace/kube-openmpi#run-kube-openmpi-cluster-as-non-root-user) has an implementation of OpenMPI on Kubernetes. Some prior work on this has already been done here: https://github.com/Collinbrown95/kube-openmpp — in particular, the `mpiexec` entrypoint runs successfully (TODO).
Describe alternatives you've considered
This problem can be temporarily resolved by adding big VMs to the AKS node pool. However, this has a number of drawbacks that make it unsustainable and not user-friendly in the long term.
Additional context
Today, there is a remote desktop instance with OpenM++ installed that is fine for development purposes. However, when running these models at scale, the size of the underlying node pool VMs limits how fast they can run. Specifically, the OpenM++ workloads are CPU-bound and highly parallelizable, but the maximum size of a notebook server is currently ~15 vCPUs because the underlying nodes only have 16 vCPUs.
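To make the "CPU-bound and highly parallelizable" point concrete, here is a quick Amdahl's-law estimate of what moving from a 15-vCPU notebook server to a 72-vCPU compute-optimized node could buy. The 95% parallel fraction is an assumption for illustration, not a measured property of OpenM++ models:

```python
# Amdahl's-law sketch: estimated speedup of a mostly-parallel model run
# on a 15-vCPU notebook server vs. a 72-vCPU compute-optimized node.

def amdahl_speedup(p: float, n: int) -> float:
    """Speedup on n processors when fraction p of the work is parallel."""
    return 1.0 / ((1.0 - p) + p / n)

p = 0.95  # assumed parallel fraction, for illustration only
print(round(amdahl_speedup(p, 15), 1))  # ~8.8x on a 15-vCPU server
print(round(amdahl_speedup(p, 72), 1))  # ~15.8x on a 72-vCPU node
```

Even with a serial remainder, the larger node roughly doubles throughput per run; the closer the model is to fully parallel, the closer the gain gets to the 72/15 ratio.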
Importantly, the users who require OpenM++ at scale are researchers, so the solution must offer a non-programming interface; OpenM++ provides this with its graphical user interface out of the box.