StatCan / aaw

Documentation for the Advanced Analytics Workspace Platform
https://statcan.github.io/aaw/

Properly figure out OpenMPI on Kubernetes #1072

Open Collinbrown95 opened 2 years ago

Collinbrown95 commented 2 years ago

Is your feature request related to a problem? Please link issue ticket

The problem is that OpenM++ models take an unacceptable amount of time to run with the current solution.

Describe the solution you'd like

We want users to be able to schedule OpenMPI jobs natively on Kubernetes. This would effectively enable users to scale out OpenM++ model runs.

Proposed Solution

This repo (https://github.com/everpeace/kube-openmpi#run-kube-openmpi-cluster-as-non-root-user) has an implementation of OpenMPI on Kubernetes. Some prior work on this has already been done in https://github.com/Collinbrown95/kube-openmpp - in particular:

TODO
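
For reference, here is a minimal sketch of the kind of objects such a setup needs: a headless Service plus a StatefulSet of worker pods that an mpiexec launcher can reach over stable DNS names, all running as a non-root user. This uses the Kubernetes Python client; the namespace, image name, and sizes are placeholders, not anything taken from kube-openmpi itself.

```python
# Sketch only: headless Service + StatefulSet of MPI worker pods.
# Namespace, image, and resource sizes below are assumptions.
from kubernetes import client, config

NAMESPACE = "openmpp"                          # assumed namespace
WORKER_IMAGE = "openmpp/openmpp-mpi:latest"    # hypothetical image with OpenMPI + sshd

def mpi_worker_objects(name: str, replicas: int, cpus: int):
    labels = {"app": name, "role": "mpi-worker"}
    svc = client.V1Service(
        metadata=client.V1ObjectMeta(name=name, namespace=NAMESPACE),
        # Headless service: gives each worker pod a stable DNS name.
        spec=client.V1ServiceSpec(cluster_ip="None", selector=labels),
    )
    container = client.V1Container(
        name="worker",
        image=WORKER_IMAGE,
        # Workers only need to accept ssh from the launcher; assumes the image
        # configures sshd to run on an unprivileged port as a non-root user.
        command=["/usr/sbin/sshd", "-D", "-e"],
        resources=client.V1ResourceRequirements(
            requests={"cpu": str(cpus)}, limits={"cpu": str(cpus)}
        ),
        security_context=client.V1SecurityContext(run_as_user=1000, run_as_non_root=True),
    )
    sts = client.V1StatefulSet(
        metadata=client.V1ObjectMeta(name=name, namespace=NAMESPACE),
        spec=client.V1StatefulSetSpec(
            service_name=name,
            replicas=replicas,
            selector=client.V1LabelSelector(match_labels=labels),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels=labels),
                spec=client.V1PodSpec(containers=[container]),
            ),
        ),
    )
    return svc, sts

if __name__ == "__main__":
    config.load_kube_config()
    svc, sts = mpi_worker_objects("omx-mpi", replicas=4, cpus=8)
    client.CoreV1Api().create_namespaced_service(NAMESPACE, svc)
    client.AppsV1Api().create_namespaced_stateful_set(NAMESPACE, sts)
```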

Describe alternatives you've considered

This problem can be temporarily resolved by adding big VMs to the AKS node pool. However, this has a number of drawbacks that make it unsustainable/not user friendly in the long term.

  1. Users would need to shut down / start up their notebook servers every day to avoid a prohibitively expensive cloud bill (e.g. leaving a pod with 100 vCPUs allocated idle for 24 hours).
  2. Starting up a notebook server every day to run experiments would be time-consuming and waste several hours per week for researchers (e.g. ~10 minutes for the pod to be scheduled, plus start-up time to log in and get their environment set up).
  3. Possibly additional latency to allocate a large VM to the cluster (it may take longer to add a large, uncommon VM than a more common commodity VM).

Additional context

Today, there is a remote desktop instance with OpenM++ installed, which is fine for development purposes. However, when running these models at scale, the size of the underlying node pool VMs limits how fast they can run. Specifically, the OpenM++ workloads are CPU-bound and highly parallelizable, but the maximum size of a notebook server is currently ~15 vCPUs because the underlying nodes only have 16 vCPUs.

Importantly, the users who need OpenM++ at scale are researchers, so a non-programming interface is required; OpenM++ provides this out of the box with its graphical user interface.

goatsweater commented 2 years ago

OpenMPI support is important for enabling large/complex workflows through OpenM++ (the modelling software used by a few teams; it replaces Modgen). While it may be used in development, users are generally unlikely to submit jobs at the command line. They will most likely be using the OpenM++ web UI to run model simulations and gather results.

There is an example of using OpenMPI as a non-root user, which can likely serve as inspiration for adapting the out-of-the-box scripts that the web service triggers to launch jobs.
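
As a hedged sketch of what such an adapted launch script might look like (these are not OpenM++'s actual scripts; the service/pod names follow the worker sketch above and are assumptions), a small wrapper could build a hostfile from the worker pods' DNS names and call mpiexec as the current non-root user:

```python
#!/usr/bin/env python3
# Hypothetical wrapper the web service could invoke instead of a bare mpiexec call.
# Worker count, slots, and service/namespace names are placeholders.
import subprocess
import sys

WORKERS = 4
SLOTS_PER_WORKER = 8
SERVICE = "omx-mpi"
NAMESPACE = "openmpp"
HOSTFILE = "/tmp/mpi_hostfile"

def write_hostfile() -> None:
    # StatefulSet pods behind a headless service get predictable DNS names.
    with open(HOSTFILE, "w") as f:
        for i in range(WORKERS):
            host = f"{SERVICE}-{i}.{SERVICE}.{NAMESPACE}.svc.cluster.local"
            f.write(f"{host} slots={SLOTS_PER_WORKER}\n")

def main() -> int:
    write_hostfile()
    model_cmd = sys.argv[1:]          # e.g. the model executable and its options
    cmd = [
        "mpiexec",
        "--hostfile", HOSTFILE,
        "-n", str(WORKERS * SLOTS_PER_WORKER),
        *model_cmd,
    ]
    # Runs as whatever (non-root) user the calling web service runs as.
    return subprocess.call(cmd)

if __name__ == "__main__":
    sys.exit(main())
```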

Collinbrown95 commented 2 years ago

Candidate PRs for adding a compute-optimized node pool (short-term solution):

Edit: the toleration injector should use a name prefix plus the namespace the pod is scheduled to, e.g. look for a cpu-big prefix in the pod name and then add the appropriate toleration.
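
As a rough illustration of that injector logic (the prefix, allowed namespaces, and taint key/value below are placeholders, not the actual AAW configuration), the mutating webhook's patch step could look something like this:

```python
# Sketch of the core decision the toleration injector could make per pod admission
# request. All names here are assumptions for illustration only.
ALLOWED_NAMESPACES = {"user-namespace-1", "user-namespace-2"}  # hypothetical

def toleration_patch(pod_name: str, namespace: str) -> list:
    """Return a JSONPatch adding the compute-optimized toleration, or [] to skip."""
    if namespace not in ALLOWED_NAMESPACES or not pod_name.startswith("cpu-big"):
        return []
    toleration = {
        "key": "node-pool-purpose",   # assumed taint key on the compute-optimized pool
        "operator": "Equal",
        "value": "cpu-big",
        "effect": "NoSchedule",
    }
    # Assumes the pod spec already has a tolerations array to append to.
    return [{"op": "add", "path": "/spec/tolerations/-", "value": toleration}]
```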

Collinbrown95 commented 2 years ago

CC: @chuckbelisle @YannCoderre

@brendangadd can you approve the use of VM size "Standard_F72s_v2"?

brendangadd commented 2 years ago

@Collinbrown95 Yep, that VM size is fine. Re. implementation, I'll add some comments to the injector PR.

Collinbrown95 commented 2 years ago

Work for high-CPU node pool moved to https://github.com/StatCan/daaas/issues/1193

amc1999 commented 2 years ago

OpenM++ now supports elastic resource management to start and stop cloud servers / clusters / nodes / etc. on demand. It invokes a shell script / batch file / executable of your choice to start a resource (server / cluster / node / etc.) when a user wants to run the model, and to stop the resource when the model run queue is empty.

I am not sure, but it may help resolve this issue with a couple of shell scripts. It is already deployed in the cloud for OncoSimX / HPVMM customers; please take a look if you are interested.
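
As a rough idea only (the resource group, cluster, and node pool names below are placeholders), such a start/stop hook on AKS could simply scale the compute-optimized node pool up before a run and back to zero when the queue drains:

```python
#!/usr/bin/env python3
# Hypothetical start/stop hook for OpenM++ elastic resource management on AKS:
# called with "up" before a model run and "down" when the run queue is empty.
# Resource group / cluster / node pool names are assumptions.
import subprocess
import sys

RESOURCE_GROUP = "aaw-prod-rg"   # assumed
CLUSTER = "aaw-prod-aks"         # assumed
NODE_POOL = "cpubig"             # assumed compute-optimized pool
NODES_WHEN_UP = 2

def scale(count: int) -> int:
    # Uses the Azure CLI to resize the node pool.
    return subprocess.call([
        "az", "aks", "nodepool", "scale",
        "--resource-group", RESOURCE_GROUP,
        "--cluster-name", CLUSTER,
        "--name", NODE_POOL,
        "--node-count", str(count),
    ])

if __name__ == "__main__":
    action = sys.argv[1] if len(sys.argv) > 1 else "up"
    sys.exit(scale(NODES_WHEN_UP if action == "up" else 0))
```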