StatCan / aaw

Documentation for the Advanced Analytics Workspace Platform
https://statcan.github.io/aaw/

Investigate / review / setup node selection logic #1970

Open vexingly opened 1 month ago

vexingly commented 1 month ago

The nodepools will be updated in https://github.com/StatCan/aaw/issues/1967 but we will require some logic for node selection.

  1. Keep the existing logic that launches notebooks requesting more than 14 CPU on the "usercpuXXxx" nodes
  2. Add a taint/toleration so that OpenM++ jobs are scheduled onto these larger CPU-based nodes
  3. Investigate other methods for scheduling different scenarios? The UAT will provide more details on how users schedule their jobs that could be helpful for working out a longer term strategy

Existing logic for notebooks is here: https://github.com/StatCan/aaw-toleration-injector/blob/main/mutate.go
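
For readers who have not opened the injector, here is a minimal sketch of what CPU-threshold toleration injection could look like. It is not the code in `mutate.go`: the taint key and value, the helper names, and the choice to sum CPU requests across containers are all assumptions made for illustration. Only the 14 CPU threshold comes from this issue.

```python
import copy

# Threshold taken from the issue text; everything else is hypothetical.
CPU_THRESHOLD = 14
BIG_CPU_TOLERATION = {
    "key": "example.com/node-purpose",  # assumed taint key
    "operator": "Equal",
    "value": "big-cpu",                 # assumed taint value
    "effect": "NoSchedule",
}


def parse_cpu(quantity) -> float:
    """Convert a Kubernetes CPU quantity ("500m", "16", 2) to cores."""
    text = str(quantity)
    if text.endswith("m"):
        return int(text[:-1]) / 1000
    return float(text)


def requested_cpu(pod: dict) -> float:
    """Sum the CPU requests of all containers in a pod manifest."""
    total = 0.0
    for container in pod.get("spec", {}).get("containers", []):
        requests = container.get("resources", {}).get("requests", {})
        total += parse_cpu(requests.get("cpu", 0))
    return total


def inject_toleration(pod: dict) -> dict:
    """Return a copy of the pod, tolerating the big-CPU taint if it asks for more than the threshold."""
    mutated = copy.deepcopy(pod)
    if requested_cpu(mutated) > CPU_THRESHOLD:
        tolerations = mutated["spec"].setdefault("tolerations", [])
        if BIG_CPU_TOLERATION not in tolerations:
            tolerations.append(dict(BIG_CPU_TOLERATION))
    return mutated
```

A notebook pod requesting `"16"` CPU would come back with the toleration appended, while one requesting `"4"` would be returned unchanged.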

jacek-dudek commented 1 month ago

Started looking at the code in the toleration injector controller. Will study it some more, post follow-up questions here, and request comments from Pat and others.

jacek-dudek commented 1 month ago

Studied some alternative methods of node selection. One is based on the nodeSelector field and node labels. Another is based on the affinity field, which allows more expressive conditions on node labels and distinguishes between required and preferred conditions. Finally, there is one based on node taints and corresponding tolerations.
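
To make the comparison concrete, here is a rough sketch of the three mechanisms as pod spec fragments, written as Python dicts that mirror the Kubernetes field names. The label/taint keys and values, the container, and the `pod_spec` helper are placeholders, not anything defined for AAW.

```python
# Assumed label/taint keys and values, for illustration only.
KEY, VALUE = "example.com/node-purpose", "big-cpu"

# 1. nodeSelector: the node must carry exactly these labels.
node_selector = {"nodeSelector": {KEY: VALUE}}

# 2. Node affinity: richer operators, plus a split between hard
#    ("required...") and soft ("preferred...") conditions.
node_affinity = {
    "affinity": {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {"matchExpressions": [
                        {"key": KEY, "operator": "In", "values": [VALUE]}
                    ]}
                ]
            },
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": 50,  # 1-100; higher means stronger preference
                    "preference": {"matchExpressions": [
                        {"key": "example.com/memory-class",  # assumed label
                         "operator": "In", "values": ["high"]}
                    ]},
                }
            ],
        }
    }
}

# 3. Toleration: allows the pod onto nodes tainted KEY=VALUE:NoSchedule.
toleration = {
    "tolerations": [
        {"key": KEY, "operator": "Equal", "value": VALUE, "effect": "NoSchedule"}
    ]
}


def pod_spec(*fragments: dict) -> dict:
    """Merge scheduling fragments into a bare pod spec (placeholder container)."""
    spec = {"containers": [{"name": "main", "image": "busybox"}]}
    for fragment in fragments:
        spec.update(fragment)
    return spec
```

One design point worth keeping in mind: a toleration only permits a pod to land on tainted nodes, it does not steer the pod there. To both reserve the nodes and direct workloads to them, the taint/toleration is usually paired with a nodeSelector or affinity rule, e.g. `pod_spec(node_selector, toleration)`.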

jacek-dudek commented 1 month ago

Pat, could you comment on what types of Kubernetes workloads are expected to be OpenM++ workloads? How will they be distinguished from other workloads? Do we have a set of labels in mind that will be applied to the pod manifests?

And do you prefer a particular node selection method over the others (i.e., toleration injection versus nodeSelector or affinity specified in pod specs)?

vexingly commented 1 month ago

Hi @jacek-dudek, I think there are two workloads to consider:

  1. Users creating notebook servers in Kubeflow and running their workload directly in the notebook: if we are moving to 16 CPU default nodes, then the current logic can probably stay as it is, i.e. users are restricted to 14 CPU notebooks, and perhaps we don't allow them to use more than that with this type of workload.

  2. Users who want to submit a Kubernetes job or mpijob using a specific manifest (either manually or via the OpenM++ UI and a template): here I would prefer to keep using labels, like the big-cpu label.

I think we would need a new openm/microsimulation-specific label to target a d64 node pool. Is that what you were thinking, @Souheil-Yazji?
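
As a rough illustration of the label-based routing described above, the sketch below maps a label on a Job manifest to a node pool and writes the matching nodeSelector into the pod template. The label keys (`openm`, `big-cpu` with value `"true"`), the pool names, and the `example.com/pool` node label are all hypothetical; the thread only mentions a big-cpu label and a possible d64 pool, not how they are spelled.

```python
import copy

# Hypothetical label keys and pool names, checked in priority order.
POOL_BY_LABEL = [
    ("openm", "d64-pool"),        # proposed microsimulation-specific label
    ("big-cpu", "big-cpu-pool"),  # existing label mentioned in the thread
]
DEFAULT_POOL = "general-pool"
POOL_LABEL_KEY = "example.com/pool"  # assumed node label identifying the pool


def target_pool(labels: dict) -> str:
    """Pick a node pool from the workload's labels, falling back to the default."""
    for label, pool in POOL_BY_LABEL:
        if labels.get(label) == "true":
            return pool
    return DEFAULT_POOL


def route_job(job: dict) -> dict:
    """Return a copy of a Job manifest with a nodeSelector for its target pool."""
    routed = copy.deepcopy(job)
    labels = routed.get("metadata", {}).get("labels", {})
    pod_spec = routed["spec"]["template"]["spec"]
    pod_spec.setdefault("nodeSelector", {})[POOL_LABEL_KEY] = target_pool(labels)
    return routed
```

With this shape, adding a new workload class means adding one entry to `POOL_BY_LABEL` rather than changing the routing code.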

Souheil-Yazji commented 1 month ago

@jacek-dudek @vexingly

At an initial glance, it seems the best approach is to always have users submit their OpenM++ jobs as a separate workload. This will allow us to build the foundation for MPI jobs in the future, if that ever becomes functional.

This would also limit the cost of users scaling up larger notebooks to run jobs and then leaving the resources idle afterwards. If the users run their OpenM++ jobs in isolated pods, which terminate once complete, this will be perfect for:

Whether we use a node selector label or a taint/toleration doesn't matter much.

vexingly commented 3 weeks ago

The two scenarios that I can see for users not submitting the jobs as a separate workload are:

  1. Users not familiar with this workflow find it more complex. Unless we can make it transparent, they will have some issues adjusting and will need some time to work up to a separate job workflow.

  2. When doing very small runs for building scripts, it would be easier / less complex to run locally, but I don't expect them to use many resources for this type of work.

Souheil-Yazji commented 2 weeks ago

@vexingly Next steps for this:

vexingly commented 2 weeks ago

I think notebooks should target intermittent workloads of ~4 CPU, and we should overprovision and expect some slowness with multiple users. They are for non-production runs: more development, testing, and configuring.

When you say big-cpu, do you mean the 72-core machines? Is that what we will use for the time being? They may not have enough memory for some users' workloads, although the CPUs are sufficient. We will need more nodes for sure; I think each of the 4 projects has a quota of 200 CPU currently.
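
A back-of-the-envelope upper bound on node count, using only the figures quoted above. It assumes the worst case where all four projects use their full quota at once and every one of the 72 cores is schedulable; in practice some CPU is reserved for system components, and memory may be the tighter limit.

```python
import math

PROJECTS = 4
QUOTA_CPU_PER_PROJECT = 200
CORES_PER_NODE = 72  # assumes all cores are allocatable, which overstates capacity

total_quota = PROJECTS * QUOTA_CPU_PER_PROJECT           # 800 CPU
nodes_needed = math.ceil(total_quota / CORES_PER_NODE)   # 12 nodes

print(f"{total_quota} CPU of quota needs at least {nodes_needed} nodes")
```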