Investigate / review / setup node selection logic

vexingly commented 1 month ago

The nodepools will be updated in https://github.com/StatCan/aaw/issues/1967 but we will require some logic for node selection.

- Keep the existing logic that launches notebooks requesting more than 14 CPUs on the "usercpuXXxx" nodes.
- Add a taint/toleration so that OpenM++ jobs are scheduled onto these larger CPU-based nodes.
- Investigate other methods for scheduling different scenarios? The UAT will provide more details on how users schedule their jobs, which could help in working out a longer-term strategy.
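
To make the first item concrete, here is a minimal sketch of a CPU-threshold rule of that kind, written as a function that returns the JSON Patch a mutating webhook would send back. It is not the code in mutate.go: the taint key and value are hypothetical placeholders, and summing container CPU requests (rather than limits) is an assumption. Only the 14-CPU threshold comes from this issue.

```python
from decimal import Decimal

CPU_THRESHOLD = Decimal(14)  # threshold quoted in this issue

# Hypothetical taint; the real key and value are defined by the nodepool config.
BIG_CPU_TOLERATION = {
    "key": "example.com/purpose",
    "operator": "Equal",
    "value": "big-cpu",
    "effect": "NoSchedule",
}


def parse_cpu(quantity) -> Decimal:
    """Convert a CPU quantity in cores ("2", "1.5") or millicores ("500m") to cores."""
    quantity = str(quantity)
    if quantity.endswith("m"):
        return Decimal(quantity[:-1]) / 1000
    return Decimal(quantity)


def requested_cpu(pod: dict) -> Decimal:
    """Sum the CPU requests of all containers in the pod (assumption: requests, not limits)."""
    total = Decimal(0)
    for container in pod.get("spec", {}).get("containers", []):
        requests = container.get("resources", {}).get("requests", {})
        total += parse_cpu(requests.get("cpu", "0"))
    return total


def toleration_patch(pod: dict) -> list:
    """Return JSON Patch operations that add the toleration, or [] when it is not needed."""
    if requested_cpu(pod) <= CPU_THRESHOLD:
        return []
    if pod.get("spec", {}).get("tolerations"):
        # A non-empty list already exists: append to it.
        return [{"op": "add", "path": "/spec/tolerations/-", "value": BIG_CPU_TOLERATION}]
    # No list yet (or an empty one): create it.
    return [{"op": "add", "path": "/spec/tolerations", "value": [BIG_CPU_TOLERATION]}]
```

A pod requesting 16 CPUs would get the toleration added, while one requesting "4000m" would be left untouched.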

Existing logic for notebooks is here: https://github.com/StatCan/aaw-toleration-injector/blob/main/mutate.go

jacek-dudek commented 1 month ago

Started looking at the code in the toleration injector controller. Will study it some more, post follow-up questions here, and request comments from Pat and other people.

jacek-dudek commented 1 month ago

Studied some alternative methods of node selection. One is based on the nodeSelector field and node labels. Another is based on the affinity field, which allows more expressive conditions on node labels and also distinguishes between required and preferred conditions. Finally, there is one based on node taints and corresponding tolerations.
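
For comparison, the three methods look roughly like this as pod spec fragments (written as Python dicts so they can be merged into a manifest). The field names are the standard Kubernetes ones; the label keys, values, and the taint are hypothetical and used only for illustration.

```python
# Hypothetical label/taint used for illustration only.
POOL_KEY = "example.com/pool"
POOL_VALUE = "big-cpu"

# 1. nodeSelector: the node must carry every listed label with exactly this value.
node_selector = {"nodeSelector": {POOL_KEY: POOL_VALUE}}

# 2. Node affinity: match expressions with operators, split into a hard
#    requirement and a weighted (soft) preference.
affinity = {
    "affinity": {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {"matchExpressions": [
                        {"key": POOL_KEY, "operator": "In", "values": [POOL_VALUE]},
                    ]},
                ],
            },
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": 50,
                    "preference": {"matchExpressions": [
                        # Hypothetical secondary label.
                        {"key": "example.com/disk", "operator": "In", "values": ["ssd"]},
                    ]},
                },
            ],
        },
    },
}

# 3. Toleration: lets the pod onto nodes tainted with key=value:NoSchedule.
tolerations = {
    "tolerations": [
        {"key": POOL_KEY, "operator": "Equal", "value": POOL_VALUE, "effect": "NoSchedule"},
    ],
}


def with_placement(pod_spec: dict, *fragments: dict) -> dict:
    """Return a copy of pod_spec with the given placement fragments merged in."""
    merged = dict(pod_spec)
    for fragment in fragments:
        merged.update(fragment)
    return merged
```

One difference worth keeping in mind: a toleration only permits a pod to land on a tainted node, it does not steer the pod there. To both keep other workloads off a pool and direct specific pods onto it, a taint/toleration is typically combined with a nodeSelector or affinity rule, e.g. `with_placement(spec, node_selector, tolerations)`.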

jacek-dudek commented 1 month ago

Pat, could you comment on what types of Kubernetes workloads are expected to be OpenM++ workloads? How will they be distinguished from other workloads? Do we have a set of labels in mind that will be applied to the pod manifests?

And do you prefer a particular node selection method over the others (i.e. toleration injection versus nodeSelector or affinity specified in pod specs)?

vexingly commented 1 month ago

Hi @jacek-dudek, I think there are two workloads to consider:

1. Users creating notebook servers in Kubeflow and running their workload directly in the notebook: I think if we are moving to 16-CPU default nodes, then the current logic can probably stay as it is, i.e. users are restricted to 14-CPU notebooks and perhaps we don't allow them to use more than that with this type of workload.
2. Users who want to submit a Kubernetes job or mpijob using a specific manifest (either manually or via the OpenM++ UI and a template): I would prefer to keep using labels, like the big-cpu label.

I think we would need a new openm/microsimulation specific label to target a d64 node pool, is that what you were thinking @Souheil-Yazji ?

Souheil-Yazji commented 1 month ago

@jacek-dudek @vexingly

Just at an initial glance, it seems the best approach is to always have the users submit their Open M jobs as a separate workload. This will allow us to build the foundation for MPI jobs in the future, if that ever becomes functional.

This would also limit the cost of users scaling up larger notebooks to run jobs and then leaving the resources idle afterwards. If the users run their Open M jobs in isolated pods, which terminate once complete, this will be perfect for:

- costing purposes, because resources will scale down after completion
- monitoring/logging, because of the container-level isolation; resource provisioning can then be optimized based on monitoring results
- pushing all workloads to a different nodepool to prevent resource contention; the nodepool can fully scale down once users are no longer working (but this does introduce an annoying ~5min latency for the first job)
- users can omit the node selector if they just want to run the workload container on the native nodepool (which will probably be much smaller than the CPU-optimized nodes)
- the work done to make the UI submit MPI Jobs can be re-used to submit regular podspecs with OpenM jobs instead
- in the case of AAW, the nodes which run the notebooks are tainted, so the small jobs running on them will need those tolerations as well
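
A rough sketch of what such an isolated, self-terminating workload could look like as a batch/v1 Job, built as a Python dict. The function name, container name, and TTL value are hypothetical; the node selector and tolerations are left as parameters so a caller can omit the selector to stay on the default nodepool, or pass whichever tolerations the target nodes require.

```python
import json


def openm_job(name, image, args, cpu, node_selector=None, tolerations=()):
    """Build a batch/v1 Job manifest (as a dict) for a single model run."""
    pod_spec = {
        "restartPolicy": "Never",  # the pod ends when the run ends
        "containers": [{
            "name": "openm-run",  # hypothetical container name
            "image": image,
            "args": list(args),
            "resources": {
                "requests": {"cpu": str(cpu)},
                "limits": {"cpu": str(cpu)},
            },
        }],
    }
    # Placement is optional: without a selector the scheduler picks any node
    # whose taints the pod tolerates.
    if node_selector:
        pod_spec["nodeSelector"] = dict(node_selector)
    if tolerations:
        pod_spec["tolerations"] = [dict(t) for t in tolerations]
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed run automatically
            "ttlSecondsAfterFinished": 3600,  # illustrative value: clean up an hour after finishing
            "template": {"spec": pod_spec},
        },
    }


if __name__ == "__main__":
    # Hypothetical image and label; JSON output can be fed to `kubectl apply -f -`.
    job = openm_job(
        "openm-run-1",
        "example.com/openm-model:latest",
        [],
        cpu=32,
        node_selector={"example.com/pool": "big-cpu"},
    )
    print(json.dumps(job, indent=2))
```

Because the Job owns the pod and the pod does not restart, the nodes it occupied become eligible for scale-down once the run finishes, which is the cost behaviour described above.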

Whether we use a node selector label or a taint/toleration doesn't matter much.

vexingly commented 3 weeks ago

The two scenarios that I can see for users not submitting the jobs as a separate workload are:

- Users not familiar with this workflow find it more complex, and unless we can make it transparent they will have some issues adjusting / need some time to work up to a separate job workflow.
- When doing very small runs for building scripts it would be easier / less complex to run locally, but I don't expect them to use many resources for this type of work.

Souheil-Yazji commented 2 weeks ago

@vexingly Next steps for this:

[ ] Advise users on best practices:
1. Run small jobs on their own notebook, to avoid large costs for scaling up expensive infra.
2. Run Big Jobs using custom OpenM Job Manifests
3. Define what "Small" and "Big" are or at least provide a suggestion
[ ] Create a custom OpenM manifest template that end users can submit and that includes the appropriate labels/tolerations to schedule the jobs to the Big-Cpu nodepool, which is currently limited to 1 node per pool. Either:
1. Add a Nodepool with a different VMSS type:
2. Increase NodePool limit for Big CPU

vexingly commented 2 weeks ago

I think notebooks should target intermittent workloads of ~4 CPU, and we should overprovision / expect some slowness with multiple users. These are non-production runs, more for development and testing / configuring.

When you say big-cpu do you mean the 72-core machines? Is that what we will use for the time being? They may not have enough memory for some users' workloads, although the CPUs are sufficient. We will need more nodes for sure; I think each of the 4 projects has a quota of 200 CPU currently.
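
As a back-of-the-envelope check on those numbers, the sketch below computes how many whole nodes fit inside a CPU quota. It assumes the quota is counted in full node cores and that the entire quota is available to a single pool, neither of which is confirmed here; the 64-core case stands in for the d64 node size mentioned earlier.

```python
def nodes_within_quota(quota_cpu: int, cores_per_node: int) -> tuple:
    """Return (whole nodes that fit in the quota, CPUs left over)."""
    nodes = quota_cpu // cores_per_node
    leftover = quota_cpu - nodes * cores_per_node
    return nodes, leftover


print(nodes_within_quota(200, 72))  # (2, 56): two 72-core nodes, 56 CPUs unused
print(nodes_within_quota(200, 64))  # (3, 8): three 64-core nodes, 8 CPUs unused
```

Under those assumptions a 200-CPU quota covers at most two 72-core nodes or three 64-core nodes per project.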

Investigate / review / setup node selection logic #1970