Open vexingly opened 1 month ago
Started looking at the code in the toleration injector controller. I will study it some more, then post follow-up questions here and request comments from Pat and others.
Studied some alternative methods of node selection. There are three: one based on the nodeSelector field and node labels; one based on the affinity field, which allows more expressive conditions on node labels and distinguishes between required and preferred conditions; and one based on node taints and corresponding tolerations.
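For reference, here is a rough Go sketch of what each of the three options looks like in a pod spec. The structs are minimal local stand-ins so the snippet runs on its own (real code would use the types in k8s.io/api/core/v1), and the label/taint key `example.org/node-pool` and the values `openm` and `d64` are placeholders I made up, not names we have agreed on.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local stand-ins for the pod spec fields involved in node selection.
// The JSON field names follow the Kubernetes pod spec.
type Requirement struct {
	Key      string   `json:"key"`
	Operator string   `json:"operator"`
	Values   []string `json:"values,omitempty"`
}
type Term struct {
	MatchExpressions []Requirement `json:"matchExpressions"`
}
type Required struct {
	NodeSelectorTerms []Term `json:"nodeSelectorTerms"`
}
type Preferred struct {
	Weight     int  `json:"weight"`
	Preference Term `json:"preference"`
}
type NodeAffinity struct {
	Required  *Required   `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"`
	Preferred []Preferred `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"`
}
type Affinity struct {
	NodeAffinity *NodeAffinity `json:"nodeAffinity,omitempty"`
}
type Toleration struct {
	Key      string `json:"key"`
	Operator string `json:"operator"`
	Value    string `json:"value,omitempty"`
	Effect   string `json:"effect,omitempty"`
}
type PodSpec struct {
	NodeSelector map[string]string `json:"nodeSelector,omitempty"`
	Affinity     *Affinity         `json:"affinity,omitempty"`
	Tolerations  []Toleration      `json:"tolerations,omitempty"`
}

// Option 1: exact-match node labels via nodeSelector.
func withNodeSelector(key, value string) PodSpec {
	return PodSpec{NodeSelector: map[string]string{key: value}}
}

// Option 2: node affinity with one required and one preferred condition.
func withNodeAffinity(reqKey, reqValue, prefKey, prefValue string) PodSpec {
	in := func(k, v string) Term {
		return Term{MatchExpressions: []Requirement{{Key: k, Operator: "In", Values: []string{v}}}}
	}
	return PodSpec{Affinity: &Affinity{NodeAffinity: &NodeAffinity{
		Required:  &Required{NodeSelectorTerms: []Term{in(reqKey, reqValue)}},
		Preferred: []Preferred{{Weight: 50, Preference: in(prefKey, prefValue)}},
	}}}
}

// Option 3: a toleration matching a NoSchedule taint on the target nodes.
func withToleration(key, value string) PodSpec {
	return PodSpec{Tolerations: []Toleration{
		{Key: key, Operator: "Equal", Value: value, Effect: "NoSchedule"},
	}}
}

func main() {
	const poolKey = "example.org/node-pool" // placeholder key
	specs := []PodSpec{
		withNodeSelector(poolKey, "openm"),
		withNodeAffinity(poolKey, "openm", "example.org/size", "d64"),
		withToleration(poolKey, "openm"),
	}
	for _, s := range specs {
		out, err := json.MarshalIndent(s, "", "  ")
		if err != nil {
			panic(err)
		}
		fmt.Println(string(out))
	}
}
```

One thing to keep in mind when comparing them: a toleration only permits a pod to land on tainted nodes, it does not attract the pod to them. So if we go the taint/toleration route, we would likely still need a nodeSelector or affinity rule to pin the pod to the pool.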
Pat, could you comment on what types of Kubernetes workloads are expected to be openm workloads? How will they be distinguished from other workloads? Do we have a set of labels in mind that will be applied to the pod manifests?
And do you prefer a particular node selection method over the others (i.e. toleration injection versus nodeSelector or affinity specified in pod specs)?
Hi @jacek-dudek, I think there are two workloads to consider:
Users creating notebook servers in kubeflow and running their workload directly in the notebook: I think if we are moving to 16CPU default nodes, then the current logic can probably stay as it is, i.e. users are restricted to 14CPU notebooks and perhaps we don't allow them to use more than that with this type of workload
Users who want to submit a kubernetes job or mpijob using a specific manifest (either manually or via the OpenM++ UI and a template): I would prefer to keep using labels, like the big-cpu label.
I think we would need a new openm/microsimulation specific label to target a d64 node pool, is that what you were thinking @Souheil-Yazji ?
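To make the label idea concrete, here is a minimal Go sketch of how an injector could turn pod labels into toleration patches. This is not the logic in the existing mutate.go, just an illustration: the label keys `example.org/big-cpu` and `example.org/openm`, the taint key `node-pool`, and the taint values are all hypothetical. The output is a list of JSON Patch (RFC 6902) operations, which is the patch format a mutating admission webhook returns.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PatchOp is a single JSON Patch (RFC 6902) operation.
type PatchOp struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value"`
}

// Toleration mirrors the fields of a pod toleration that matter here.
type Toleration struct {
	Key      string `json:"key"`
	Operator string `json:"operator"`
	Value    string `json:"value"`
	Effect   string `json:"effect"`
}

// labelRule maps a pod label to the taint its pods should tolerate.
type labelRule struct {
	LabelKey   string
	TaintKey   string
	TaintValue string
}

// Hypothetical rules: every name below is a placeholder.
var rules = []labelRule{
	{LabelKey: "example.org/big-cpu", TaintKey: "node-pool", TaintValue: "big-cpu"},
	{LabelKey: "example.org/openm", TaintKey: "node-pool", TaintValue: "openm-d64"},
}

// tolerationPatch returns the patch operations for a pod with the given
// labels. existing is the number of tolerations already on the pod; when it
// is zero the array has to be created before anything can be appended.
func tolerationPatch(podLabels map[string]string, existing int) []PatchOp {
	var ops []PatchOp
	for _, r := range rules {
		if podLabels[r.LabelKey] != "true" {
			continue
		}
		tol := Toleration{Key: r.TaintKey, Operator: "Equal", Value: r.TaintValue, Effect: "NoSchedule"}
		if existing == 0 && len(ops) == 0 {
			// No tolerations array yet: create it with this entry.
			ops = append(ops, PatchOp{Op: "add", Path: "/spec/tolerations", Value: []Toleration{tol}})
		} else {
			// Array exists (or an earlier op creates it): append with "-".
			ops = append(ops, PatchOp{Op: "add", Path: "/spec/tolerations/-", Value: tol})
		}
	}
	return ops
}

func main() {
	labels := map[string]string{"example.org/openm": "true"}
	out, err := json.MarshalIndent(tolerationPatch(labels, 0), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Keeping the mapping in a small rule table would mean a new node pool only needs a new entry, not new branching logic, but that is a design suggestion and not a description of the current controller.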
@jacek-dudek @vexingly
Just at an initial glance, it seems the best approach is to always have the users submit their Open M jobs as a separate workload. This will allow us to build the foundation for MPI jobs in the future, if that ever becomes functional.
This would also limit the cost of users scaling up larger notebooks to run jobs and then leaving the resources idle afterwards. If the users run their Open M jobs in isolated pods, which terminate once complete, this will be perfect for:
Whether we use a node selector label or taint/toleration isn't very problematic.
The two scenarios that I can see for users not submitting the jobs as a separate workload are:
Users not familiar with this workflow find it more complex, and unless we can make it transparent they will have some issues adjusting / need some time to work up to a separate job workflow
When doing very small runs for building scripts it would be easier / less complex to run locally, but I don't expect them to use many resources for this type of work
@vexingly Next steps for this:
I think notebooks should target intermittent workloads of ~4 CPU and should be over-provisioned, so users should expect some slowness when several of them are active. Notebooks would be for non-production runs: development, testing and configuring.
When you say big-cpu do you mean the 72 core machines? Is that what we will use for the time being? They may not have enough memory for some users' workloads, although the CPUs are sufficient. We will need more nodes for sure, I think each of the 4 projects has a quota of 200 CPU currently.
The nodepools will be updated in https://github.com/StatCan/aaw/issues/1967 but we will require some logic for node selection.
Existing logic for notebooks is here: https://github.com/StatCan/aaw-toleration-injector/blob/main/mutate.go