flux-framework / flux-k8s

Project to manage Flux tasks needed to standardize kubernetes HPC scheduling interfaces
Apache License 2.0
20 stars 10 forks source link

[WIP] Updates to fluence to restore functionality and update builds #61

Closed vsoch closed 3 months ago

vsoch commented 5 months ago

This PR will likely be open for a few days while I carefully work through needed changes. I want to try and commit in a meaningful way (despite my previous commits being a bit messy I think) and have the diffs here to inspect. In particular, I plan to do the following:

Docker Builds

Miscellaneous

Resources

Services

PodGroup

Controller

Update removed testing cases, because the deployment / statefulset etc. cleanup strategies need further thinking and discussion, unlikely cases so not priority.

This is a WIP, will change the title if/when ready for review, and add points as I work on / find them. The criteria (for me) for this to be ready will be resolving the clogging use case, and consistently working for the larger test case on GCP.

vsoch commented 5 months ago

Ah this is really interesting - the PodGroup reconciler was already watching pod so there was never a need to manually create the PodGroup - given the label, it would have been created for us. The subtle difference between then and now is that we want everything scheduled by fluence to have a group, so given our added webhook that adds the label, we would only hit this case if a Pod came through that was scheduled by the default scheduler (because it wouldn't be using the controller alongside fluence, but would still be watching all Pods in the cluster). But maybe there is a use case to do that for combining across objects with pods, or asking for a different size than the number of pods in some set. Will think more about it tomorrow / later today. :zzz:

Ah I don't think that's right (was looking at my updated code) - in the original it would hit here and not be found, so it does need something to create it.

vsoch commented 5 months ago

See last commit message for recent update: https://github.com/flux-framework/flux-k8s/pull/61/commits/000baac47dd77e36f9cbee455b7509bfa5dfcb02

The PR now supports the full pipeline of a webhook adding labels, a watcher on pod to get them and trigger a PodGroup reconcile, and creation of the PodGroup. In a diagram:

apply -> 
  podgroup webhook -> 
    add labels (size/name) -> 
      watch on pod in PodGroup reconciler --> 
        if Pending -> PodGroup

Here is a an example PodGroup created by a Job that has completions/parallelism set to 3 - we see that the size reflects that.

image

And then the PodGroup is created and enters the reconcile process akin to any other CRD. Note that I also decided to keep / use the orginal label for the name - I didn't see a strong reason to change it, and was worried about potential bugs that having two might cause in parts of the code we don't reproduce / aren't familiar with.

At this point we should have enough to (finally) integrate this back into fluence, of course the changes are now that the PodGroup creation (size detection) is automated, the timestamp is updated to be Microseconds, and we have better ownership / control over how the group and associated metadata is handled (good for future ideas I think).

Still no guarantee any of this will work, but i can say that we haven't failed yet.