flux-framework / flux-k8s

Project to manage Flux tasks needed to standardize Kubernetes HPC scheduling interfaces
Apache License 2.0

refactor: testing idea to wrap coscheduling #69

Closed · vsoch closed this 1 month ago

vsoch commented 4 months ago

This is the "skeleton" of a new idea to wrap coscheduling, where we will add in the logic for fluence only where it is needed, likely in the PodGroup (in the new fluence/core/core that wraps the same in coscheduling). This is just a skeleton because we are deploying the sidecar with the wrapped scheduling and absolutely no logic ported over to AskFlux. I think I have a sense of where to put this, but wanted to save this vanilla/skeleton state in case we need to go back to it. Note that it did not work to have fluence inherit the functions from coscheduler, so I opted for a strategy of adding it as a helper field, and then just using it when necessary (a call to f.cosched).

Note that this has no logic to work yet, so the tests will obviously fail. I want to open it so I can clearly see changes here (and update in logical, incremental steps).

Update: this doesn't seem to work, so we will need to duplicate a lot. I pushed again and have included all functions, and also restored functionality for the pod group to be cleaned up by the controller (easier development for me).

Second update: this is now working without clogging for these experiments: https://github.com/converged-computing/operator-experiments/tree/main/google/scheduler/run7

Update: this has been updated with https://github.com/flux-framework/flux-k8s/pull/74.

Before we do a final merge here, we will want to merge https://github.com/flux-framework/fluxion-go/pull/8 and then do a release so we can update the digest here. Note that fluence seems to be working OK, so we can likely do this soon. My next step is working on other tools to supplement the experiments. But I'm not sure fluence in master (as a new build) would work now with upstream (actually, I am sure it will not unless an older version is used).

This will close #71, #73, and #59.

vsoch commented 3 months ago

I'm testing another idea: only allow scheduling the first pod in the group, so that the next pods in the group get seen right after it. It's too bad we can't selectively grab pods from the queue (or maybe we can?). Otherwise we could just get any pod scheduled and then assign the pod to the node (actually assign it, not just declare the node and wait for the pod to show up again).
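
A rough sketch of how that gating could look, assuming some per-group bookkeeping; none of these names (`groupGate`, `admit`, `complete`) come from the fluence code, this is illustration only:

```go
package fluence

import "sync"

// groupGate tracks, per pod group, whether the first pod has already gone
// through scheduling and which nodes were reserved for the group.
type groupGate struct {
	mu        sync.Mutex
	inFlight  map[string]bool     // group -> first pod is currently being scheduled
	allocated map[string][]string // group -> nodes reserved once the first pod lands
}

func newGroupGate() *groupGate {
	return &groupGate{
		inFlight:  make(map[string]bool),
		allocated: make(map[string][]string),
	}
}

// admit reports whether the pod may proceed now. The first pod in a group is
// allowed through and drives the allocation; later pods are pushed back to the
// queue until the group has reserved nodes, then proceed immediately.
func (g *groupGate) admit(group string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if _, ok := g.allocated[group]; ok {
		return true // nodes already reserved: schedule right away
	}
	if g.inFlight[group] {
		return false // another pod in the group went first; re-queue this one
	}
	g.inFlight[group] = true
	return true
}

// complete records the nodes reserved for the group once the first pod is placed.
func (g *groupGate) complete(group string, nodes []string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.allocated[group] = nodes
	delete(g.inFlight, group)
}
```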

vsoch commented 3 months ago

It's running now (and has been running). I can't back this up with data, but it doesn't seem to be doing the old-car thing of freezing for a while and then kicking in. Will update if/when it finishes. I doubled the size of the experiment, so it's size chonkmaster.

Side note - we need a chonker scale for experiments. :cat: :cat2:

LAMMPS?   A Fine Boi!
AMD?      A Heckin' Chonker
MuMMI?    OH LAWD HE COMIN'!
vsoch commented 3 months ago

I'm not sure why there are conflicts here now - going to try working on a separate branch and see what's up.

vsoch commented 3 months ago

Ah, no need - it had been rebased with main, but the PR was opened against another (now deprecated) branch. Updating the PR to target main resolved that (and revealed the large amount of other work that preceded this one).

vsoch commented 2 months ago

@milroy I went through and changed most of the short variables (e.g., pgMgr) to be fully written out. I left a few that are convention for operators (reconcilers) like "mgr" (manager) and "req" (request). The one I didn't change from the original code was "ps" primarily because I don't know what that means. Let me know if there are others you'd like expanded. Thanks for starting to take a look!

milroy commented 2 months ago

I'm in the process of reviewing this PR, but it is taking quite a while. There's a lot of churn in the commits, e.g., fluence.go is updated 17 times across commits. That makes it difficult to track the desired final state and also causes me to wonder if my comments will be obviated by a later commit.

vsoch commented 2 months ago

> I'm in the process of reviewing this PR, but it is taking quite a while. There's a lot of churn in the commits, e.g., fluence.go is updated 17 times across commits.

I'm really sorry it's so big - it's added up over many months, and understandably that sucks to review. I won't push anything until you are finished, and we can talk about the best way to work on it to minimize reviewer pain.

> That makes it difficult to track the desired final state

Can you focus on reviewing the final state (not each commit)?

> and also causes me to wonder if my comments will be obviated by a later commit.

I think if I were to push, the comments would be marked as outdated, but they wouldn't go away (we could still see them).

vsoch commented 2 months ago

@milroy commit with response to review is here: https://github.com/flux-framework/flux-k8s/commit/cbeffceb04502a22396da984f620e8f9cd9ff99a

I left unresolved the comments that need further conversation, plus the ones that were more fun and worth your seeing. :baseball: :hook: I do see some issues with the design of the JGF, but I'm not sure if there are functional implications. I updated an issue to reflect those thoughts: https://github.com/flux-framework/flux-k8s/issues/71 (please add other bullets to it). Let's discuss the comments here as much as we can, and (if we can meet this Wednesday) finish up the discussion then. Before we merge, I will do one final run, with fluence, of the experiments I've done before to make sure everything is still working.

milroy commented 1 month ago

Thank you very much for the quick and detailed work addressing my comments. This is a tremendous improvement, and makes Fluence far more useful and stable.

I'm approving the PR, but let's discuss a few points and test the changes as you suggested before merging.

vsoch commented 1 month ago

To update from my post in Slack (which will be lost to the black hole of segfaults):

Fluence experiments are done. I ran three sizes, each as one batch with a varying number of iterations, where one iteration == 5 jobs ranging in size from 2 to 6. Small was 3 iterations, medium was 5, and large was 10 (so 15, 25, and 50 jobs, respectively). No clogging in any case, and the one flaw we have (that we know about) is that when the queue gets much longer, you have to wait longer to cycle through the entire thing and schedule the next job. So the large size took about 38 minutes. But the default scheduler would fail (clog) after a few minutes, so that is what we are optimizing for: no clogging.

And a follow-up question I had this morning to think about: why do we use the Kubernetes queue, period? Why can't we do what Flux does with the planner and schedule the job for a set of nodes in the future? Then we wouldn't need to process through the same jobs continuously. It would improve things quite a bit, I think. I suspect we are missing a Flux component.

I think we are good to merge if someone wants to do the honors, or I can. It's not perfect and there is more to do, but this is a first step. :footprints:

milroy commented 1 month ago

> And a follow-up question I had this morning to think about: why do we use the Kubernetes queue, period? Why can't we do what Flux does with the planner and schedule the job for a set of nodes in the future? Then we wouldn't need to process through the same jobs continuously. It would improve things quite a bit, I think. I suspect we are missing a Flux component.

We've started discussing improvements and changes to Fluence along these lines, and let's continue in a separate issue. For now, I'm merging this PR.