flux-framework / flux-k8s

Project to manage Flux tasks needed to standardize kubernetes HPC scheduling interfaces

update: readme for WIP changes to fluence #42

Closed vsoch closed 3 months ago

vsoch commented 7 months ago

Problem: our current fluence plugin is a bit behind the current kubernetes-sigs/scheduler-plugins, which results in the errors shown here: https://github.com/converged-computing/operator-experiments/tree/main/google/scheduler/run0#example-scheduling.

Today I felt brave (or just stupid) and decided to take another look. Our main issue was getting the openshift-psap scheduler-plugins fluence branch up to date with > 100 commits from upstream kubernetes-sigs/scheduler-plugins. I first attempted a proper rebase, knowing this is typically the desired approach, but after about an hour (once it got into the vendor folder) it became clear this was not going to work; someone who had worked on the entire history would need to devote days to it, at least it seems. Googling for "how to rebase with 100s of commit changes" largely turned up the suggestion to do an old school merge, meaning that we would still need to resolve conflicts, but only once. I kept a record of the conflicts and some notes here: https://gist.github.com/vsoch/3c2b6d69607cab68de057ccbd003adeb. The strategy I decided on today was the following:

  1. Start with a merge of kubernetes-sigs/scheduler-plugins into the current fluence branch. I did this via my own repository and branch.
  2. As carefully as possible, resolve the merge conflicts, taking both upstream and the fluence branch into account. In practice this was easier to do by nuking the entire vendor/ directory and taking the upstream version from kubernetes-sigs, since it's hugely unlikely that we made changes there. For every other conflict, I looked at upstream and fluence and decided on the best action to take (a rough command sketch follows the test scripts below).
  3. It looks like the upstream tests are run with prow, and I'm not sure what triggers them, but I found obviously named scripts in hack/ to run the unit and integration tests (see below).
# Set up envtest in TMP
./hack/install-envtest.sh

# Run unit and integration tests
./hack/unit-test.sh
./hack/integration-test.sh
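
For reference, the merge-then-reset-vendor flow from steps 1 and 2 looks roughly like the following. This is a sketch rather than the exact commands I ran, and the remote and branch names are illustrative:

# Rough sketch of steps 1 and 2 above; remote and branch names are illustrative
git remote add upstream https://github.com/kubernetes-sigs/scheduler-plugins
git fetch upstream
git checkout fluence
git merge upstream/master
# Take the upstream vendored dependencies wholesale instead of resolving vendor/ piecemeal
git checkout upstream/master -- vendor/
go mod tidy && go mod vendor
# Finish the merge once everything else is resolved
git add -A && git commit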

Running these tests surfaced a lot of errors, which led me to the issues discussed below. I was able to fix all of them and get both the unit and integration tests passing. See the notes in the next section for more details.

Notes

Changes to packages

For some reason this merge left many changed files in the openshift repo that were not in upstream. It actually looks like an earlier openshift update undid changes that persist (and still exist) in the kubernetes-sigs master branch. The primary changes were to associated packages that didn't seem relevant for fluence, and to functions in the coscheduling / capacityscheduling packages changed to just return two values instead of one (with one always nil). Since the fluence branch didn't directly edit these files, I assume the changes are stale, and I updated them to match upstream.

The other change that was wrong was simply that fluence.go had (for some reason) been renamed back to kubeflux.go. I'm pretty sure I did this rename originally, so likely I just did something wrong. The commit history looks OK, so I just renamed it again in a follow-up commit. Another set of eyes would be good here.

Testing

I went through a rebuild of the fluence image from flux-k8s against this updated repository, and the error reported above seems to be gone: fluence is running again, at least superficially!
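
For anyone following along, the rebuild looks roughly like this, assuming the stock local-image Makefile target from scheduler-plugins is still what produces the scheduler image; the registry and image names below are placeholders, not the real ones:

# Build the scheduler image from the merged scheduler-plugins checkout
cd scheduler-plugins
make local-image
# Find the freshly built image, then retag/push to wherever the fluence deployment pulls from
docker images | grep scheduler-plugins
docker tag <built-image> <registry>/fluence:latest
docker push <registry>/fluence:latest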

(screenshots of fluence running omitted)

So that is good. This pull request is primarily to facilitate discussion/planning, and is a WIP with some small tweaks (primarily docs / the README) that will likely be updated when we decide on next steps (and a future strategy; see the next section).

Next Steps

These are my suggested next steps:

  1. Figure out how prow determines which tests to run, and ensure that all tests are indeed passing on our fixes: https://github.com/researchapps/scheduler-plugins/pull/1
  2. Let's next talk about what warrants a separate branch of the kubernetes-sigs repository to begin with. If the important part is just adding the fluence module in pkg, I wonder if there is a way to maintain it separately (meaning not with scheduler-plugins) and then, for example, bring in the core of scheduler-plugins from kubernetes-sigs at build time. Now that the update is done we can compare the two side by side, and only 21 files differ (see the diff sketch after this list)! So I think we should try to separate these.
    • Easiest case: we maintain all those 21 files somewhere else.
    • Ideal case: we understand the choice to recreate coscheduling with Fluence, and try to use interfaces to inherit more functionality from the parent class (if possible).
  3. When they are separated, we should have regular CI that combines them and tests both the build and running pods. This will ensure that when something changes or breaks, we know about it soon and can fix it.
  4. If we can separate the two, we need to decide where they will live. Ideally we could have all assets in flux-k8s (here) and not need to bring in a third repository (in addition to the kubernetes-sigs repository). Either way, I do not think we should be relying on a fork of kubernetes-sigs/scheduler-plugins. It's too much complexity.
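
For item 2, a quick way to do that side-by-side comparison is a plain diff against upstream that ignores the vendored dependencies. A sketch, with illustrative remote and branch names:

# List only the files that differ between upstream and the fluence branch, ignoring vendor/
git fetch upstream
git diff --name-only upstream/master fluence -- . ':(exclude)vendor'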

Since I have functioning fluence images, I am unblocked from setting up testing for Google Cloud (this was the original failure). I will try that again this evening, because I'm excited (and hoping that it works). I will update this PR with what I learn.

@cmisale @milroy this is primarily for your FYI. The main work is here: https://github.com/researchapps/scheduler-plugins/pull/1

I tried my best to have clear, succinct / modular commits (per flux standard), but I'm still fairly bad at that.

(screenshot omitted)

I think there is a lot to pick up / learn about how custom scheduling plugins work, and I'm just at the tip of the iceberg. If I did this all wrong / it ultimately isn't right, I apologize for the noise! It definitely was fun... I think I was looking for something complex to stick my head into today (cue ostrich visual) and it definitely delivered! :raised_hands:

Update: I was able to deploy fluence to Google Cloud, deploy the same sample pods, and confirm that fluence schedules them. I also installed the MPI operator and tried a test run of a previous lammps container. There is likely something wrong with the container architecture (it was intended for AWS or another platform) relative to the Google node and my system, because I got "Illegal instruction (core dumped)" both when testing with the MPI operator and when pulling the container and running lmp directly on my local machine (notes at the bottom of the README here: https://github.com/converged-computing/operator-experiments/tree/main/google/scheduler/run1). I'm not particularly worried about this, because I suspect we will build newer containers and also use the Flux Operator. I'll post an update in the appropriate chats about next steps.
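
For what it's worth, the first things I would check for that "Illegal instruction" are whether the image architecture matches the node, and if so, whether the binary was built for CPU features the node lacks. A sketch; the image name is a placeholder:

# Does the image architecture match the node? (image name is a placeholder)
docker inspect --format '{{.Os}}/{{.Architecture}}' <lammps-image>
kubectl get nodes -o custom-columns=NAME:.metadata.name,ARCH:.status.nodeInfo.architecture
# If those match, it is more likely a CPU feature mismatch (e.g. AVX variants the node lacks);
# compare the CPU flags on the node / local machine with whatever lammps was compiled for
lscpu | grep -i '^flags'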