flux-framework / flux-k8s

Project to manage Flux tasks needed to standardize kubernetes HPC scheduling interfaces
Apache License 2.0
22 stars 10 forks source link

jgf: refactor to use shared functions and fix containment #76

Closed vsoch closed 3 months ago

vsoch commented 4 months ago

Problem: the containment paths are currently not always set, and each resource type has a separate function. Solution: create a shared node generation function, a "ResourceCounter" that uses a common resource counter, where the counter manages the global and resource-specific counts, and then can be passed to a common node generation function!

This scoped commit includes:

I am also removing the NFD features (#75) because I think they are changed and might lead to error if someone has them, and I think it will be good to assess which we want/need (there are a lot!) and then add them back strategically.

I tried really hard to scope commits, but in practice it was like "a few little tiny things" and then the bulk of work I didn't know how to break into smaller pieces, so it's one chunk.

I could add more to this PR to further work on the logic of the ResourceCounter (for example, I think we should be counting things using it and not outside of it) but I don't want to make it too big for easier review.

I'll mark this ready when it's ready for review, and add a few more notes about questions/discussion.

vsoch commented 4 months ago

@cmisale and @milroy - the changes here will tweak the Flux JGF a tiny bit so each has a containment path with (I think) correct indices. There of course could be bugs, so I'd like to look it over together (and myself again). I tried my best to keep the commits scoped, but in practice there is one large one and a few smaller ones. I will hopefully get better at this.

What I'd like to suggest we do next is review the output of the tests - specifically here (click "Run Tests" and you'll see JGF output as I update the graph) where I print a generated JGF (not from a cluster, but from the test) and talk about the changes and if the structure is OK. Then we can decide on a testing strategy, I can implement and finish, and move forward with review. After this PR I'd like to do separate PRs in the following order:

  1. Rename / reorganize utils and have the resources come from the group (instead of a representative pod)
  2. Add in additional resources we are interested in (and consider units)
  3. Add back / think through NFD

I probably can't think ahead more than three things :) I'm also going to be scoping out a design for our turducken fluence / flux-core approach, but I need to finish some other work first.

I hope you had a good memorial day weekend (I'm still not tired of using these emoji :strawberry: :cloud: :blueberries: )

vsoch commented 3 months ago

Thanks for the review! I'll wait to push (tiny changes so far) until we come to consensus on the above.