flux-framework / flux-k8s

Project to manage Flux tasks needed to standardize kubernetes HPC scheduling interfaces
Apache License 2.0

Post refactor changes needed #71

Open vsoch opened 6 months ago

vsoch commented 6 months ago
milroy commented 4 months ago

I agree with these items. Here's more detail on the containment / JGF format issues from PR 69: https://github.com/flux-framework/flux-k8s/pull/69#discussion_r1609438917

vsoch commented 4 months ago

Adding a note for myself:

Here, https://github.com/flux-framework/flux-k8s/blob/33ab097efdde140962f3aa0bfd2b7e1dc29fbd3a/src/fluence/utils/utils.go#L144-L149, where we create a subnet, we are inside a listing of nodes. My understanding of subnet zones (which I need to verify with a node listing) is that we can have multiple nodes under the same subnet. So for this part of the code we likely need a lookup of subnets by name, and should only create another one if we haven't already created it. Or create them first in a separate outer loop; either would be equivalent. I don't think we do anything with this in the current implementation (so likely no issue), but if we had multiple subnet zones I think we might get duplicate nodes in the graph with different indices, or something like that. Wanted to put a note here so I don't forget (I have a lot in my head at the moment).

OK going to try this:

[image]
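
As a rough illustration of the lookup idea in the note above, subnet vertices could be cached by name so that nodes sharing a subnet zone reuse one vertex instead of creating duplicates with different indices. This is only a minimal sketch; the Graph and Vertex types here are hypothetical stand-ins, not the actual fluence/JGF code:

package main

import "fmt"

// Vertex is a hypothetical stand-in for a vertex in the resource graph
// (not the actual fluence/JGF struct).
type Vertex struct {
    Index int
    Type  string
    Name  string
}

// Graph keeps a lookup of subnet vertices by name so each subnet is created
// only once, even when many nodes live under the same subnet zone.
type Graph struct {
    vertices []*Vertex
    subnets  map[string]*Vertex
}

func NewGraph() *Graph {
    return &Graph{subnets: map[string]*Vertex{}}
}

// GetOrCreateSubnet returns the existing subnet vertex for name, creating it
// on first use.
func (g *Graph) GetOrCreateSubnet(name string) *Vertex {
    if v, ok := g.subnets[name]; ok {
        return v
    }
    v := &Vertex{Index: len(g.vertices), Type: "subnet", Name: name}
    g.vertices = append(g.vertices, v)
    g.subnets[name] = v
    return v
}

func main() {
    g := NewGraph()
    // Two nodes in the same subnet zone share a single subnet vertex.
    for _, node := range []string{"node-0", "node-1"} {
        subnet := g.GetOrCreateSubnet("zone-a")
        fmt.Printf("%s -> subnet %q (index %d)\n", node, subnet.Name, subnet.Index)
    }
}

The same effect could be had by creating all subnets in a separate outer loop first, as noted above; the map just avoids the second pass.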

vsoch commented 3 months ago

@cmisale @milroy I'm working on the second bullet above, and wanted to have some discussion about the format that we want. We currently do something like this (and please correct me if I'm wrong - I get this confused with jobspec nextgen):

version: 1
resources:
  - type: slot
    label: default
    count: 2
    with:
    - type: core
      count: 16
tasks:
  - command: [ "app" ]
    slot: default
    count:
      per_slot: 1

Memory / GPU are added if they are defined for the pod. That is done by parsing one container and then having the slot -> count be the number of nodes (I think). If we parse each container individually (which assumes they might be different), I want to know what that should look like. The only thing that made sense to me was to move the count down to the node, and then be able to say how many of each node is requested:

version: 1
resources:
  - type: slot
    label: default
    with:
    - type: node
      count: 1
      with:
        - type: core
          count: 4
        - type: gpu
          count: 1
    - type: node
      count: 4
      with:
        - type: core
          count: 16

tasks:
  - command: [ "app" ]
    slot: default
    count:
      per_slot: 1

But I remember there was a reason for having the slot right above the core (and not including the nodes), so I think that might be wrong. That said, I don't know how to enforce a design here with nodes of different types, because the approach that asks for "this many CPU across whatever resources you need" doesn't capture the multiple (different) containers well.

If possible, let's have some discussion on the above! I have the next PR well underway but I paused here because I wasn't sure.

cmisale commented 3 months ago

hm I have to say I don't remember that well how to define jobspecs... I was much better before lol. That said, I'm not convinced we need to count all pods in a group. Do pods in a group belong to the same controller id? If so, they all have the same requirements and we can just ask for n slots where n is the number of pods. This assumes I am understanding the question correctly, which might not be the case :D

vsoch commented 3 months ago

I think if we want to ask fluxion for the right resources, and if the pods vary slightly, we might need to customize that request for the pods that we see. For example, let's say the group has two pods that request 32 cores each and 1 gpu (some ML thing), and then 2 more pods that just need 16 cores and no gpu (some service). Up until this point we have used a "representative pod" and then multiplied it by count (maybe all 4 pods require 32 cpu), and in practice that is the most likely use case (homogeneous clusters). But we could very easily, for example, have this "potpourri of pods" that need different resources for the applications running within. This use case is more of an orchestrated idea, maybe 2 pods running an application and 2 running a service. The homogeneous use case is more for different ranks of an MPI application.

At least that is my understanding - I think @milroy probably can better comment!
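
As a rough sketch of how that could be handled (hypothetical types, not the actual fluence code), the pods in a group could be bucketed by their resource shape, and each distinct shape would then become its own resource entry with a count, similar to the second jobspec example above:

package main

import "fmt"

// PodShape is a hypothetical, simplified stand-in for the summed resource
// request of one pod's containers.
type PodShape struct {
    CPU int
    GPU int
}

// GroupByShape counts how many pods in a group share each distinct resource
// shape, so a heterogeneous ("potpourri") group becomes one entry per shape
// with a count, instead of one representative pod multiplied by group size.
func GroupByShape(pods []PodShape) map[PodShape]int {
    counts := map[PodShape]int{}
    for _, p := range pods {
        counts[p]++
    }
    return counts
}

func main() {
    pods := []PodShape{
        {CPU: 32, GPU: 1}, {CPU: 32, GPU: 1}, // two ML pods
        {CPU: 16}, {CPU: 16}, // two service pods
    }
    for shape, count := range GroupByShape(pods) {
        fmt.Printf("%d pod(s) need cpu=%d gpu=%d\n", count, shape.CPU, shape.GPU)
    }
}

The homogeneous case falls out naturally: all pods land in one bucket, which is the "representative pod multiplied by count" behavior we have today.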

vsoch commented 3 months ago

> hm I have to say I don't remember that well how to define jobspecs... I was much better before lol

@cmisale I forget almost every time there's a period between working on them, and have to remind myself what the heck a "slot" is... :laughing:

This is me reading in my vast library of books about flux trying to answer that question...

[image]

I started reading in my late 30s, and I'm still not sure what "total" vs "per_slot" is, but likely someone will figure it out eventually. I humbly request it on my tombstone: "Here lies Vanessa, she wanted a 'per_slot' task for her test run of lammps."