flux-framework / flux-sched

Fluxion Graph-based Scheduler

Consider running a separate qmanager for each queue #1258

Open trws opened 4 months ago

trws commented 4 months ago

@garlick had a really interesting idea to run separate instances of our modules to handle separate queues. I had originally thought about this in terms of the resource module, because it would greatly reduce the complexity there, but it has a severe downside in that queues could never share resources. On the other hand, running multiple instances of qmanager would simplify things slightly less, but still some, and it would mean all the queues could process requests in parallel. There's no shared state required between queues, so as far as I can tell all it would cost is a (pretty small) amount of RAM and some TIDs. Would love thoughts on this. If we do the protocol tweaking we've discussed, we could even have it be that each queue is assumed to be an instance of a service, which would make it easy to load others side-by-side.
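
To make the "each queue is an instance of a service" idea a bit more concrete, here is a rough sketch from the Python bindings' point of view. This is not the actual Fluxion protocol; the per-queue service topic and payload are hypothetical and only illustrate how side-by-side instances could be addressed.

```python
# Hypothetical sketch only: assumes each queue's qmanager instance registers
# its own service name (e.g. "sched-fluxion-qmanager-<queue>"), which is not
# how the current single-module protocol works.
import flux

h = flux.Flux()

def queue_rpc(queue, method, payload):
    # Route the request to the per-queue service instance.
    topic = f"sched-fluxion-qmanager-{queue}.{method}"
    return h.rpc(topic, payload).get()

# e.g. ask a hypothetical "debug" queue instance for its status:
# print(queue_rpc("debug", "status", {}))
```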

grondo commented 4 months ago

@ryanday36 might want to comment. I think we're using overlapping queues in production, but I don't know if both are ever active at the same time, so maybe it doesn't matter.

ryanday36 commented 4 months ago

The overlapping queues are ideally never active at the same time, though if this were to be implemented it would be good to also put some controls in flux-core to ensure that a queue can't be activated if it would overlap with an already active queue.

The main door that this would close would be the ability to use overlapping queues to implement different policies (limits, priorities, preemption, etc.; the things needed for 'standby', 'exempt', and 'expedite') for different sets of users on the same nodes, a la LSF. Slurm keeps those concerns largely separate (via QoS), and an approach like that probably makes sense for how Flux is architected. Policy things like limits and priorities should probably be wholly contained in flux-accounting, with sched just concerned with matching jobs to resources.

That said, it does still make me a little bit uncomfortable to close off paths before the design work for things like https://github.com/flux-framework/flux-core/issues/5739, https://github.com/flux-framework/flux-core/issues/5205, and https://github.com/flux-framework/flux-core/issues/4306 has been done. It seems like those things should be doable in flux-core and flux-accounting with non-overlapping queues, but I'd feel better about it if we had a more concrete idea of what that will end up looking like.

ryanday36 commented 4 months ago

A couple more small but important questions. As @grondo mentioned above, we don't have overlapping queues active at the same time, but they can be enabled at the same time so that users can submit jobs to inactive queues that will be run later. Would running separate qmanagers still allow us to do this? Also, when we switch active queues, it doesn't currently affect running jobs (they continue to run until they complete or hit their time limit). Would you still be able to allow jobs in one queue to continue if you switched to a separate queue with a different qmanager?

garlick commented 4 months ago

Great points @ryanday36 and I agree we should not be too hasty here and should revisit those use cases.

It is sort of appealing to have a queue be directly associated (1:1) with a scheduler instance though.

One thought regarding overlapping queues is maybe we could have each scheduler instance "acquire" the full resource set but only mark the queue's configured set "up" in each scheduler instance/queue. Then when the queue configuration changes dynamically, mark them "down" in the donor, but not "up" in the recipient until they are no longer allocated.
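
A minimal sketch of how that donor/recipient bookkeeping could work, assuming each scheduler instance keeps its own up/down view of the shared, fully acquired resource set. Names and structure here are hypothetical, not the resource module's actual internals.

```python
# Hypothetical per-scheduler-instance view of the full resource set.
class QueueResourceView:
    def __init__(self, all_ranks, queue_ranks):
        self.all_ranks = set(all_ranks)   # full acquired resource set
        self.up = set(queue_ranks)        # only this queue's configured ranks start "up"
        self.allocated = set()            # ranks currently allocated by this queue

    def donate(self, ranks):
        # Donor side: mark ranks "down" immediately so no new jobs match them.
        self.up -= set(ranks)
        # Only ranks with no running allocation are free for the recipient right away.
        return [r for r in ranks if r not in self.allocated]

    def receive(self, ranks, donor_drained):
        # Recipient side: only mark "up" the ranks the donor has fully freed;
        # the rest follow later as the donor's jobs complete.
        self.up |= set(ranks) & set(donor_drained)
```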

trws commented 4 months ago

There's a good bit to unpack and explain here, sorry if this ends up being verbose. There are two different but related suggestions going on here. We currently have a system like this:

```mermaid
flowchart LR
  subgraph qmanager-module
    queue1
    queue2
    queue3
  end
  subgraph resource-module
    subgraph properties
      q1
      q2
      q3
    end
  end
  queue1-->q1
  queue2-->q2
  queue3-->q3
```

The original suggestion, which would make dealing with queue constraints in resource go away completely, was to split resources between different instances of the resource module so it would look like this:

```mermaid
flowchart LR
  subgraph qmanager-module
    queue1
    queue2
    queue3
  end
  subgraph resource-module-1
    subgraph properties-1
      q1
    end
  end
  subgraph resource-module-2
    subgraph properties-2
      q2
    end
  end
  subgraph resource-module-3
    subgraph properties-3
      q3
    end
  end
  queue1-->q1
  queue2-->q2
  queue3-->q3
```

This means there would be no shared knowledge of resource status between queues, and as a result it would force us to allow a resource to exist in only one queue. It would significantly reduce matching complexity, but might not be worth it.

What I meant to be proposing here was this split instead, or at least as a first step:

```mermaid
flowchart LR
  subgraph qmanager-module-1
    queue1
  end
  subgraph qmanager-module-2
    queue2
  end
  subgraph qmanager-module-3
    queue3
  end
  subgraph resource-module
    subgraph properties
      q1
      q2
      q3
    end
  end
  queue1-->q1
  queue2-->q2
  queue3-->q3
```

The multi-qmanager version still has a single source of truth for whether a resource is allocated or not, so there is no need to worry about resources being shared between queues (well, there are still some problems there, but no fundamental ones). As it is, the queues share no state, so this should mostly be a matter of checking, and possibly tweaking, the protocol to make sure each queue module gets its messages. It wouldn't prevent us from having overlapping queues, though it might prevent us from enforcing some constraints on ordering or fairness in terms of which of the overlapping queues gets to free resources first.
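
As a toy illustration of the "single source of truth" point, here is a hedged sketch (hypothetical names, not Fluxion internals) of why per-queue qmanager instances can safely share resources as long as every match goes through one resource module:

```python
# Toy model: several per-queue "qmanager instances" all funnel allocation
# requests through one shared pool, so a resource can never be double-booked
# even if two queues overlap.
import threading

class SharedResourcePool:
    def __init__(self, ranks):
        self.free = set(ranks)
        self.lock = threading.Lock()

    def match(self, nranks):
        # Allocation state has exactly one owner, regardless of which queue asks.
        with self.lock:
            if len(self.free) < nranks:
                return None
            return {self.free.pop() for _ in range(nranks)}

    def release(self, ranks):
        with self.lock:
            self.free |= set(ranks)

pool = SharedResourcePool(range(8))
# Two queues matching against the same pool:
print(pool.match(4), pool.match(6))  # second request fails: only 4 ranks remain
```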

On the other hand, the resource split would rule out everything you're worried about, which is why I'm not really sure it's a good idea. @garlick's last suggestion would actually help with that a lot, so it might make the split feasible, but I don't think we would want to go that far, at least not until we have a better handle on the consequences.