Open stackedsax opened 9 months ago
Instead of a sub landscape for batch scheduling, I'd really be in favor of just adding it to the main landscape (in orchestration + management).
Are you cool with that? Then you can PR it into the main landscape.
On Fri, Feb 23, 2024 at 4:17 PM Alexander Scammon wrote:
As part of the CNCF Batch Working Group https://github.com/cncf/landscape#new-entries (part of the TAG Runtime https://github.com/cncf/tag-runtime), we'd like to discuss adding a sub landscape focused on Batch Scheduling similar to the wasm sub landscape https://github.com/cncf/landscape/issues/2387.
Example Draft
To illustrate what we were hoping to do, we worked up an example Batch Scheduling landscape here:
Please note that this is merely a rough draft of what a Batch Scheduling landscape could look like. We anticipate more projects will be added as we socialize this landscape throughout the community.
If this discussion would be better in a PR, we'd be happy to submit the changes that would be necessary and we can have the discussion there.
Rationale
The conversation around Batch Schedulers in the context of cloud and Kubernetes has been a complicated one over the last couple of years. As AI/ML continues to dominate discussions, the desire for solutions in this space has amplified. However, we find that people who want to solve this particular challenge often don't know where to start and don't know that there are existing options available.
As a result, companies often create their own bespoke solutions. Just about every KubeCon, another company announces that they are planning to open-source their new Batch Scheduler, often with extremely similar properties to the existing solutions. We'd much prefer to guide people to join forces on the existing solutions, ideally contributing to the conversations ongoing in the Kubernetes Batch Working Group https://github.com/kubernetes/community/blob/master/wg-batch/README.md (a sister working group to the CNCF group, working on k8s-specific issues) around Kueue https://github.com/kubernetes-sigs/kueue and improving the core of Kubernetes to be more Batch Scheduling-friendly.
We think adding a landscape for Batch Scheduling could help bring awareness to the community that potential solutions already exist and that they have a place to start from.
We don't intend for the landscape to answer every question people have about Batch Scheduling on Kubernetes. Much like the vast CNCF landscape itself, it will be a starting point for people to work from and do their own diligence on what will work for them.
We don't relish bringing more complexity to an already overwhelming array of options on the existing landscape (and we really appreciate the recent improvements and simplifications in the recent update). However, there did not seem to be any meaningful way of describing the current landscape of Batch Schedulers within the context of the larger landscape. We are open to ideas, of course, which is why we're reaching out for discussion.
-- Cheers,
Chris Aniszczyk https://aniszczyk.org
We tried to make that approach work at first and it really didn't fit. The problem is that there are a bunch of batch schedulers that need to be mentioned (Slurm/SUNK, LSF, PBS, etc.) that don't really belong in the larger cloud landscape. Yet, in terms of batch scheduling, we'd like to acknowledge that there are ways of using these more traditional batch schedulers in the context of k8s.
honestly I don't mind listing SLURM and some of that in the larger scape but that's just me.
I need to figure out how to come up with some rules about sub landscapes and how to ensure we don't have TOO MANY of them.
honestly I don't mind listing SLURM and some of that in the larger scape but that's just me.
IMO, it's better to identify the projects around batch systems for HPC & AI, as some projects have dependencies, e.g. Volcano/Kueue vs. k8s.
We're trying to propose different solutions in the ecosystem with those projects.
I need to figure out how to come up with some rules about sub landscapes and how to ensure we don't have TOO MANY of them.
Do we have something like a label/tag to filter projects? That may make it easier.
We tried to make that approach work at first and it really didn't fit.
I am curious about where and why it didn't fit.
I need to figure out how to come up with some rules about sub landscapes and how to ensure we don't have TOO MANY of them.
I agree, once you open the gates, there's no going back.
Projects like Slurm do belong to orchestration + management/Scheduling & Orchestration, but Volcano/Kueue are plugins to K8s and they don't modify the scheduler functionality, so that does open the question of a subcategory in orchestration + management for these types of applications. There is an entire ecosystem being built ON TOP OF Kubernetes to enable AI/ML workloads (let's leave the HPC word out of this). So it makes me think in favor of an orchestration + management/batch + workload engines subcategory.
My 2 cents...
+1 to melding this (and most other notions of "sub-landscape") into the larger one, albeit w/ an appropriate tag/label and the ability to depict it through a lens (e.g. batch) based on a query filter. Even within the context of "batch" there are a number of things that IMO ought to show up that are substantial but don't wholly live within the category (e.g. multi-cluster scheduling, gang / co-scheduling, feature discovery, DRA, et cetera are all relevant but certainly not confined to batch).