cncf / tag-runtime

🏃🏿‍♀️🏃🏽‍♀️🏃🏻‍♂️🕒CNCF Technical Advisory Group for Runtime
https://tag-runtime.cncf.io
Apache License 2.0

Batch System Initiative (BSI) Working Group #38

Closed k82cn closed 2 years ago

k82cn commented 2 years ago

Recently, we've been talking with different communities about how to support batch workloads in cloud native environments, and we found it necessary to align the different implementations with a specification. That will make integration easier for frameworks such as Kubeflow.

So I'd like to propose a new working group for batch workloads that builds a related specification for the community. For example, the Kubeflow community could use this specification to work with k8s, Volcano and YuniKorn, or even with Slurm and HTCondor :)

I'll draft a proposal with more detail on that working group; if there are any more comments, please let me know :)

raravena80 commented 2 years ago

@k82cn This is great! Looking forward to the first draft of the charter. In the past we have used Google docs for drafts but feel free to use what works best. Thanks! cc: @rochaporto @mrbobbytables @stackedsax @jimbobby5 @yuanchen8911

rochaporto commented 2 years ago

Definitely interested in helping out with this, we've talked about it multiple times in the past! Really nice!

rochaporto commented 2 years ago

Dropping in some efforts that also target fair share and elastic capacity: https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/pkg/capacityscheduling

yuanchen8911 commented 2 years ago

Great! Thanks for keeping me in the loop. /cc @xujyan

stackedsax commented 2 years ago

I don't know if any of this old conversation is still relevant, but I thought I'd add it for historical purposes:

We were all so young and innocent....

raravena80 commented 2 years ago

@stackedsax thanks for sharing! (that was 2 1/2 years ago 😮)

denkensk commented 2 years ago

/cc

yuanchen8911 commented 2 years ago

/cc @Huang-Wei

yangwwei commented 2 years ago

Thank you @k82cn this is great. The requirements for the batch workloads are really similar, we should be able to abstract out a spec that aligns different projects together. I'd be happy to work with you and others on this.

k82cn commented 2 years ago

> Thank you @k82cn this is great. The requirements for the batch workloads are really similar, we should be able to abstract out a spec that aligns different projects together. I'd be happy to work with you and others on this.

Great! Looking forward to working together on that :)

yuanchen8911 commented 2 years ago

Just wanted to throw out some high level thoughts.

Firstly, the solution needs to be Kubernetes native and extensible with

Common feature requirements

FYI, our recent KubeCon talk on batch support (/cc @denkensk)

/cc @xujyan @k82cn @Huang-Wei @rochaporto

yuanchen8911 commented 2 years ago

A Batch SIG or WG kubernetes/community#6263

Putting aside the different ideas, shall we work together as a single virtual team to create a common forum/group for the topic? @ahg-g, @k82cn, @Huang-Wei, @rochaporto, @raravena80

k82cn commented 2 years ago

> A Batch SIG or WG kubernetes/community#6263
>
> Put aside the different ideas, shall we work together as a single virtual team to create a common forum/group for the topic? @ahg-g, @k82cn, @Huang-Wei , @rochaporto @raravena80

A CNCF WG is a good place to host such a virtual team across communities, as was done for CNI.

yangwwei commented 2 years ago

CNI/CSI are great examples. Thank you @k82cn. Can we have something like the following (the names and format are not final):

Such a job/queue definition can be backed by different schedulers: the default scheduler, scheduler-plugins, Volcano, or YuniKorn. How these properties are used will vary between implementations, but essentially this gives the scheduler enough of a "hint" to know how to better schedule a job. This will give a certain consistency of behavior for scheduling batch jobs on K8s.
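To make the "hint" idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: `BatchJobSpec`, its fields (`queue`, `tasks`, `min_available`, `priority`), and the two scheduler classes are hypothetical names, not part of any existing or proposed API in Kubernetes, Volcano, or YuniKorn. It shows one shared job definition being interpreted differently by two backends.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class BatchJobSpec:
    """Hypothetical scheduler-neutral job definition (all field names are assumptions)."""
    name: str
    queue: str           # logical queue the job is submitted to
    tasks: int           # total number of tasks in the job
    min_available: int   # hint: smallest number of tasks worth starting together
    priority: int = 0    # hint: relative importance within the queue


class Scheduler(Protocol):
    def plan(self, job: BatchJobSpec, free_slots: int) -> int:
        """Return how many tasks to start right now."""
        ...


class GangScheduler:
    """Hypothetical backend that treats min_available as all-or-nothing."""

    def plan(self, job: BatchJobSpec, free_slots: int) -> int:
        if free_slots < job.min_available:
            return 0  # not enough room for the gang, so start nothing
        return min(free_slots, job.tasks)


class BestEffortScheduler:
    """Hypothetical backend that ignores the gang hint and starts whatever fits."""

    def plan(self, job: BatchJobSpec, free_slots: int) -> int:
        return min(free_slots, job.tasks)


job = BatchJobSpec(name="train", queue="ml", tasks=8, min_available=4)
for scheduler in (GangScheduler(), BestEffortScheduler()):
    # Same spec, different interpretation of the hints.
    print(type(scheduler).__name__, scheduler.plan(job, free_slots=3))
```

With three free slots, the gang-style backend waits while the best-effort backend starts three tasks. The spec itself stays the same, which is the consistency a shared definition is meant to provide.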

k82cn commented 2 years ago

> Such a job/queue definition can be backed by different schedulers: the default scheduler, scheduler-plugins, Volcano, or YuniKorn. How these properties are used will vary between implementations, but essentially this gives the scheduler enough of a "hint" to know how to better schedule a job.

Exactly! That's why I prefer to have such a WG in CNCF instead of in an individual community for the batch API/specification :)

wsxiaozhang commented 2 years ago

It's great to see different teams converging on the same category of batch work. Some initial work has been public for a while, as @yuanchen8911 and @yangwwei listed. Since different people (from HPC, AI/ML, big data, and other large-volume data processing areas like simulation and genomics) may have different views of batch, I'm really looking forward to working out together what the concrete scope and target scenarios are for this.

denkensk commented 2 years ago

Thanks to @rochaporto and @yuanchen8911 for mentioning our work on batch scheduling and management in https://github.com/kubernetes-sigs/scheduler-plugins and https://github.com/kube-queue/kube-queue .

I look forward to everyone who is interested in Batch Compute collaborating to promote the development of Batch on Kubernetes. Nice work @k82cn

Before we discuss the definition of interfaces and components, I hope we can clarify our goals and scope. It would be nice if we could add a description of the benefits of our WG (such as easier integration with other projects like Spark/Kubeflow).

> How these properties are used will vary between implementations, but essentially this gives the scheduler enough of a "hint" to know how to better schedule a job.

@yangwwei I'm a little concerned about having enough expressiveness to be compatible with different architectural implementations. But this can be discussed later. ^_^

yuanchen8911 commented 2 years ago

Some additional comments in the other thread

https://github.com/kubernetes/community/issues/6263#issuecomment-990121461

k82cn commented 2 years ago

> Some additional comments in the other thread
>
> kubernetes/community#6263 (comment)

Thanks for the input. The WG in k/k should focus only on Kubernetes, while a WG in CNCF will help collaboration across projects :)

dims commented 2 years ago

@k82cn Klaus, the key here is the folks in the CNCF WG should help figure out what we need to do in k/k actively based on the ideas/collaboration in CNCF. I hope that happens. There's a lot to be said about functionality that comes out of the box in k8s and the strength of conformance testing to ensure things work across k8s distributions.

k82cn commented 2 years ago

> the CNCF WG should help figure out what we need to do in k/k actively based on the ideas/collaboration in CNCF

Definitely; one of the major targets of the WG is to collaborate with related projects to clarify the scope and interface.

And here's a draft charter of the WG; if there are any more comments, please let me know :)

stackedsax commented 2 years ago

Thanks @k82cn.

I created a Slack channel and added it to the charter:

I tried to guess everyone's Slack handle, but @yuanchen8911 @denkensk @wsxiaozhang I don't know that I got you all correctly. Please jump in if I've missed you or invited the wrong person.

I also added a couple of items and comments to the charter. Thanks again!

yuanchen8911 commented 2 years ago

> Thanks @k82cn.
>
> I created a Slack channel and added it to the charter:
>
> I tried to guess everyone's Slack handle, but @yuanchen8911 @denkensk @wsxiaozhang I don't know that I got you all correctly. Please jump in if I've missed you or invited the wrong person.
>
> I also added a couple of items and comments to the charter. Thanks again!

Thanks, Alex!

denkensk commented 2 years ago

@stackedsax Thanks

haosdent commented 2 years ago

> Thanks for the input. The WG in k/k should only focus on kubernetes, and a WG in CNCF will help to collaborate across projects :)

Cool! +100 for the Specification like CNI.