flyteorg / flyte

Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
https://flyte.org
Apache License 2.0

[Core feature] Slurm agent #5634

Open BerndDoser opened 3 months ago

BerndDoser commented 3 months ago

Motivation: Why do you think this is important?

Slurm is a widely used workload management system in many HPC (High-Performance Computing) compute clusters. It plays a vital role in efficiently allocating compute resources, running work on these allocated resources, and overseeing a queue of pending work.

Integrating Slurm with Flyte is an opportunity to make Flyte workflows far more usable on HPC clusters.

Goal: What should the final outcome look like, ideally?

The aim is to implement a Flyte agent that submits tasks to HPC resources through the Slurm scheduler.

Typically, users interact with Slurm through its command-line interface (CLI). For instance, the sbatch command submits a job script for later execution. An optional Slurm daemon, slurmrestd, also offers a REST API for interacting with the Slurm system.
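To make the CLI route concrete, here is a minimal sketch of how an agent might wrap sbatch. The helper names (build_sbatch_command, parse_job_id) are hypothetical and not part of Flyte or Slurm; the only Slurm behaviour assumed is that sbatch accepts --parsable, --job-name and --partition, and that it reports the job ID either as "Submitted batch job <id>" or, with --parsable, as "<id>" optionally followed by ";<cluster>".

```python
import re
from typing import List, Optional


def build_sbatch_command(
    script_path: str,
    job_name: Optional[str] = None,
    partition: Optional[str] = None,
) -> List[str]:
    """Assemble an sbatch argument list (hypothetical helper)."""
    # --parsable makes sbatch print only the job ID (and cluster name, if any).
    cmd = ["sbatch", "--parsable"]
    if job_name:
        cmd.append(f"--job-name={job_name}")
    if partition:
        cmd.append(f"--partition={partition}")
    cmd.append(script_path)
    return cmd


def parse_job_id(stdout: str) -> int:
    """Extract the job ID from sbatch output, with or without --parsable."""
    text = stdout.strip()
    match = re.fullmatch(r"(?:Submitted batch job )?(\d+)(?:;.*)?", text)
    if match is None:
        raise ValueError(f"unrecognised sbatch output: {text!r}")
    return int(match.group(1))
```

On a machine with Slurm installed, the command list could be passed to subprocess.run(cmd, capture_output=True, text=True, check=True) and the resulting stdout handed to parse_job_id. Keeping command construction and output parsing as pure functions means both can be tested without a cluster.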

Describe alternatives you've considered

I don't know of anything comparable.

Propose: Link/Inline OR Additional context

I am available to offer support using Slurm and to test the Flyte agent. https://github.com/JBris/slurm-rest-api-docker can be used for testing the Slurm CLI and the Slurm REST API.

Are you sure this issue hasn't been raised already?

Have you read the Code of Conduct?

welcome[bot] commented 3 months ago

Thank you for opening your first issue here! 🛠

runllm[bot] commented 3 months ago

To integrate Slurm with Flyte, you can create a Flyte agent that interacts with the Slurm scheduler using its REST API. This approach leverages Flyte's extensibility and allows for efficient task submission to HPC resources managed by Slurm.
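As a rough illustration of the REST route, the sketch below only builds a job submission request for slurmrestd and does not send it. Everything specific here is an assumption to check against the OpenAPI spec of your own deployment: the API version string, the payload field names (they have changed between slurmrestd versions), the working directory and environment values, and the helper name build_submit_request, which is hypothetical. The X-SLURM-USER-NAME and X-SLURM-USER-TOKEN headers are the ones slurmrestd uses for token authentication.

```python
import json
import urllib.request


def build_submit_request(
    base_url: str,
    user: str,
    token: str,
    script: str,
    job_name: str,
    api_version: str = "v0.0.40",  # assumption: depends on the Slurm release
) -> urllib.request.Request:
    """Build (but do not send) a slurmrestd job submission request."""
    # Assumed payload shape; older API versions place "script" at the top level.
    payload = {
        "job": {
            "name": job_name,
            "script": script,
            "current_working_directory": "/tmp",
            "environment": ["PATH=/usr/bin:/bin"],
        }
    }
    url = f"{base_url.rstrip('/')}/slurm/{api_version}/job/submit"
    headers = {
        "Content-Type": "application/json",
        "X-SLURM-USER-NAME": user,
        "X-SLURM-USER-TOKEN": token,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

Against a running slurmrestd, the request could be sent with urllib.request.urlopen(req) and the JSON response inspected for the job ID. An agent would then poll the matching job endpoint to track state, which is not shown here.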

eapolinario commented 3 months ago

@BerndDoser, would this be something you'd be interested in collaborating on? The guide on how to develop a new agent has a lot of details, including how to develop agents locally (so that the dev cycle is tight).

BerndDoser commented 3 months ago

Hi @eapolinario, thank you for your interest in the feature request. I am generally interested in collaborating. For example, I can set up a Slurm container to test such a feature. I am currently on vacation but will be back at the beginning of September.

kumare3 commented 2 months ago

Please let us know when you try.