argonne-lcf / balsam

High-throughput workflows and automation for HPC

A unified platform to manage high-throughput workflows across the HPC landscape.

Run Balsam on any laptop, cluster, or supercomputer.

$ pip install --pre balsam
$ balsam login
$ balsam site init my-site


Python class-based declaration of Apps and execution lifecycles.

from balsam.api import ApplicationDefinition

class Hello(ApplicationDefinition):
    site = "my-laptop"
    command_template = "echo hello {{ name }}"

    def handle_timeout(self):
        # Lifecycle hook: flag a timed-out job so that it is eligible to restart.
        self.job.state = "RESTART_READY"
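
To make the `{{ name }}` placeholder in `command_template` concrete, here is a minimal standalone sketch of how such a template could be filled from a dictionary of job parameters. It is an illustration only, not Balsam's own rendering code: `render_command` and `PLACEHOLDER` are hypothetical names, and the sketch uses a plain regular expression rather than a full template engine.

```python
import re

# Hypothetical helper, NOT part of the Balsam API.
# Matches placeholders such as "{{ name }}"; whitespace inside the braces is optional.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_command(template, parameters):
    """Replace each {{ key }} placeholder with the matching parameter value."""

    def substitute(match):
        key = match.group(1)
        if key not in parameters:
            # Fail loudly instead of emitting a half-rendered command line.
            raise KeyError(f"missing parameter: {key}")
        return str(parameters[key])

    return PLACEHOLDER.sub(substitute, template)


if __name__ == "__main__":
    print(render_command("echo hello {{ name }}", {"name": "world!"}))
    # prints: echo hello world!
```

Under this reading, each job supplies the values and the App supplies the command shape, so one App definition can serve many jobs with different parameters.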

Seamless remote job management.

# On any machine with internet access...
from balsam.api import Job, BatchJob

# Create Jobs:
job = Job.objects.create(
    site_name="my-laptop",
    app_id="Hello",
    workdir="test/say-hello",
    parameters={"name": "world!"},
)

# Or allocate resources:
BatchJob.objects.create(
    site_id=job.site_id,
    num_nodes=1,
    wall_time_min=10,
    job_mode="serial",
    project="local",
    queue="local",
)

Dispatch Python Apps across heterogeneous resources from a single session.

import numpy as np
from balsam.api import ApplicationDefinition, Job

class MyApp(ApplicationDefinition):
    site = "theta-gpu"

    def run(self, vec):
        from mpi4py import MPI
        rank = MPI.COMM_WORLD.Get_rank()
        print("Hello from rank", rank)
        return np.linalg.norm(vec)

jobs = [
    MyApp.submit(
        workdir=f"test/{i}", 
        vec=np.random.rand(3), 
        ranks_per_node=4,
        gpus_per_rank=0,
    )
    for i in range(10)
]

for job in Job.objects.as_completed(jobs):
    print(job.workdir, job.result())
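
The final loop follows a "handle results as they finish" pattern. For readers unfamiliar with it, the sketch below shows the same idea with only the Python standard library. It assumes, without confirmation from this README, that `Job.objects.as_completed` behaves analogously to `concurrent.futures.as_completed`, yielding items in completion order rather than submission order. `slow_square` and `collect_in_completion_order` are hypothetical names used for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def slow_square(x, delay):
    """Stand-in for a remote task: wait for `delay` seconds, then return x squared."""
    time.sleep(delay)
    return x * x


def collect_in_completion_order(tasks):
    """Run (x, delay) tasks concurrently; return (x, result) pairs as each one finishes."""
    finished = []
    with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
        # Map each future back to its input so results can be labeled.
        futures = {pool.submit(slow_square, x, delay): x for x, delay in tasks}
        for future in as_completed(futures):
            finished.append((futures[future], future.result()))
    return finished


if __name__ == "__main__":
    for x, result in collect_in_completion_order([(1, 0.3), (2, 0.1), (3, 0.2)]):
        print(x, result)
```

The benefit of this pattern is that a slow task does not delay handling of tasks that have already finished.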

Features