@natefoo and I spent some time discussing this last night, just wanted to distill that conversation/record it somewhere public and discoverable.
We looked at Mesos a bit.
Found Airavata:
"Scientific gateway developers use Airavata as their middleware layer between job submissions and grid systems. Airavata supports long running applications and workflows on distributed computational resources."
And they have a nice picture.
Thanks @erasche!
To add to this, one of the difficulties regarding - what would you call them, data center abstraction and orchestration technologies? - is that they are mainly designed for service architectures, whereas Galaxy has a very HPC/HTC flow: tons of short-lived jobs working on independent datasets, not part of a balanced service backend. Most of these technologies don't seem to have a strong notion of controlled/rule-based scheduling beyond simple resource requirements.
Mesos, however, seems less opinionated than most and so potentially a good fit, but if (as @erasche explained) we're understanding things correctly, it's just the piece that connects to and provides an access mechanism for resources; it's not the thing that makes the decisions - that thing is the framework - and there aren't frameworks specifically geared toward HPC workloads. Additionally, the comment from the defunct mesos-hpc-framework linked by @erasche is concerning regarding what success we may have in getting Mesos to do what we need:
This project is deprecated. The mesos offer/subscription mechanism doesn't work well dealing with workload managers. You should use our cloudify HPC plugin instead.
I've made even less sense of Cloudify so I have no idea if it's a potential solution. It seems oriented at controllable virtualization, but we need a metascheduler that understands traditional HPC (with DRMs), non-cloud hosts (virtualized or metal but not controllable), and clouds.
And at least for me, I need a fair amount of control over scheduling decisions. I left this out of my original post but in addition to the compute we control (where we could forego the DRM entirely if the metascheduler provided a better solution), we also need to schedule on supercomputers we don't control, currently Stampede2 (TACC) and Bridges (PSC).
Bridges can be left out of the metascheduler entirely if needed, since almost all of the tools that we run on it are run exclusively there, although we do some ugly metascheduling for rnastar.
Stampede2, on the other hand, should be eligible to run most (all?) of the multicore jobs (but should be considered lower priority) and is not something we can run agents on since execution is entirely through a DRM (Slurm).
So we have a slightly greater understanding of some things in the space but not really a better idea on whether there's a solution. Any Mesos experts who can correct some of these thoughts or give some insight on whether it's worth pursuing would be greatly appreciated.
Maybe @saxtouri can help with Mesos questions?
@natefoo @erasche About the meta-schedulers that are out there: the German UNICORE (developed at Jülich) is Java based, open source, suitable for Windows and Linux, and has been the official meta-scheduler in PRACE for years. The CERN ARC Control Tower (aCT, part of the NorduGrid project) is mainly Python based, open source, suitable for Linux, and has been (and still is) the official meta-scheduler for CERN jobs and EGI. UNICORE is mainly designed for HPC jobs while aCT is for HTC jobs. The ARC release manager is next door (in case you guys want to discuss something, e.g. features), we've been working on supporting container jobs (Docker and Singularity), and the ARC guys have a nice pilot for supporting meta-scheduling to HTCondor and others.
Thanks @abdulrahmanazab!
I'll look into these. Is ARC online somewhere? All I'm finding are papers and slides.
@natefoo The ARC release manager is @maikenp and she is interested in supporting this.
I do not want to write a complex metascheduler in Galaxy. These tools already exist and probably do it better than Galaxy would. But I am simply unsure whether any of them can be effectively utilized for our purposes.
Thankfully @nuwang did this; Total Perspective Vortex suits all of my metascheduling needs.
I have mulled this over myself quite a bit and with other admins, and am throwing it out here for wider discussion. As far as I know, I'm the only person in this situation, but maybe there are others who could benefit from a solution to this problem.
Galaxy currently makes it fairly easy to schedule across multiple types of compute resources, from local clusters to remote HPC and cloud instances. However, it lacks certain metascheduling functionality, which makes distributing jobs in an automated way somewhat difficult.
The scheduling dilemma
Consider the following scenario, which currently exists on usegalaxy.org:
The desire is to distribute jobs across these 3 clusters balanced by load. Galaxy currently makes this very easy to do (with static or dynamic mappings) by tool, user, datasets, etc., but not by load of the compute resource. Ideally I could just submit these to Slurm and let it make that decision. Indeed, Slurm has the ability to do this, but in practice, for Galaxy, it is not possible: to submit on clusters 2 and 3, the inputs must be staged (by Pulsar) and the job script is written differently (by Pulsar).
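For reference, this is the kind of delegation Slurm itself can do when a job is eligible to run in more than one place (a hedged sketch; the partition/cluster names are hypothetical and the actual topology may differ):

```sh
# If the clusters are partitions under one Slurm controller, Slurm can pick
# whichever listed partition offers the earliest start time:
sbatch --partition=cluster1,cluster2,cluster3 job.sh

# With multi-cluster operation, the same idea applies across controllers:
sbatch --clusters=cluster1,cluster2,cluster3 job.sh
```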
In light of this problem, we currently "balance" jobs across these 3 clusters with an increasingly ugly dynamic rule that submits a "test job" to Slurm with `sbatch`'s `--test-only` flag to see which cluster Slurm would select, and then submits the real job to the appropriate destination.
But there's a problem here: for Pulsar jobs, the job is only added to the Slurm queue after all of the data is staged, which could take quite a long time. In the meantime, Slurm still sees that cluster as being relatively unloaded, and `sbatch --test-only` continues to indicate that jobs should be sent to that cluster. With enough jobs in rapid succession, a single cluster can quickly become overloaded in a very "unbalanced" way.
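Roughly, the test-job trick looks like this (a simplified sketch; the partition names are hypothetical, and the real dynamic rule also has to map Slurm's answer back to a Galaxy destination):

```sh
# Ask Slurm where it *would* start this job, without actually queueing it.
# --test-only reports the expected start time, node(s), and partition on
# stderr, along the lines of:
#   sbatch: Job 123456 to start at 2019-01-01T00:00:00 using 16 processors
#   on nodes c0123 in partition cluster2
# The dynamic rule parses that output and submits the real job to the
# corresponding destination.
sbatch --test-only --partition=cluster1,cluster2,cluster3 test_job.sh
```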
The staging dilemma
A potentially easy fix for this is to move the staging and the job script generation to the job itself, e.g. a job script becomes something like:
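(A minimal sketch of the idea; `pulsar-stage-in` and `pulsar-stage-out` are hypothetical stand-ins for whatever staging entry points Pulsar would expose, and the directive values are made up:)

```sh
#!/bin/bash
#SBATCH --partition=cluster2
#SBATCH --ntasks=16

# Pull inputs down to the compute cluster as the first step of the job itself,
# rather than before submission (hypothetical helper).
pulsar-stage-in --job-id "$GALAXY_JOB_ID" --staging-dir "$STAGING_DIR" || exit 1

# Run the actual tool command line that was written into the staging dir.
bash "$STAGING_DIR/tool_script.sh"

# Push results back to the Galaxy server (hypothetical helper).
pulsar-stage-out --job-id "$GALAXY_JOB_ID" --staging-dir "$STAGING_DIR"
```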
This would allow us to schedule the job immediately and not worry about where it's going. The dilemma is that it wastes a job's walltime staging data.
To mitigate this, we could use a middle-ground solution, where the job is submitted immediately, Pulsar stages the data, and then writes a file indicating when staging is complete (actually, I think it already does that part). The job may sit in the queue for part of this period, but when it runs it could just spin (with some configured timeout) on the completion of staging. After timeout, it could be rescheduled by Pulsar. This allows the DRM (Slurm) to know how busy the cluster is without wasting too much time on jobs waiting for staging.
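As a sketch, the start of such a job script might spin like this (the sentinel file name, timeout, and exit-code convention are all made up for illustration):

```sh
# Wait (up to a configured timeout) for Pulsar to signal that staging is done.
STAGING_SENTINEL="$STAGING_DIR/staging_complete"   # hypothetical file written by Pulsar
TIMEOUT=3600                                       # seconds to wait before giving up
waited=0
while [ ! -f "$STAGING_SENTINEL" ]; do
    if [ "$waited" -ge "$TIMEOUT" ]; then
        echo "staging did not finish within ${TIMEOUT}s, exiting so Pulsar can reschedule" >&2
        exit 42   # hypothetical exit code Pulsar would treat as "requeue me"
    fi
    sleep 30
    waited=$((waited + 30))
done

# ...then proceed with the tool command line as usual.
```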
Reservations
I'm not super excited about this idea: