@natefoo and I spent some time discussing this last night, just wanted to distill that conversation/record it somewhere public and discoverable.
We looked at Mesos a bit.
Found Airavata:
"Scientific gateway developers use Airavata as their middleware layer between job submissions and grid systems. Airavata supports long running applications and workflows on distributed computational resources."
And they have a nice picture.
Thanks @erasche!
To add to this, one of the difficulties regarding - what would you call them, data center abstraction and orchestration technologies? - is that they are mainly designed for service architectures, whereas Galaxy has a very HPC/HTC flow: tons of short-lived jobs working on independent datasets, not part of a balanced service backend. Most of these technologies don't seem to have a strong notion of controlled/rule-based scheduling beyond simple resource requirements.
Mesos, however, seems less opinionated than most and so potentially a good fit, but if (as @erasche explained) we're understanding things correctly, it's just the piece that connects to and provides an access mechanism for resources; it's not the thing that makes the decisions - that thing is the framework - and there aren't frameworks specifically geared toward HPC workloads. Additionally, the comment from the defunct mesos-hpc-framework linked by @erasche is concerning regarding what success we may have in getting Mesos to do what we need:
This project is deprecated. The mesos offer/subscription mechanism doesn't work well dealing with workload managers. You should use our cloudify HPC plugin instead.
I've made even less sense of Cloudify so I have no idea if it's a potential solution. It seems oriented at controllable virtualization, but we need a metascheduler that understands traditional HPC (with DRMs), non-cloud hosts (virtualized or metal but not controllable), and clouds.
And at least for me, I need a fair amount of control over scheduling decisions. I left this out of my original post but in addition to the compute we control (where we could forego the DRM entirely if the metascheduler provided a better solution), we also need to schedule on supercomputers we don't control, currently Stampede2 (TACC) and Bridges (PSC).
Bridges can be left out of the metascheduler entirely if needed, since almost all of the tools that we run on it are run exclusively there, although we do some ugly metascheduling for rnastar.
Stampede2, on the other hand, should be eligible to run most (all?) of the multicore jobs (but should be considered lower priority) and is not something we can run agents on since execution is entirely through a DRM (Slurm).
So we have a slightly greater understanding of some things in the space but not really a better idea on whether there's a solution. Any Mesos experts who can correct some of these thoughts or give some insight on whether it's worth pursuing would be greatly appreciated.
Maybe @saxtouri can help with Mesos questions?
@natefoo @erasche About the meta-schedulers that are out there: the German UNICORE (developed at Jülich) is Java based, open source, suitable for Windows and Linux, and has been the official meta-scheduler in PRACE for years. The CERN ARC Control Tower (aCT, part of the NorduGrid project) is mainly Python based, open source, suitable for Linux, and has been (and still is) the official meta-scheduler for CERN jobs and EGI. UNICORE is mainly designed for HPC jobs while aCT is for HTC jobs. The ARC release manager is next door (in case you guys want to discuss something, e.g. features), we've been working on supporting container jobs (Docker and Singularity), and the ARC guys have a nice pilot for supporting meta-scheduling to HTCondor and others.
Thanks @abdulrahmanazab!
I'll look into these. Is ARC online somewhere? All I'm finding are papers and slides.
@natefoo The ARC release manager is @maikenp and she is interested in supporting this.
I do not want to write a complex metascheduler in Galaxy. These tools already exist and probably do it better than Galaxy would. But I am simply unsure whether any of them can be effectively utilized for our purposes.
Thankfully @nuwang did this; Total Perspective Vortex suits all of my metascheduling needs.
I have mulled this over myself quite a bit and with other admins, and am throwing it out here for wider discussion. As far as I know, I'm the only person in this situation, but maybe there are others who could benefit from a solution to this problem.
Galaxy currently makes it fairly easy to schedule across multiple types of compute resources, from local clusters to remote HPC and cloud instances. However, it lacks certain metascheduling functionality, which makes distributing jobs in an automated way somewhat difficult.
The scheduling dilemma
Consider the following scenario, which currently exists on usegalaxy.org:
The desire is to distribute jobs across these 3 clusters balanced by load. Galaxy currently makes this very easy to do (with static or dynamic mappings) by tool, user, datasets, etc., but not by load of the compute resource. Ideally I could just submit these to Slurm and let it make that decision. Indeed, Slurm has the ability to do this, but in practice, for Galaxy, it is not possible: to submit on clusters 2 and 3, the inputs must be staged (by Pulsar) and the job script is written differently (by Pulsar).
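For reference, this is the kind of delegation Slurm itself can do when a job is eligible to run in more than one place (a hedged sketch; the partition/cluster names are hypothetical and the actual topology may differ):

```sh
# If the clusters are partitions under one Slurm controller, Slurm can pick
# whichever listed partition offers the earliest start time:
sbatch --partition=cluster1,cluster2,cluster3 job.sh

# With multi-cluster operation, the same idea applies across controllers:
sbatch --clusters=cluster1,cluster2,cluster3 job.sh
```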
In light of this problem, we currently "balance" jobs across these 3 clusters with an increasingly ugly dynamic rule that submits a "test job" to Slurm with `sbatch`'s `--test-only` flag to see which cluster Slurm would select, and then submits the real job to the appropriate destination.
But there's a problem here: for Pulsar jobs, the job is only added to the Slurm queue after all of the data is staged, which could take quite a long time. In the meantime, Slurm still sees that cluster as being relatively unloaded, and `sbatch --test-only` continues to indicate that jobs should be sent to that cluster. With enough jobs in rapid succession, a single cluster can quickly become overloaded in a very "unbalanced" way.
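Roughly, the test-job trick looks like this (a simplified sketch; the partition names are hypothetical, and the real dynamic rule also has to map Slurm's answer back to a Galaxy destination):

```sh
# Ask Slurm where it *would* start this job, without actually queueing it.
# --test-only reports the expected start time, node(s), and partition on
# stderr, along the lines of:
#   sbatch: Job 123456 to start at 2019-01-01T00:00:00 using 16 processors
#   on nodes c0123 in partition cluster2
# The dynamic rule parses that output and submits the real job to the
# corresponding destination.
sbatch --test-only --partition=cluster1,cluster2,cluster3 test_job.sh
```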
The staging dilemma
A potentially easy fix for this is to move the staging and the job script generation to the job itself, e.g. a job script becomes something like:
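(A minimal sketch of the idea; `pulsar-stage-in` and `pulsar-stage-out` are hypothetical stand-ins for whatever staging entry points Pulsar would expose, and the directive values are made up:)

```sh
#!/bin/bash
#SBATCH --partition=cluster2
#SBATCH --ntasks=16

# Pull inputs down to the compute cluster as the first step of the job itself,
# rather than before submission (hypothetical helper).
pulsar-stage-in --job-id "$GALAXY_JOB_ID" --staging-dir "$STAGING_DIR" || exit 1

# Run the actual tool command line that was written into the staging dir.
bash "$STAGING_DIR/tool_script.sh"

# Push results back to the Galaxy server (hypothetical helper).
pulsar-stage-out --job-id "$GALAXY_JOB_ID" --staging-dir "$STAGING_DIR"
```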
This would allow us to schedule the job immediately and not worry about where it's going. The dilemma is that it wastes a job's walltime staging data.
To mitigate this, we could use a middle-ground solution, where the job is submitted immediately, Pulsar stages the data, and then writes a file indicating when staging is complete (actually, I think it already does that part). The job may sit in the queue for part of this period, but when it runs it could just spin (with some configured timeout) on the completion of staging. After timeout, it could be rescheduled by Pulsar. This allows the DRM (Slurm) to know how busy the cluster is without wasting too much time on jobs waiting for staging.
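As a sketch, the start of such a job script might spin like this (the sentinel file name, timeout, and exit-code convention are all made up for illustration):

```sh
# Wait (up to a configured timeout) for Pulsar to signal that staging is done.
STAGING_SENTINEL="$STAGING_DIR/staging_complete"   # hypothetical file written by Pulsar
TIMEOUT=3600                                       # seconds to wait before giving up
waited=0
while [ ! -f "$STAGING_SENTINEL" ]; do
    if [ "$waited" -ge "$TIMEOUT" ]; then
        echo "staging did not finish within ${TIMEOUT}s, exiting so Pulsar can reschedule" >&2
        exit 42   # hypothetical exit code Pulsar would treat as "requeue me"
    fi
    sleep 30
    waited=$((waited + 30))
done

# ...then proceed with the tool command line as usual.
```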
Reservations
I'm not super excited about this idea: