fractal-analytics-platform / fractal-server

Fractal backend
https://fractal-analytics-platform.github.io/fractal-server/
BSD 3-Clause "New" or "Revised" License

SLURM requirements #196

Closed tcompa closed 1 year ago

tcompa commented 1 year ago

While thinking about a new backend, not based on parsl, we should have a clear view of as many SLURM requirements as possible - including worst-case scenarios (e.g. "we'd like to process 1000 wells at the same time, without having 1000 SLURM jobs").

@gusqgm @jluethi , feel free to brainstorm about this and add more info to this issue

jluethi commented 1 year ago

@tcompa Yes, we're preparing an overview of worst-case scenarios and potential applications we'll need to support, which could influence the orchestrator choice :)

jluethi commented 1 year ago

We have discussed this topic further internally and came up with a set of questions and use cases to discuss before we commit to a change in the orchestration engine.

The full document is here: https://docs.google.com/document/d/1WkFZDCELvxXpmu091HwIOCissHnSUkfksJPfzKpccnU/edit# — but I've structured the thoughts a bit more for this issue below.

Current status

As much as possible, we don’t want to invent our own frameworks for things where the open-source community has well-developed tools. The focus of Fractal is not a general workflow engine or an orchestrator, so if we can use an existing orchestrator that is actively maintained, that is valuable. But more and more issues with some parts of Parsl have come up over the past months, and Parsl is very broad and does lots of things we don’t use. Our main issues with Parsl seem to be:

- We use essentially none of the Parsl-specific functionality.
- We keep hitting barriers during Fractal development, which forces us to open new Parsl issues; their responses so far have been present, but not optimal.

Thus, the big question is:

Should we:

1. Build our own backend?
2. Continue with Parsl and our workarounds, as we've done so far?
3. Become more involved with Parsl and contribute to it, to make it work better for us?
4. Choose a different orchestrator?

Scale of workflows that are relevant

Submitting many tasks within the same SLURM job: e.g. if 1000 wells exist but only 50 are run in parallel for ~30 s each, Parsl would submit 50 SLURM jobs and run the 1000 wells through them, 50 at a time. A basic SLURM backend might instead submit 1000 jobs (with 50 running at a time). This is not much of a concern on the UZH side (TissueMaps works mostly like this), but it is a concern for FMI IT. FMI IT wants fewer SLURM jobs — preferably jobs that run > 10 min, and dozens of jobs per user rather than thousands — for two reasons:

1. Listing currently active jobs (that's a taste/culture question).
2. They do accounting of who uses the cluster how much via the SLURM submission database, and we've been told this doesn't scale well with many tiny jobs.

Expected scales: users typically process anything from partial 96-well plates to full 384-well plates — dozens to hundreds of wells, ~TB of images. This can fairly easily grow to tens of TB; we currently have single users with 100+ TB of mostly HCS data, due to the challenges imposed by their projects.

Screening use case: some users may want to process dozens of such plates, so the current upper limit lies on the order of 10,000 wells, though this could be larger depending on the project at hand. Users may have 1 to ~40 acquisitions (multiplexing cycles), and each cycle is mostly processed on its own, so this scales as if we had more wells: e.g. a plate with 100 wells and 40 cycles would have 4000 potential jobs being submitted.
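For concreteness, the "many tasks per SLURM job" idea above can be sketched as a small batching helper (hypothetical names — this is not actual Fractal or Parsl code): it groups N per-well tasks into at most `max_jobs` SLURM jobs, so that e.g. 1000 wells become 50 jobs of 20 wells each instead of 1000 tiny jobs.

```python
import math


def batch_tasks(n_tasks: int, max_jobs: int) -> list[range]:
    """Group n_tasks per-well tasks into at most max_jobs batches,
    where each batch is meant to run inside a single SLURM job."""
    batch_size = math.ceil(n_tasks / max_jobs)
    return [
        range(start, min(start + batch_size, n_tasks))
        for start in range(0, n_tasks, batch_size)
    ]


# 1000 wells, at most 50 SLURM jobs -> 50 batches of 20 wells each
batches = batch_tasks(1000, 50)
print(len(batches), len(batches[0]))  # -> 50 20
```

The same helper covers the multiplexing case: `batch_tasks(4000, 50)` (100 wells × 40 cycles) yields 50 jobs of 80 tasks each, keeping the per-user job count in the "dozens" range FMI IT prefers.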

Submitting to different backends

Parsl should make it relatively easy (though how easy would this actually be in practice?) to submit to things like AWS, Google Cloud and other providers (they list 15 providers here). If Fractal becomes more broadly used in 2023 and beyond, running it on different architectures will become important.
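For context, switching providers in Parsl is (in principle) just a configuration change. A minimal sketch of a SLURM-backed config, using Parsl's `HighThroughputExecutor` and `SlurmProvider` — all parameter values here are illustrative, not Fractal's actual settings:

```python
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import SlurmProvider  # swap for e.g. AWSProvider

config = Config(
    executors=[
        HighThroughputExecutor(
            label="slurm",
            provider=SlurmProvider(
                partition="main",   # illustrative partition name
                nodes_per_block=1,
                init_blocks=1,
                max_blocks=50,      # cap on concurrent SLURM blocks
                walltime="00:30:00",
            ),
        )
    ]
)
```

In principle, pointing Fractal at AWS or Google Cloud would mean replacing `SlurmProvider` with the corresponding provider class; whether the rest of our stack survives that swap unchanged is exactly the open question above.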

Monitoring

Probably not that relevant for the orchestration discussion, since we don't currently use Parsl's monitoring anyway.

Given these questions & concerns, could you discuss this within the team, and then we can discuss it at the Fractal call on Wednesday?

tcompa commented 1 year ago

No further action needed, closing for now.