cubed-dev / cubed

Bounded-memory serverless distributed N-dimensional array processing
https://cubed-dev.github.io/cubed/
Apache License 2.0

Relieve users of cluster management on HPC? #557

Open TomNicholas opened 3 months ago

TomNicholas commented 3 months ago

In #554 and #555 @applio started to add support for running Cubed on Dragon, with the intention of allowing users to run Cubed on HPC.

One thing that's super nice about the way Cubed runs on serverless cloud executors is that the user no longer has to think about the concept of a "cluster" at all. From the blog post:

No cluster to manage: A serverless design means that the user does not have to deploy and manage a cluster at all. Arguably conceptually simpler, this model also means less boilerplate code, no error-prone deployment step, and only paying for computation you actually do, not for the time the cluster is up.

Can we get a similar user experience on HPC somehow? The challenge is that HPC resources are normally controlled by a queuing system like SLURM or PBS, and are therefore not ephemeral in the same way that serverless functions are.

There are at least 2 ways in which I would expect users to want to run Cubed on HPC:

  1. Submitting a Python script as a job
  2. From an interactive node (i.e. in a Jupyter notebook)

We also don't want the users to have to think about exact configuration details on an allocation, e.g. how many threads/processes should be created on a particular system. Even if they do have to choose the size of the resource allocation manually, ideally Dragon would automatically create a sensible number of processes for them.
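To make the "sensible number of processes" idea concrete, here is a minimal sketch of how a default could be inferred from the allocation itself. It is not Dragon's or Cubed's API: `guess_worker_count` is a hypothetical helper, and the choice and ordering of environment variables (`SLURM_CPUS_PER_TASK`, `SLURM_CPUS_ON_NODE`, and `NCPUS`, which PBS Professional is assumed to set) is an assumption that would need checking per site.

```python
import os


def guess_worker_count(env=None, fallback=None):
    """Guess how many worker processes to start inside an HPC allocation.

    Hypothetical helper for illustration only; not part of Cubed or Dragon.
    """
    env = os.environ if env is None else env

    # Prefer hints from the scheduler. Which variables are present depends
    # on the scheduler and on how the job was submitted (assumption).
    for var in ("SLURM_CPUS_PER_TASK", "SLURM_CPUS_ON_NODE", "NCPUS"):
        value = env.get(var, "")
        if value.isdigit() and int(value) > 0:
            return int(value)

    # Caller-supplied default, if any.
    if fallback is not None:
        return fallback

    # No scheduler hints: use the CPUs this process is allowed to run on,
    # which respects cgroup/affinity limits on Linux.
    try:
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # sched_getaffinity is not available on every platform.
        return os.cpu_count() or 1


if __name__ == "__main__":
    print(f"Would start {guess_worker_count()} worker processes")
```

Something along these lines would cover both usage modes above, since a batch script and a notebook on an interactive node both inherit the scheduler's environment. A user who wants control could still override the guess explicitly.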