ExaWorks / job-api-spec

https://exaworks.org/job-api-spec/

Better clarification of the layers, how they compose, who will use them #60

Open SteVwonder opened 3 years ago

SteVwonder commented 3 years ago

In the Tuesday call, there was a strong push to "jump straight to Layer 2" and skip over Layer 0. This makes complete sense from a user perspective. Pilot jobs that avoid clogging queues, offer dynamicity, and are generally more performant are (probably) the future, so why bother building, exposing, and maintaining a Layer 0?

First, users with simple use cases may opt to use Layer 0 directly. Second, while most users may only use Layer 2 APIs, any Layer 2 implementation will certainly take advantage of the Layer 0 APIs. This distinction between the layers allows for extra composability (e.g., you can use Radical Pilot at Layer 2 and then swap in Flux for Slurm at Layer 0 without affecting any Layer 2 user code). It also means less work for Layer 2 implementations.

Based on the feedback we received, we don't think the composability of the layers came across, and explaining it better would benefit the document. A figure showing the "plug-and-play" nature of the various layers might even be instructive.
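As a rough sketch of the kind of composability we mean (the class and method names below are made up for illustration and are not taken from the spec), the Layer 2 code is written entirely against the Layer 0 interface, so the backend can change underneath it:

```python
# Illustrative sketch only: the names below are placeholders, not the spec's API.
from abc import ABC, abstractmethod


class Layer0Executor(ABC):
    """Minimal single-job interface that any LRM adapter would implement."""

    @abstractmethod
    def submit(self, job_spec: dict) -> str:
        """Submit one job described by a portable spec; return a job id."""


class SlurmExecutor(Layer0Executor):
    def submit(self, job_spec: dict) -> str:
        # Would translate job_spec into an sbatch submission.
        return "slurm-12345"


class FluxExecutor(Layer0Executor):
    def submit(self, job_spec: dict) -> str:
        # Would translate job_spec into a Flux jobspec and submit it.
        return "flux-f1a2b3"


class PilotLayer2:
    """Stand-in for a Layer 2 (pilot) implementation that only talks to Layer 0."""

    def __init__(self, layer0: Layer0Executor):
        self.layer0 = layer0

    def start_pilot(self, nodes: int) -> str:
        # The pilot itself is just another Layer 0 job.
        return self.layer0.submit({"executable": "pilot-agent", "nodes": nodes})


# Swapping Flux in for Slurm touches neither the Layer 2 code nor the user code:
layer2 = PilotLayer2(FluxExecutor())  # or PilotLayer2(SlurmExecutor())
print(layer2.start_pilot(nodes=8))
```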

Larofeticus commented 3 years ago

The observation is that if the existence of a "job" (in the LRM sense) is abstracted away from the user, then it is trivially easy to satisfy both layer 0 and layer 2 with one uniform interface.

Model of roles and responsibilities from the user (domain scientist or workflow tool developer) perspective: JPSI: "Give me your tasks and I will run them for you at the site."

Under the hood, a JPSI implementation might choose to submit every task as a job to the LRM, run everything in pilot jobs, a mixture, run the tasks on the login node, send some of the tasks to a cloud, or whatever else. And those distinctions would be by default abstracted away from the user, who only cares about their own tasks. That reduction in cognitive load is the value this project would add.
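To make that concrete, here is a toy sketch (all names and the routing policy are invented for illustration) of what "whatever else" could look like from the implementation side, with the user never seeing the routing decision:

```python
# Hypothetical sketch: the user hands over tasks; the implementation decides
# where each one actually runs. Nothing here is the spec's actual API.
class StubBackend:
    """Stands in for an LRM adapter, pilot system, login-node runner, etc."""

    def __init__(self, name):
        self.name = name

    def submit(self, task):
        return f"{self.name}:{task['executable']}"


def run_tasks(tasks, backends):
    """Route each task to some backend; the policy is hidden from the user."""
    handles = []
    for task in tasks:
        if task.get("cores", 1) == 1 and task.get("walltime_s", 0) < 60:
            backend = backends["login_node"]   # tiny tasks run locally
        elif task.get("elastic"):
            backend = backends["cloud"]        # elastic tasks burst out
        else:
            backend = backends["pilot"]        # default: pack into pilots
        handles.append(backend.submit(task))
    return handles


backends = {k: StubBackend(k) for k in ("login_node", "cloud", "pilot")}
print(run_tasks([{"executable": "/bin/date", "cores": 1, "walltime_s": 5}], backends))
```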

Or maybe JPSI should accept responsibility for a different role: JPSI: "Write a job spec in my interface and it can run on all of those LRMs."

In that case everything is layer 0 and the responsibility of choosing job packing is clearly delegated to the user.
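For example, a portable job description in that model might look roughly like this (the field names are illustrative, not taken from the spec), and each LRM adapter would render it into its native submission format:

```python
# A sketch of a portable, LRM-agnostic job description; field names invented.
job_spec = {
    "executable": "/usr/bin/simulate",
    "arguments": ["--input", "params.json"],
    "resources": {
        "node_count": 4,
        "processes_per_node": 32,
        "wall_time_minutes": 120,
    },
    "stdout_path": "simulate.out",
    "stderr_path": "simulate.err",
}
# A Slurm adapter would render this as an sbatch script, a Flux adapter as a
# Flux jobspec, an LSF adapter as a bsub invocation, and so on. How tasks are
# packed into such jobs is left entirely to the user in this model.
```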

hategan commented 3 years ago

The observation is that if the existence of a "job" (in the LRM sense) is abstracted away from the user, then it is trivially easy to satisfy both layer 0 and layer 2 with one uniform interface.

That's the plan. While not stated in the document at the moment, the API additions in Layer 2 are mostly about configuring how things run rather than describing what is to run. In other words, one needs to configure the pilot jobs (or the subsystem that manages them), whereas the actual user jobs can use the Layer 0 API.
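Something along these lines (hypothetical names, not what the document currently specifies): the Layer 2 specific part is the pilot configuration, while the job description itself stays plain Layer 0 material.

```python
# Illustrative sketch only; all names are hypothetical.
pilot_config = {              # Layer 2 addition: how things run
    "cluster": "clusterA",
    "pilot_size_nodes": 16,
    "pilot_walltime_minutes": 240,
    "max_pilots": 4,
}

user_job = {                  # same shape a Layer 0 submission would use
    "executable": "/usr/bin/analyze",
    "arguments": ["frame_0001.dat"],
    "resources": {"process_count": 8},
}

# A Layer 2 executor would consume pilot_config once, then accept many
# Layer 0 style job descriptions and run them inside the pilots:
#   executor = PilotExecutor(pilot_config)   # hypothetical
#   executor.submit(user_job)
```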

That said, for a workflow system implementer, there are subtle differences. One is that there is no queuing time once pilot jobs are running. If you are using multiple clusters, this becomes very relevant.

In order to load-balance properly, you have two basic choices: replicate or don't replicate (or mix them to some extent). Say you have two clusters. Replicate means you submit copies of the jobs to both and see which of the replicas manages to finish faster or which clears the queue faster. If you don't replicate, you want to avoid submitting a large number of jobs to a cluster that may never run them, so, without some queue time prediction, you basically have to probe the queues by submitting a limited number of jobs to each and only committing more to a system once you learn that the system can actually run them in useful time.
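A crude sketch of the non-replicating (probe) strategy, with made-up cluster and handle objects (`cluster.submit()` and `handle.is_running()` are invented helpers, not API from the spec):

```python
import time


def probe_then_commit(clusters, jobs, probe_size=5, batch_size=50, poll_s=60):
    """Submit a few probe jobs to every cluster, then commit real batches
    only to clusters whose probes actually start running."""
    pending = list(jobs)
    probes = {}
    for cluster in clusters:
        n = min(probe_size, len(pending))
        probes[cluster] = [cluster.submit(pending.pop()) for _ in range(n)]

    # Keep polling; this toy loop assumes at least one cluster eventually runs jobs.
    while pending:
        for cluster, handles in probes.items():
            # "Can actually run them in useful time" == a probe left the queue.
            if pending and any(h.is_running() for h in handles):
                n = min(batch_size, len(pending))
                for _ in range(n):
                    cluster.submit(pending.pop())
        time.sleep(poll_s)
```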

That's somewhat less relevant with a pilot system, since most of that logic goes into managing pilot jobs rather than deciding how to schedule user jobs. You can basically wait until pilot jobs start and then you know exactly how many cores you can use to run the user jobs.

File staging is another thing that a pilot-job system can change in a subtle way. In Layer 1, if data resides on a client machine, you would typically stage to a head-node FS and then either assume that said FS is shared (accessible by CNs) or use LRM staging. Random access from CNs on a shared FS usually performs poorly, since shared FSs have to make certain consistency guarantees, so you may end up staging certain files to local CN storage. With a pilot-job system, you can jump directly from the client side to the CN without using a shared FS. This, of course, only works if the job doesn't need some weird RW access to multiple regions in the same file.
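Roughly, the decision looks like this (helper name and arguments invented for illustration):

```python
def plan_stage_in(file_path, have_pilot_agent, fs_is_shared):
    """Sketch of the staging choice described above; returns a plan string."""
    if have_pilot_agent:
        # A pilot agent on the CN can pull directly from the client side,
        # bypassing the shared FS (fine for plain whole-file access).
        return f"pull {file_path} from client into CN-local storage via pilot agent"
    if fs_is_shared:
        # Classic path: stage client -> head-node FS and let CNs read it,
        # accepting the consistency-related performance hit for random access.
        return f"copy {file_path} to the shared FS on the head node"
    # Otherwise fall back to the LRM's own stage-in mechanism.
    return f"emit an LRM stage-in directive for {file_path}"


print(plan_stage_in("input.h5", have_pilot_agent=True, fs_is_shared=False))
```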

So, the bottom line is, when somebody says "jump straight to layer 2", that needs to be qualified a bit better. For most purposes, Layers 0 and 2 are the same. From that perspective, the jumping is baked in, so there is not much jumping left to be done. From an implementation perspective, we need a Layer 0 to implement a Layer 2, so it's hard to see how one could jump there. And, as far as the subtle differences between how Layers 0, 1, and 2 would be used, I can say from personal experience that, as a user of a job API, I would have loved to be able to jump straight to Layer 2. Unfortunately, in our role as API designers/implementers, we do not have that luxury and, while we could probably skip Layer 1, we still need a Layer 0.

andre-merzky commented 3 years ago

JPSI: "Give me your tasks and I will run them for you at the site." Under the hood, a JPSI implementation might choose ...

I'm late to the party, but want to add an opinion anyway :-)

I like the approach and the abstraction, and I can see this being extremely useful to the end user. Lowering cognitive load this way is great. Having said that, I don't think that JPSI is the place to specify or implement this. We currently scope the API (roughly) to be an abstraction layer above local (layer 0), remote (1), or nested (2) job management / batch systems. As such, I see the API as too low-level to make reasonable decisions about job scheduling and placement policies, which in the above case also involve reasoning about permissions and allocations, matching task requirements to hardware and software capabilities, etc. Burying that amount of semantics, which spans from high-level end-user requirements to near-system-level implementation, in a single library would result in, well, a mess. In my opinion, that type of functionality should live one or two layers above JPSI, in a workflow or campaign manager.

andre-merzky commented 3 years ago

From an implementation perspective, we need a Layer 0 to implement a Layer 2

I specifically agree on that part, and consider it sufficient justification for the existence of layer 0 (and 1).