flux-framework / rfc

Flux RFC project
https://flux-framework.readthedocs.io/projects/flux-rfc/

Capture/Define storage use cases in the jobspec RFC #54

Open dongahn opened 8 years ago

dongahn commented 8 years ago

Creating an issue ticket to capture some of the storage use cases in our jobspec RFC. Once we develop a good understanding of this topic, we should create a PR. Moving the existing discussion thread from PR #52 to below.

dongahn commented 8 years ago

From @SteVwonder:

For a simple storage use-case, maybe something similar to the request for a specific amount of memory, with the exact location of the disk resources (i.e., burst buffer (BB) and parallel file system (PFS)) depending on where they sit within the resource hierarchy. In this example, I assume that the PFS is the root of the resource tree and the on-node BBs are located under the compute nodes, as siblings to the on-node memory.

version: 1
walltime: 1h
resources:
  - type: parallel-file-system
    amount: 1TB
    with:
    - type: node
      with:
        - type: memory
          amount: 4GB
        - type: burst-buffer
          amount: 200GB

Some questions I had while writing this example:

  1. Can types have spaces in the name (e.g., parallel file system)?
  2. How would you simultaneously request capacity and bandwidth on a PFS/BB?
  3. If a center has multiple Lustre file systems and multiple GPFS file systems, how would the user specify that they want one type of PFS over the other (e.g., prefer Lustre over GPFS) without limiting the jobspec to a single PFS or explicitly listing all of the Lustre filesystems as potential matches (e.g., how to avoid writing something like lscratcha OR lscratchb OR lscratchc OR lscratchd OR lscratche OR lscratchf)?

Also, just a note on a more advanced storage use-case (this doesn't need to be solved immediately, but I figured I would bring it up since it will be an important use-case in the future for not only storage but also network bandwidth and power). After speaking with @dongahn the other day about burst buffer systems, it would be nice to have the ability to specify a job that has X GBs of space in the BB for staging-in data from the parallel filesystem and then X+Y GBs of space in the BB for actually running the user's app (since additional space will be needed for new checkpoints, etc). When I thought about how to write this in the current jobspec, three ideas came to mind.

  1. This information could be provided in the properties.user dictionary of a jobspec as two separate values, one for stage-in and one for execution. A "BB-aware" scheduler could explicitly look for these values in the properties.user dictionary and then schedule the job according to some special BB logic. To me, this is a simple, straightforward way to do it (a rough sketch appears after this list).
  2. The initial jobspec could request only X GBs of space, and then once the stage-in is complete, the job/user app could request additional space on the BB as needed via the dynamic scheduling interface. The main problem here is that the job is never guaranteed to actually receive any additional capacity.
  3. The jobspec could add support for time as a dimension in the count/amount fields. Then malleable/evolving schedulers could take advantage of this additional knowledge, and rigid schedulers could just use the max of the count/amount fields.
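
Purely as an illustration of idea 1, the per-phase BB sizes could be carried as opaque user properties that only a BB-aware scheduler would interpret. The key names below (bb-stage-in-size, bb-run-size) are made up for this sketch and are not part of any spec:

properties:
  user:
    bb-stage-in-size: 200GB    # X: capacity needed while staging in from the PFS
    bb-run-size: 300GB         # X+Y: capacity needed while the user's app is running
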
dongahn commented 8 years ago

From @dongahn:

Existing BB implementations and high level considerations

Given the importance of the burst buffer resources for future systems, I propose our canonical jobspec covers some of the basic burst buffer use cases. Here are a few criteria I think we should consider. Our canonical jobspec should be able to

  1. express burst buffer requirements for at least two different emerging implementations and their various storage access modes:
    • Node-local NVRAM (our Sierra) implementation with node visibility
    • Intermediate BB nodes (Cray's DataWarp) with striped access mode (job-level BB filesystem mount and visibility)
    • Intermediate BB nodes (Cray's DataWarp) with private access mode (node-level mount/visibility)
  2. express requirements for BB capacity, file sets that need to be staged in and out, and bandwidth
  3. express the capacity requirements in a multi-staged fashion, information that is essential for the scheduler to overlap the job's stage-in phase with the jobs currently running on the target nodes

Expressing a BB location in a jobspec's resource hierarchy seems natural

It seems we can do item 1 quite easily with the current jobspec.

For node-local implementation and BB nodes implementation in private mode:

resource:
  - type: node
    count:
        min: 4
    with:
      - type: bb

For BB nodes implementation in striped mode (all of the allocated nodes commonly access the BB file system instance):

resource:
  - type: node
    count:
        min: 4
  - type: bb

Capacity and staging fileset specification

fileset as resources

Capacity, staging file sets, and bandwidth requirements could use more discussion. First, to describe the staging requirements, my first attempt is to treat the staging file sets as resources and hang them under the burst buffer resource:

resource:
  - type: node
    count:
        min: 4
    with:
      - type: bb
        with:
          - type: stage-in
            files: 
              - /lscratchb/user1/cr2.$rank
              - /lscratchb/user1/cr3.$rank
          - type: stage-out
            files: 
              - /lscratchb/user1/cr100.$rank
              - /lscratchb/user1/cr101.$rank

$rank or some built-in variables would be needed for the spec to express per-node files to stage. I used files as the key to describe the staging files as a list of files, but one could also introduce dirs and flistfile as alternative ways to describe the staging files.
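
As a rough sketch of those alternatives (dirs and flistfile are hypothetical keys and the paths are illustrative; none of this is part of the current spec), a stage-in entry in the example above might instead look like:

          - type: stage-in
            dirs:
              - /lscratchb/user1/inputs.$rank/         # stage an entire per-rank directory
            flistfile: /lscratchb/user1/stage-in.list  # a file listing one path to stage per line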

For the BB nodes implementation in striped mode, all of the files that the job needs as a whole can be described in the same manner.

Capacity spec

To describe the capacity requirements, we at least need to specify two quantities: the storage requirement for the stage-in phase and that for the running phase. The former is typically smaller than the latter, and hence this should help the scheduler overlap the stage-in phase with the currently running jobs' computation or stage-out phase. My first attempt, which I don't yet like:

resource:
  - type: node
    count:
        min: 4
    with:
      - type: bb
        count:
            min: 20GB
        with:
          - type: stage-in
            count:
                min: 10GB
            files:
                - /lscratchb/user1/cr2.$rank

I can imagine we can also:

resource:
  - type: node
    count:
        min: 4
    with:
      - type: bb
        count:
            min: 10GB
            max: 20GB
        with:
          - type: stage-in
            files:
                - /lscratchb/user1/cr2.$rank

But my mind doesn't easily read this as 10GB for the stage-in phase and 20GB for the running phase.

Bandwidth spec

For bandwidth requirements, I think for now we only need to be able to specify the bandwidth between the BB layer and the PFS. (We treat CN to BB as a highway)

But even then, I'm not sure how to express this yet. I suspect my trouble is that I don't fully understand yet how the general graph works in our jobspec. Perhaps we can use this as an example to show how to specify a resource requirement in a general graph.

In any case, I believe IO-bandwidth awareness can help increase resource utilization when we schedule jobs so that their stage-in operations overlap with the currently running jobs. BB-PFS bandwidth awareness can also allow our scheduler to delay stage-in operations as late as possible and help avoid locking down a job's nodes earlier than necessary.

dongahn commented 8 years ago

From @tws:

As to the storage use-case, I really want that to be a bit separate because I think that will be the first place where we really want to leverage a non-hardware-tree resource. I say that because I think a user should be able to express the amount of burst buffer that each task needs and have it able to match with any of the three examples Dong mentioned on the system side. I'm not sure we can do that with storage expressed at the full job level, but going from per-task to job level should work out pretty cleanly.

grondo commented 8 years ago

Thanks @dongahn!

I thought at some point we had discussed using multiple programs/jobs to handle stage-in/stage-out type facilities. E.g. in your example above, instead of a "stage-in" resource, we would have a stage-in job that would directly request the storage (and small amount of compute) needed to complete the stage-in. The compute portion of the job would then just be a dependency of the stage-in job. Similar for stage out.
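
A minimal sketch of what such a stage-in jobspec might look like, assuming some job dependency mechanism exists (all field names here are illustrative, not settled syntax):

resource:
  - type: node
    count:
        min: 1
    with:
      - type: bb
        count:
            min: 20GB    # just enough capacity to hold the staged-in files

The compute jobspec would then request only the nodes and BB capacity needed for the running phase and declare a dependency on the stage-in job, and a stage-out job would in turn depend on the compute job.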

I like this idea because if the job dependency system has enough functionality to make this work, then other workflow type jobs would be possible.

I haven't thought on this subject in detail like you have however, so there may be big problems with this approach, but I thought I'd throw it out there.

dongahn commented 8 years ago

Thanks @grondo:

Yes, this is more than a possibility. In fact, Greg Becker, who is using a Trinity system, told me that a job submission on this system results in 4 different sub-jobs: a stage-in job, a compute job, a storage job (for the running phase), and a stage-out job, all with dependencies. So my guess is that other RMs may be using a similar approach.

One thing to consider, though, is that this route will require higher complexity from the job dependency manager within our scheduler. Well... I am saying the complexity will have to live somewhere, but in this case it will be in the job dependency management.

Whether we use a job dependency mechanism or the canonical jobspec for staging, it feels like having concrete cases can help us see whether either mechanism can naturally express a staging requirement across emerging BB implementations. I loosely use the following as the design space:

| File Access/Sharing Requirement                         | Node-local BB | Job-global BB |
| ------------------------------------------------------- | ------------- | ------------- |
| A file per MPI process                                   | Case 1.1      | Case 1.2      |
| Each file shared by a distinct subset of MPI processes   | Case 2.1      | Case 2.2      |
| One file shared by all of the MPI processes              | Case 3.1      | Case 3.2      |

For the current node-local BB implementation, I think Case 2.1 will be limited in that a file can be shared and accessed only by the MPI processes on the same node. And it doesn't support Case 3.1.

The main reason my first attempt specified file sets as resources was that it seemed quite easy to express all of these cases (except for Case 3.1). If we use a job dependency mechanism, what would the stage-in jobspec look like?

grondo commented 8 years ago

These are great points, and sorry I am coming at this without doing a deep dive yet as you have, so let me know if my comments are less than helpful.

My feeling at this point is that a rich dependency is one place where Flux could shine compared to other schedulers, so I'd like to continue investigating this approach. I don't think we should model our dependency system after Moab/Slurm.

One feature we have in Flux that could help here is dynamic tagging of resources. Something like that could be used to support user specific resource requests (I want to run on the same nodes as Job X), and could also be used in general for data locality support (once stage-in job starts, storage resources are tagged with job-specific information, and now compute job becomes eligible)
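
As a purely hypothetical sketch of the data-locality case (the tags key and the tag value are invented here), the compute job might request BB resources carrying a tag that was applied when its stage-in job ran:

resource:
  - type: node
    count:
        min: 4
    with:
      - type: bb
        tags:
          - staged-for-job-1234    # hypothetical tag applied to the bb once stage-in starts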

Alternately, an advanced scheduler could look at the whole "ensemble" or jobspec stream and would have the same information as if the stage-in/stage-out were specified within one job, and could schedule the jobs as a unit.

I also had the same idea as you that a jobspec stream could be used to represent an evolving job, if the changing resources could be represented as a disjoint function of time. This is not really different from treating each time unit with static resources as a separate program/job (only tasks do not start/end in between).

In fact we had talked about keeping the canonical jobspec with the job record for provenance, but this doesn't make sense for jobs that evolved over time. A stream of jobspecs with timestamps might be the easiest way to actually accomplish this record. (Perhaps we could even come up with a jobspec delta format to avoid duplicating the majority of the spec at each stage, think of each job change as similar to a git commit ;-) )
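
A purely illustrative sketch of such a record (the timestamps, the delta key, and the path-like notation are all invented here):

- time: 2017-06-01T09:00:00
  jobspec: {}    # the full canonical jobspec as originally submitted would go here
- time: 2017-06-01T09:30:00
  delta:
    resource[0].with[0].count.min: 20GB    # e.g., the BB request grows once stage-in completes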

Sorry, I think my points are only tangentially related here, and I'm getting off topic. But, it does feel like there is some small chance that we could unify stage-in/out, workflow, complex dependencies, and evolving jobs into one central feature-set, which seems powerful.

I could be building a pipe dream though...

dongahn commented 8 years ago

Starting from the part I liked the most :-)

> Sorry, I think my points are only tangentially related here, and I'm getting off topic. But, it does feel like there is some small chance that we could unify stage-in/out, workflow, complex dependencies, and evolving jobs into one central feature-set, which seems powerful.

My gut is telling me there is indeed a good chance for a unified model. Perhaps this should be "EXPLORED" in a small/researchy prototype fashion :-) More comments inlined below.

> My feeling at this point is that a rich dependency is one place where Flux could shine compared to other schedulers, so I'd like to continue investigating this approach. I don't think we should model our dependency system after Moab/Slurm.

Completely agreed. I was just mentioning those to indicate we may be exploring new territory here.

> One feature we have in Flux that could help here is dynamic tagging of resources. Something like that could be used to support user specific resource requests (I want to run on the same nodes as Job X), and could also be used in general for data locality support (once stage-in job starts, storage resources are tagged with job-specific information, and now compute job becomes eligible)

Yes, this will have enormous utility as we head toward data-constrained computing. We want to schedule jobs to wherever the data may still reside.

> Alternately, an advanced scheduler could look at the whole "ensemble" or jobspec stream and would have the same information as if the stage-in/stage-out were specified within one job, and could schedule the jobs as a unit.

> I also had the same idea as you that a jobspec stream could be used to represent an evolving job, if the changing resources could be represented as a disjoint function of time. This is not really different from treating each time unit with static resources as a separate program/job (only tasks do not start/end in between).

I think we can treat the burst buffer case as a specialized case of this and use it as our driver toward generalized support. Its storage requirement evolves, but only at well-known schedule points (the stage-in point, the run start point, and the stage-out point). This is a bit different from true elastic support, where the schedule points are determined dynamically at runtime. But this is another dimension of dynamicity support where Flux can shine.

> In fact we had talked about keeping the canonical jobspec with the job record for provenance, but this doesn't make sense for jobs that evolved over time. A stream of jobspecs with timestamps might be the easiest way to actually accomplish this record. (Perhaps we could even come up with a jobspec delta format to avoid duplicating the majority of the spec at each stage, think of each job change as similar to a git commit ;-) )

Maybe a jobspec can also be generated by the job itself in cases where its resource requirements change dynamically. Another reason for the delta format.

Popping back out from this a bit, do you want to make this use case a bit larger and talk about things in general terms, or use the BB cases as the driver for spec'ing things out and then see how other use cases can map onto that, and iterate? :-)

grondo commented 8 years ago

Sorry, yeah, we should focus on some concrete, near-term use cases here and get them into the RFC; the larger issues can stay in the back of our minds for a bit...

dongahn commented 8 years ago

@grondo, this discussion was great because it allowed me to think about this use case from the perspective of later generalization.

Would it be possible to have a couple of concrete jobspecs to see what it might look like if we describe staging as dependent jobs? I can attempt it, but I will likely mess up since I didn't do a deep dive into the job dependency discussions we might have had.

lipari commented 8 years ago

As I read through this discussion, I jumped to the same conclusion you did: that scheduling a BB is just a generalized case of scheduling a series of resource needs over time, similar to growing/shrinking node resource requirements. If our generalized jobspec can incorporate a series of resource-need changes at specific times into the job's run time, then BB support would be covered. The scheduler's focus would have to be time-based (like the current sched_backfill plugin), and it would search for availability of various resources at various points in time, subject to rules about how much delay would be acceptable (e.g., schedule this job even though it will have to wait 9 minutes, at two hours into the run, for a certain resource to become available).
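
As a purely illustrative sketch of what "a series of resource needs over time" could look like for the BB case (the phase key and phase names are invented here; nothing like this exists in the current spec):

  - type: bb
    count:
      - phase: stage-in     # hypothetical time-phased count
        min: 10GB
      - phase: run
        min: 20GB
      - phase: stage-out
        min: 20GB

A rigid scheduler could simply take the maximum over the phases, while a time-based scheduler could reserve each amount only for the window in which it is actually needed.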

dongahn commented 8 years ago

From discussions with Ewa Deelman during her visit and afterwards, my impression was that there should be many scheduling algorithms we can borrow from the area of scientific workflow scheduling for this. I talked with @SteVwonder over lunch today, and we'll see if we can add a few research items he can explore in this area, since I do think IO-bandwidth awareness can lead to a better schedule. Still, our short-term goal for this ticket remains the same.

dongahn commented 7 years ago

This needs to be revisited, keeping this open.