
rfc14: support for tiered IO storage #204


SteVwonder commented 5 years ago

The three main use-cases for tiered IO (that we see so far) are:

- scratch: job-private space on the fast tier for temporary data
- cache: the fast tier acting as a cache in front of the parallel file system
- staging: moving data onto/off of the fast tier before/after the job (stage-in/stage-out)

We want to make sure we can express all three of these use-cases in the jobspec. With existing systems, caching and scratch are mutually exclusive, but both are compatible with staging; that may not always be the case.

Idea 1

We could take an approach similar to what is done already (Slurm @ NERSC link) by placing all of this information under attributes.system.tiered-io (or similar). For example:

```yaml
resources:
  - type: node
    count: 4
    with:
      - type: slot
        count: 1
        label: default
        with:
          - type: core
            count: 2
attributes:
  storage:
    capacity: 4 TB
    mode: scratch
    granularity: per-node
    stage-in:
      directory: /path/to/PFS
```

I think this method starts to break down once you have multiple tiers of storage. We would have to embed locality information into the attributes section, which seems wrong.
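For instance, with both a node-local scratch tier and a cluster-wide cache tier, Idea 1 would be forced into something like the following sketch (the locality field here is hypothetical, purely to illustrate the duplication):

```yaml
# Hypothetical two-tier version of Idea 1. Since the resources section
# says nothing about storage, each entry has to carry its own locality
# information (the "locality" field is made up for illustration).
attributes:
  storage:
    - capacity: 1 TB
      mode: scratch
      granularity: per-node
      locality: node          # where this tier lives, duplicated here
    - capacity: 4 TB
      mode: cache
      locality: cluster       # the scheduler can't see or account for this
      stage-in:
        directory: /path/to/PFS
```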

Idea 2

I think a better idea would be to lift some of this information (i.e., capacity and locality) into the resources section. To avoid piling all of that information onto the resource vertices, we can add labels to the resources and reference them from the attributes section:

```yaml
resources:
  - type: node
    count: 4
    with:
      - type: slot
        count: 1
        label: default
        with:
          - type: core
            count: 2
      - type: storage
        count: 1
        unit: terabytes
        label: node-local-scratch
  - type: storage
    count: 4
    unit: terabytes
    label: PFS-cache
attributes:
  storage:
    - label: node-local-scratch
      mode: scratch
      granularity: per-node
      stage-in:
        list: /path/to/stage-in-listing
    - label: PFS-cache
      data-layout: striped
      mode: cache
      stage-in:
        directory: /path/to/PFS
```

This second idea could be really powerful for hybrid architectures where there are SSDs in multiple storage tiers, each of which a job may want to configure differently. It gives us lots of flexibility and allows us to support hybrid architectures without requiring any modifications to the existing canonical jobspec. It also provides a separation of concerns: the resource info that the scheduler cares about lives up in the resources section, while the configuration/staging info that the IMP/job shell cares about lives down in the attributes section.
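As a concrete (hypothetical) illustration of the hybrid case: node-local SSDs used as per-node scratch alongside a shared rack-level SSD pool used as a cache. The rack resource type, labels, and capacities below are assumptions for the sake of the example:

```yaml
# Hypothetical hybrid sketch: locality comes from where each storage
# vertex sits in the resource tree, while per-tier configuration stays
# under attributes, keyed by label.
resources:
  - type: rack              # assumed resource type for illustration
    count: 1
    with:
      - type: node
        count: 4
        with:
          - type: slot
            count: 1
            label: default
            with:
              - type: core
                count: 2
          - type: storage   # node-local SSD
            count: 1
            unit: terabytes
            label: node-local-ssd
      - type: storage       # SSD pool shared across the rack
        count: 8
        unit: terabytes
        label: rack-ssd-pool
attributes:
  storage:
    - label: node-local-ssd
      mode: scratch
      granularity: per-node
    - label: rack-ssd-pool
      mode: cache
      data-layout: striped
      stage-in:
        directory: /path/to/PFS
```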

Any thoughts?

SteVwonder commented 5 years ago

Notes from the group meeting:

dongahn commented 5 years ago

It would be nice to extend resource-query to generate visualizations like ORNL's Jsrun visualizer.

If there is already a visualizer that supports a reasonable input format and meets our requirements, this would just be a matter of adding a new writer.