flux-framework / flux-coral2

Plugins and services for Flux on CORAL2 systems
GNU Lesser General Public License v3.0

Enable more flexible lustre scheduling #161

Open · jameshcorbett opened this issue 6 months ago

jameshcorbett commented 6 months ago

If #157 goes in and changes the layout of the resource graph, it will enable jobspecs that look like this:


```yaml
version: 9999
resources:
  - type: ssd
    count: 50000
    exclusive: true
  - type: node
    count: 1
    exclusive: false
    with:
    - type: slot
      label: task
      count: 1
      with:
      - type: core
        count: 1
# a comment
attributes:
  system:
    duration: 3600
tasks:
  - command: [ "app" ]
    slot: task
    count:
      per_slot: 1
```

This would allow Fluxion to pick rabbit SSDs and nodes completely independently, which is a perfect match for Lustre file systems. However, if a job also asked for xfs or gfs2 in addition to Lustre, I think the only option would be to fall back to forcing all storage to be rack-local.
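A minimal sketch of the placement policy described above, assuming directives take the `#DW jobdw type=<fs> capacity=<size> name=<name>` form. The helper names `filesystem_types` and `storage_placement` are hypothetical and are not taken from `directivebreakdown.py`. A job whose `jobdw` directives are all Lustre gets independent placement; any xfs or gfs2 request falls back to rack-local.

```python
import shlex


def filesystem_types(directives):
    """Return the set of file system types requested by #DW jobdw directives.

    Hypothetical helper: assumes the "#DW jobdw type=... capacity=... name=..."
    form; the real parsing in directivebreakdown.py may differ.
    """
    types = set()
    for directive in directives:
        # shlex.split does not treat "#" as a comment by default,
        # so "#DW" survives as the first token.
        words = shlex.split(directive)
        if words[:2] != ["#DW", "jobdw"]:
            continue  # other directive kinds are ignored in this sketch
        for word in words[2:]:
            key, _, value = word.partition("=")
            if key == "type":
                types.add(value)
    return types


def storage_placement(directives):
    """Pick a hypothetical placement policy for a job's rabbit storage."""
    types = filesystem_types(directives)
    if not types:
        return "none"  # no jobdw storage requested
    if types == {"lustre"}:
        return "independent"  # SSDs and nodes may be chosen separately
    return "rack-local"  # xfs/gfs2 present: keep all storage rack-local
```

For example, `storage_placement(["#DW jobdw type=lustre capacity=10GiB name=scratch"])` returns `"independent"`, while adding an `xfs` directive to that list makes it return `"rack-local"`.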

To enable this, directivebreakdown.py would need to be updated to recognize lustre-only directives, and coral2_dws would need to inspect the JGF output from Fluxion after scheduling to see which rabbits were selected, rather than assuming, as it does now, that the selected rabbits are simply the ones local to the selected nodes.
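A rough sketch of the JGF inspection step, assuming Fluxion's usual JGF shape (`{"graph": {"nodes": [...], "edges": [...]}}`, with each vertex carrying `metadata.type` and `metadata.paths.containment`). The post-#157 graph layout is not known here, so the sketch simply assumes that the parent component of each selected SSD's containment path identifies its rabbit (or the rack holding it). `selected_rabbits` is a hypothetical name, not an existing coral2_dws function.

```python
import posixpath


def selected_rabbits(jgf):
    """Return sorted names of the rabbits whose SSDs appear in a JGF allocation.

    Assumption: an SSD's containment path looks like /cluster0/rabbit3/ssd0,
    so its parent path component names the rabbit (or enclosing rack).
    """
    rabbits = set()
    for vertex in jgf["graph"]["nodes"]:
        metadata = vertex["metadata"]
        if metadata.get("type") != "ssd":
            continue  # only storage vertices tell us which rabbits were picked
        path = metadata["paths"]["containment"]
        rabbits.add(posixpath.basename(posixpath.dirname(path)))
    return sorted(rabbits)
```

Deriving rabbits from the SSD vertices themselves, rather than from the allocated compute nodes, is what lets the two be chosen independently for Lustre-only jobs.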