NearNodeFlash / NearNodeFlash.github.io

Reducing OST (and perhaps MDT) fragmentation #176

Open jameshcorbett opened 3 months ago

jameshcorbett commented 3 months ago

Servers resources for Lustre, when filled in by Flux, can look something like this:

spec:
  allocationSets:
  - allocationSize: 824633720832
    label: ost
    storage:
    - allocationCount: 3
      name: elcap1
    - allocationCount: 1
      name: elcap2
  - allocationSize: 17179869184
    label: mdt
    storage:
    - allocationCount: 3
      name: elcap1
    - allocationCount: 1
      name: elcap2

What this gives us is 3 OSTs on elcap1 and 1 on elcap2. However, as @behlendorf noted,

For Lustre we realistically wouldn't want to ever create more than one OST or MDT per rabbit per workflow. It's good that HPE's software supports it since it'd be nice to experiment with, but it would be an odd configuration.

However, I think there is a disconnect between the way Flux allocates storage and the way Servers asks for the storage to be represented. At the moment Flux does not have any kind of policy to allocate equal amounts of storage from each rabbit. Flux may allocate a huge chunk of storage (say N bytes) from elcap1 and a much smaller amount (M bytes) from elcap2 (as in the example above), while nevertheless wanting a single OST (and perhaps MDT) on each despite the size difference. But there is no good way for us to represent that in Servers without doing something like the above: take the greatest common divisor of N and M, make that the allocationSize, and then set the allocationCount for the two rabbits to N / GCD(N, M) and M / GCD(N, M) respectively.
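To make that workaround concrete, here is a minimal Python sketch (purely illustrative, not part of Flux; the byte counts come from the example spec above) that derives a single allocationSize and the per-rabbit allocationCount values from unequal per-rabbit totals:

import math

# Hypothetical per-rabbit byte totals, matching the example Servers spec above:
# elcap1 was given three 824633720832-byte chunks, elcap2 one.
requested = {
    "elcap1": 3 * 824633720832,  # N bytes
    "elcap2": 824633720832,      # M bytes
}

# One allocationSize is shared by the whole allocation set, so it has to be
# the GCD of the per-rabbit totals; each rabbit's allocationCount is then
# its total divided by that size.
allocation_size = math.gcd(*requested.values())
storage = [
    {"name": name, "allocationCount": total // allocation_size}
    for name, total in requested.items()
]

print(allocation_size)  # 824633720832
print(storage)  # [{'name': 'elcap1', 'allocationCount': 3}, {'name': 'elcap2', 'allocationCount': 1}]

Note that if N and M share no large common factor, this scheme degenerates into many small allocations per rabbit, which is exactly the fragmentation this issue is about.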

@behlendorf also noted that imbalanced allocations may not be desirable, since

Lustre will do a better job of balancing their usage if [OSTs are] all close to the same capacity

So Flux may need to work on a policy to equalize the amount of storage on each rabbit node (a rough sketch of such a policy follows the example below). However, it might be nice if we could do something like the following:

spec:
  allocationSets:
  - label: ost
    storage:
    - allocationCount: 1
      name: elcap1
      allocationSize: 2424633720832
    - allocationCount: 1
      name: elcap2
      allocationSize: 824633720832
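If Flux instead went the equalization route mentioned above, one possible policy (a rough sketch only, assuming Flux knows the total OST capacity requested and which rabbits will be used; the names and numbers are illustrative and this is not Flux's actual code) would be to split the total evenly so every OST ends up the same size, in line with the advice that OSTs should be close to the same capacity:

# Sketch of an "equalize per-rabbit storage" policy. It produces a spec in the
# current Servers schema: one allocationSize per set, one allocation per rabbit.

def equalize(total_bytes, rabbits, granularity=1 << 20):
    """Split total_bytes evenly across rabbits, one allocation each,
    rounding each share up to a 1 MiB granularity."""
    share = -(-total_bytes // len(rabbits))          # ceiling division
    share = -(-share // granularity) * granularity   # round up to granularity
    return {
        "allocationSets": [{
            "label": "ost",
            "allocationSize": share,
            "storage": [
                {"name": rabbit, "allocationCount": 1} for rabbit in rabbits
            ],
        }]
    }

# Example: the same total capacity as the first spec above (4 x 824633720832
# bytes), but now spread as two equal-sized OSTs, one per rabbit.
spec = equalize(4 * 824633720832, ["elcap1", "elcap2"])
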
jameshcorbett commented 3 months ago

Brian has indicated to me that OSTs of unequal sizes may be devastating to performance. So perhaps the problem is really in Flux's scheduling policy. I've opened https://github.com/flux-framework/flux-coral2/issues/175.