flux-framework / flux-coral2

Plugins and services for Flux on CORAL2 systems
GNU Lesser General Public License v3.0

Rabbit allocations of equal or near-equal sizes #175

Open jameshcorbett opened 4 months ago

jameshcorbett commented 4 months ago

For ephemeral lustre file systems on rabbits, Flux can choose rabbit storage irrespective of the location of the compute nodes. Flux can also split the allocation across multiple rabbits (and will need to, depending on the size of the file system). However, if the allocation is split across multiple rabbits, @behlendorf has indicated that it is a performance requirement that the allocations all be equal or near-equal in size.
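For concreteness, here is a minimal sketch (plain Python, not tied to any existing Fluxion or flux-coral2 API; the function name and block-based units are illustrative only) of the kind of split we would want: divide a requested capacity across k rabbits so that no two slices differ by more than one allocation block.

```python
def equal_split(total_blocks, num_rabbits):
    """Split a requested capacity (in fixed-size allocation blocks)
    across rabbits so that no two slices differ by more than one block."""
    base, remainder = divmod(total_blocks, num_rabbits)
    # The first `remainder` rabbits get one extra block.
    return [base + 1 if i < remainder else base for i in range(num_rabbits)]

# e.g. 10 TiB in GiB-sized blocks across 3 rabbits
print(equal_split(10 * 1024, 3))  # [3414, 3413, 3413]
```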

I don't know at the moment how to accomplish this. A new Fluxion match policy? @milroy, @zekemorton, or @trws, any ideas?

trws commented 4 months ago

What's the requirement that's causing trouble currently? I can think of a few ways we can force fit this, but we may have to work at it a bit.

jameshcorbett commented 4 months ago

Currently we allocate rabbit storage local to the compute nodes we've chosen. So if Fluxion picks five nodes on rack A and one on rack B, it will also allocate five times as much storage on rabbit A as on rabbit B. [side note: we do it this way just because it works for XFS and GFS2 and Lustre, even though it's unnecessarily restrictive for Lustre. See #161]
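To make the imbalance concrete, here's a toy model of the current compute-local policy (hypothetical numbers and names; this is not code from the plugin), where per-rabbit storage simply follows the number of selected nodes under each rabbit:

```python
def compute_local_allocation(nodes_per_rabbit, per_node_storage_gib):
    """Toy model of the current behavior: each rabbit provides storage
    proportional to the number of selected compute nodes it hosts."""
    return {rabbit: count * per_node_storage_gib
            for rabbit, count in nodes_per_rabbit.items()}

# Five nodes under rabbit A, one under rabbit B, 1 TiB requested per node
print(compute_local_allocation({"rabbit-A": 5, "rabbit-B": 1}, 1024))
# {'rabbit-A': 5120, 'rabbit-B': 1024} -- a 5:1 imbalance between the OSTs
```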

If we try to set up Lustre OSTs on both rabbits in a case like that where the storage isn't evenly distributed, Brian has said that he expects performance to be "badly wrecked":

Lustre will attempt to evenly use the capacity [of the OSTs], if they're widely different in size then some will be much more heavily used than others. I'd expect that to pretty badly wreck performance.

e.g. if you have 2 OSTs, a 1TB and a 5TB. Then the 5TB will get 5x the IO sent to it.
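My reading of that example, as a back-of-the-envelope model (my own simplification, not something Brian wrote): if object placement is weighted by OST capacity, each OST's share of the I/O is proportional to its size.

```python
def io_share_by_capacity(ost_sizes_tb):
    """Assume each OST's share of new I/O is proportional to its capacity."""
    total = sum(ost_sizes_tb)
    return [size / total for size in ost_sizes_tb]

# A 1 TB OST next to a 5 TB OST: the larger one absorbs ~83% of the I/O,
# i.e. five times as much as the smaller one.
print(io_share_by_capacity([1, 5]))  # [0.166..., 0.833...]
```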

I'm not sure I entirely understand this logic though so I'm going to check back in with Brian about it.

trws commented 4 months ago

That logic sounds right, though painful, since it's how uneven storage striping systems tend to work. You end up basically round-robining over however many stripes there are, which overloads a larger device in some cases (did this to myself with an uneven software parity setup once).

The simplest (though most annoying) thing I can think of is to enact a version of the plan we talked about at the whiteboard a while ago and effectively "split" each rabbit's storage into a pool for NNL storage and a pool for lustre-type storage. Make half available through the nodes you're already using, and expose the other half from a separate "meta-node", maybe just hanging off the top of the cluster, that gets used for ephemeral lustre; then request an even amount from each rabbit involved. Unfortunately that would mean tracking how much of each rabbit's lustre storage is consumed some other way, which is kinda awful.
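For what it's worth, a rough sketch of the external bookkeeping that scheme would force on us (hypothetical structure, not an existing flux-coral2 or Fluxion interface): reserve a fixed lustre pool on each rabbit and track its consumption outside the resource graph.

```python
class LustrePoolTracker:
    """Hypothetical side-table for the lustre half of each rabbit, since
    Fluxion would no longer see that storage directly."""

    def __init__(self, lustre_capacity_gib):
        # rabbit name -> capacity reserved for ephemeral lustre
        self.capacity = dict(lustre_capacity_gib)
        self.used = {rabbit: 0 for rabbit in self.capacity}

    def allocate_even(self, rabbits, total_gib):
        """Take an equal slice from each named rabbit, failing if any
        rabbit cannot cover its share."""
        share = -(-total_gib // len(rabbits))  # ceiling division
        for rabbit in rabbits:
            if self.used[rabbit] + share > self.capacity[rabbit]:
                raise RuntimeError(f"{rabbit} cannot provide {share} GiB")
        for rabbit in rabbits:
            self.used[rabbit] += share
        return {rabbit: share for rabbit in rabbits}
```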

The more satisfying solution would be to have a way to express that a given resource type prefers even load, or greater distribution, or something similar, so that we can actually get better behavior. The cheapest thing I can think of that's at least somewhat in this direction is that we could try to optimize for choosing as many different rabbits as possible to service the request. You have 10 nodes and ask for lustre of at least 10TB? We try to give you slices of 10 rabbits. We don't have built-in support for that, but the hook we use to do "node-centric allocation" works similarly, so we might be able to get it with a relatively small tweak.
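As a sketch of what that policy might compute (illustrative only; it assumes we can see each rabbit's free lustre capacity, and the names are made up): take equal slices from as many rabbits as possible, dropping the rabbits that can't cover an equal share of the request.

```python
def spread_request(total_gib, free_gib_per_rabbit):
    """Take equal slices from as many rabbits as possible, dropping the
    rabbits that cannot cover an equal share of the request."""
    candidates = sorted(free_gib_per_rabbit.items(), key=lambda kv: kv[1])
    while candidates:
        share = -(-total_gib // len(candidates))  # ceiling division
        if all(free >= share for _, free in candidates):
            return {rabbit: share for rabbit, _ in candidates}
        candidates.pop(0)  # the smallest rabbit can never cover a larger share
    raise RuntimeError("request cannot be satisfied with equal slices")

# 10 nodes asking for at least 10 TiB of lustre, 10 rabbits with room to spare:
# every rabbit contributes an equal ~1 TiB slice.
print(spread_request(10 * 1024, {f"rabbit-{i}": 40960 for i in range(10)}))
```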