NVIDIA / aistore

AIStore: scalable storage for AI applications
https://aistore.nvidia.com
MIT License

Documentation on nodes with different storage capacity #176

Closed johnzielke closed 2 months ago

johnzielke commented 2 months ago

Describe the issue

How does AIS handle nodes with different amounts of storage? Will it balance objects so that disk usage is equal as a percentage on each node, or will each node fill up equally in absolute terms, effectively restricting the cluster to nodes with the same amount of storage? I tried to find documentation on this but could only find statements that the performance of drives and nodes is assumed to be similar.

Links

N/A

alex-aizman commented 2 months ago

The nodes are supposed to be identical, capacity-wise and otherwise. The term to look for is "rebalance":

## overall
$ git grep -n rebalance | wc -l
457

## and the docs only
$ git grep -n rebalance -- "*.md" | wc -l
179
johnzielke commented 2 months ago

Thank you very much. I just wasn't sure whether the fact that all nodes in the docs were set up with the same amount of storage was for simplicity or out of necessity. What would you suggest for handling hardware upgrades in a cluster, then? Setting up a completely new one and swapping out all nodes at once?

alex-aizman commented 2 months ago

There's no need for any drastic "swapping-out" steps. An AIS cluster will continue operating if you, for instance, add a machine with 16 drives to existing nodes that each have, say, 12. Scenarios abound, but normally you should be able to perform a gradual, incremental upgrade.

When I say "normally" I mean avoiding corner cases where one or several nodes have less than 5% remaining usable capacity. Normally, you should use Grafana/Prometheus monitoring to surface aistore alerts, which include out-of-space (OOS) conditions, among others.
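The 5%-remaining-capacity corner case can be watched for with a standard Prometheus alerting rule. A sketch, assuming node_exporter's filesystem metrics and mountpaths under `/ais/` — the mountpoint pattern, group name, and alert name are placeholders, and aistore's own exported capacity metrics may differ by version:

```yaml
groups:
  - name: aistore-capacity        # placeholder group name
    rules:
      - alert: AISTargetLowCapacity
        # node_exporter filesystem metrics; the /ais/.* mountpoint
        # pattern is an assumption about your deployment layout
        expr: |
          node_filesystem_avail_bytes{mountpoint=~"/ais/.*"}
            / node_filesystem_size_bytes{mountpoint=~"/ais/.*"} < 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "AIS target mountpath below 5% usable capacity"
```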

But again, the only limitation is that aistore currently does not take into account that nodes may differ hardware-wise. That's all.

johnzielke commented 2 months ago

Thank you! Sorry, my message was worded poorly: by "setting up a new one" I meant a complete rolling replacement of the nodes, in order to take advantage of the higher capacity available in newer hardware.

Out of interest: while I understand that this is not currently supported nor planned, would the fundamental design and the HRW algorithm allow an implementation that takes the storage available on each node into account and distributes data accordingly? From my understanding, the biggest problem is posed by erasure coding and the slices it needs to store. At minimum, that would prevent one from simply running multiple AIS targets on a larger server, since parity data might end up stored on the same physical node.

alex-aizman commented 2 months ago

The optimization objective is performance, or throughput to be precise, given limited capacity. It's one thing that an AIS target with less storage space will run out of it first; it's another that the slowest target will throttle the entire system. The idea of optimizing across heterogeneous hardware is therefore not very appealing. Could be a good academic paper, though; I can see that.