Open florence-crl opened 4 years ago
Zendesk ticket #4842 has been linked to this issue.
These needs also came up at the recent Education Offsite.
Now that we have automatic ballast files on node startup, do we need detailed guidance here still? @mwang1026, thoughts? Users can still set the ballast-size, so maybe we do?
I don't think so? We have a default size that we can document (I believe it's something like 1GB or 1% of disk) (But we should check before documenting that exactly :D )
Florence Morris (florence-crl) commented:
From @knz: there's a formula that's easier to understand than to explain. The idea is to combine two things:
1) How fast their data grows over time. To know this, they should use metrics/monitoring and plot their storage growth over days/weeks/months. They also need to understand their storage spikes (e.g., Bulk I/O events and the disk space they require).
2) How fast they are able to react to a "low storage" condition, e.g., by adding nodes or more disk space. Some businesses can react within 1 day; others need 2 weeks to work on it.
Once they know these two things, they need to choose a ballast that covers the disk space growth (1) expected over their reaction period (2).
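To make the arithmetic concrete, here is a minimal sketch of that rule of thumb, not an official formula. The function name `ballast_size_bytes`, the optional `spike_bytes` headroom term, and the numbers in the usage line are all hypothetical and chosen only for illustration.

```python
import math

GIB = 1024 ** 3


def ballast_size_bytes(growth_bytes_per_day: float,
                       reaction_days: float,
                       spike_bytes: float = 0.0) -> int:
    """Estimate a ballast size as (growth rate) x (reaction period).

    spike_bytes is an assumed extra allowance for known storage spikes
    (e.g. bulk I/O); it is not part of the rule as stated in the thread.
    """
    if growth_bytes_per_day < 0 or reaction_days < 0 or spike_bytes < 0:
        raise ValueError("inputs must be non-negative")
    return math.ceil(growth_bytes_per_day * reaction_days + spike_bytes)


# Hypothetical inputs: storage grows 2 GiB/day and the team needs 14 days to react.
print(ballast_size_bytes(2 * GIB, 14) / GIB)  # 28.0
```

The growth rate would come from whatever storage-growth plot the operator already keeps; the sketch only multiplies it out.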
Examples:
One layer of complexity is that the intermediate state of the growth can appear larger than the long-term state, because of RocksDB compactions. For example if they create a lot of data quickly, there will be more disk usage than what they have put in their SQL, until RocksDB compacts it.
Another layer is MVCC: if they delete data, the data is still around until it is GC'ed (zone config, default 25 hours). So if their workload is delete-heavy they need to consider that.
Both things can be reliably ignored if their disk usage evolves slowly (which is common) and they can monitor it at a high level (e.g. our capacity metric in the UI, or their own export using Prometheus).
From @jseldess: An addition is that we need to strongly recommend that they put alerts in place to notify them of “low storage” conditions so they can set their process in motion. For example, they could alert on Prometheus metrics when a node is running low on disk space. Ideally, a customer shouldn’t get to the point where they need to use a ballast file.
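One way to tie such an alert back to the reaction period is to fire when the remaining headroom, measured in days of growth, drops to the time the team needs to respond. The sketch below assumes the available bytes and growth rate have already been read from monitoring; the function names and sample numbers are hypothetical, and no particular metric names or alerting tool are implied.

```python
GIB = 1024 ** 3


def days_until_full(available_bytes: float, growth_bytes_per_day: float) -> float:
    """Days of headroom left at the current growth rate."""
    if available_bytes < 0:
        raise ValueError("available_bytes must be non-negative")
    if growth_bytes_per_day <= 0:
        # Flat or shrinking usage: the disk never fills at this rate.
        return float("inf")
    return available_bytes / growth_bytes_per_day


def should_alert(available_bytes: float,
                 growth_bytes_per_day: float,
                 reaction_days: float) -> bool:
    """True when headroom is no longer than the time needed to react."""
    return days_until_full(available_bytes, growth_bytes_per_day) <= reaction_days


# Hypothetical node: 20 GiB free, growing 2 GiB/day, 14-day reaction period.
print(should_alert(20 * GIB, 2 * GIB, 14))  # True
```

In practice the threshold would likely include a safety margin on top of the reaction period, so the alert fires well before the ballast file is needed.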
cc: @Annebirzin @piyush-singh since this ties into observability and alerting
Jira Issue: DOC-453