Open cloudbehl opened 6 years ago
@r0h4n @brainfunked @shtripat Please review. @Tendrl/tendrl-qe Please verify that the disk usage for a single cluster is approximately the value calculated via this formula. Please let me know if you need any help.
@ltrilety @r0h4n @nthomas-redhat @Tendrl/qe @jjkabrown1
Have we done any Graphite disk usage estimates for the following sizes? (I tried to come up with three "sizes" based on some typical deployments.)
Small
Medium
Large
In https://github.com/Tendrl/commons/issues/819 @ltrilety mentioned "1 day of metrics for 6 gluster servers [Small] takes about 10G."
@julienlim, the formula for the calculation is already provided in https://github.com/Tendrl/monitoring-integration/issues/261#issue-273785961, which is as below:
Size on disk: 49767242 bytes (~48 MB per cluster) + (no of hosts * ~382 MB per host) + (no of LVM disks * 24 MB per disk) + (no of virtual disks * 30 MB per disk) + (no of bricks * 86 MB per brick) + (no of devices * 36 MB per device) + (no of volumes * ~44.5 MB per volume) + (no of hosts * 12 MB per host) + (no of bricks * 98 MB per brick) + (no of devices * 36 MB per device)
The size may vary depending on the number of disks, LVMs, etc., so I would recommend calculating this on a per-deployment basis. What do you think?
@nthomas-redhat @julienlim @shtripat It's great that we have a formula for the 180-day period, but from what I see we should simplify it, as it's not easy to read. We could use the deployments in https://github.com/Tendrl/monitoring-integration/issues/261#issuecomment-367400672 and provide some concrete numbers. Of course, first we have to decide how long we keep the metrics data. Also, don't forget to account for the free space needed for un-managing. One more thing: those typical deployments look a little strange to me; for example, medium ends with 12 nodes but large begins with 24, so the question is where a 20-node cluster belongs, and so on. Anyway, the idea is there.
@nthomas-redhat @shtripat @ltrilety @jjkabrown1
It's good to have a formula for the 180-day period, but we'll need to adjust it according to the retention policies.
That being said, this formula is too cumbersome for someone to calculate by hand. We need to provide an easy-to-use calculator (think Ceph's pgcalc, or some kind of spreadsheet) where the user can input some numbers (e.g. # nodes per cluster, # clusters, # volumes, # bricks, and how long to retain data) and it provides an estimate.
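A minimal sketch of what such a calculator could look like, using the per-entity figures from the formula earlier in this thread. The function name is hypothetical, and the assumption that usage scales linearly with retention days is mine, not anything Tendrl ships today:

```python
# Approximate sizes in MB under the default 60s:180d retention, taken from
# the per-cluster formula in this thread. Host/brick/device terms appear
# twice in that formula (node tree + volume tree), so they are summed here.
MB_PER_CLUSTER = 48
MB_PER_HOST = 382 + 12
MB_PER_LVM_DISK = 24
MB_PER_VIRTUAL_DISK = 30
MB_PER_BRICK = 86 + 98
MB_PER_DEVICE = 36 + 36
MB_PER_VOLUME = 44.5


def estimate_graphite_mb(hosts, lvm_disks, virtual_disks, bricks, devices,
                         volumes, retention_days=180):
    """Rough per-cluster Graphite footprint in MB.

    Assumes usage scales linearly with the retention period.
    """
    base = (MB_PER_CLUSTER
            + hosts * MB_PER_HOST
            + lvm_disks * MB_PER_LVM_DISK
            + virtual_disks * MB_PER_VIRTUAL_DISK
            + bricks * MB_PER_BRICK
            + devices * MB_PER_DEVICE
            + volumes * MB_PER_VOLUME)
    return base * retention_days / 180.0
```

The same constants could back a spreadsheet; the point is that only the entity counts and the retention period need to be user inputs.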
@ltrilety As to the deployment sizes, I took a first stab at coming up with something, and it does need further discussion and tweaking. Suggestions?
@cloudbehl, @r0h4n , let us sync up and put together guidelines for possible standard configurations
@nthomas-redhat ack!
@nthomas-redhat Please provide the change in disk size requirements for Graphite for the scenarios below.
A note for the assessment: don't forget that un-managing skews the accounting, since it takes all the data from Graphite and saves it under the /usr/share/tendrl/graphite/archive path. That brings several questions:
For standard cluster sizes please see below:
Small Configuration
Up to 8 nodes; 6-8 volumes per cluster. Bricks: 2-3 per node for replicated volumes with RAID 6, and 12-36 per node for EC volumes.
Recommendation: 200 GB of free space per cluster for this configuration.
Medium Configuration
9-16 nodes; 6-8 volumes per cluster. Bricks: 2-3 per node for replicated volumes with RAID 6, and 12-36 per node for EC volumes.
Recommendation: 350 GB of free space per cluster for this configuration.
Large Configuration
17-24 nodes; 6-8 volumes per cluster. Bricks: 2-3 per node for replicated volumes with RAID 6, and 12-36 per node for EC volumes.
Recommendation: 500 GB of free space per cluster for this configuration.
Graphite Disk Usage Calculation
Whisper storage utilization
Per data point: 12 bytes. Per metric: 12 bytes * number of data points. So for a 60s:180d retention: (60 * 24 * 180 data points) * 12 bytes = 3110400 bytes (~2.97 MB), or for a 10s:180d retention: (6 * 60 * 24 * 180 data points) * 12 bytes = 18662400 bytes (~17.8 MB).
The calculations below are based on Tendrl's default storage retention policy: all metrics consist of data points at a 60-second interval, stored for 180 days.
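The per-metric arithmetic above can be reproduced with a short script. Note this counts only the data points themselves; real whisper files also carry a small fixed header per archive, so actual files are slightly larger:

```python
POINT_SIZE = 12  # bytes per whisper data point (timestamp + value)


def metric_size_bytes(interval_s, retention_days):
    """Raw data-point storage for one metric at the given retention.

    Excludes the small per-file whisper header.
    """
    points = (retention_days * 24 * 3600) // interval_s
    return points * POINT_SIZE


# 60s:180d retention -> 259200 points -> 3110400 bytes (~2.97 MB)
print(metric_size_bytes(60, 180))
# 10s:180d retention -> 1555200 points -> 18662400 bytes (~17.8 MB)
print(metric_size_bytes(10, 180))
```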
There are currently two trees to enable Grafana navigation:
Cluster -> Volume -> Node -> Brick -> Block Device
Cluster -> Node -> Brick -> Block Device
Cluster -> Volume -> Node -> Brick -> Block Device
This tree contains all the cluster-specific information for Volumes, Nodes, Bricks and Block Devices. It does NOT contain node-specific information: nodes contribute only information that relates to the cluster, such as rebalance data.
Block Device
Size on disk: 37325481 bytes (~36 MB)
Structure:
```
├── disk_octets
│   ├── read.wsp
│   └── write.wsp
├── disk_ops
│   ├── read.wsp
│   └── write.wsp
├── disk_time
│   ├── read.wsp
│   └── write.wsp
├── mount_utilization
│   ├── percent_used.wsp
│   ├── total.wsp
│   └── used.wsp
└── utilization
    ├── percent_used.wsp
    ├── total.wsp
    └── used.wsp
```
Brick
Size on disk: 102648857 bytes (~98 MB per brick) + (no of devices * 36 MB per device)
Structure:
```
├── connections_count.wsp
├── device
│   └── vda
│       ├── disk_octets
│       │   ├── read.wsp
│       │   └── write.wsp
│       ├── disk_ops
│       │   ├── read.wsp
│       │   └── write.wsp
│       ├── disk_time
│       │   ├── read.wsp
│       │   └── write.wsp
│       ├── mount_utilization
│       │   ├── percent_used.wsp
│       │   ├── total.wsp
│       │   └── used.wsp
│       └── utilization
│           ├── percent_used.wsp
│           ├── total.wsp
│           └── used.wsp
├── entry_ops.wsp
├── fop
│   ├── GETXATTR
│   │   ├── hits.wsp
│   │   ├── latencyAvg.wsp
│   │   ├── latencyMax.wsp
│   │   └── latencyMin.wsp
│   ├── LOOKUP
│   │   ├── hits.wsp
│   │   ├── latencyAvg.wsp
│   │   ├── latencyMax.wsp
│   │   └── latencyMin.wsp
│   ├── OPENDIR
│   │   ├── hits.wsp
│   │   ├── latencyAvg.wsp
│   │   ├── latencyMax.wsp
│   │   └── latencyMin.wsp
│   └── READDIR
│       ├── hits.wsp
│       ├── latencyAvg.wsp
│       ├── latencyMax.wsp
│       └── latencyMin.wsp
├── healed_cnt.wsp
├── heal_failed_cnt.wsp
├── inode_ops.wsp
├── inode_utilization
│   ├── gauge-total.wsp
│   ├── gauge-used.wsp
│   └── percent-percent_bytes.wsp
├── iops
│   ├── gauge-read.wsp
│   └── gauge-write.wsp
├── lock_ops.wsp
├── read_write_ops.wsp
├── split_brain_cnt.wsp
├── status.wsp
└── utilization
    ├── gauge-total.wsp
    ├── gauge-used.wsp
    └── percent-percent_bytes.wsp
```
Node
Size on disk: 12441712 bytes (~12 MB per host) + (no of bricks * 98 MB per brick) + (no of devices * 36 MB per device)
Structure:
```
├── bricks
│   └── |root|gluster_bricks|vol1_b2
│       ├── connections_count.wsp
│       ├── device
│       │   └── vda
│       │       ├── disk_octets
│       │       │   ├── read.wsp
│       │       │   └── write.wsp
│       │       ├── disk_ops
│       │       │   ├── read.wsp
│       │       │   └── write.wsp
│       │       ├── disk_time
│       │       │   ├── read.wsp
│       │       │   └── write.wsp
│       │       ├── mount_utilization
│       │       │   ├── percent_used.wsp
│       │       │   ├── total.wsp
│       │       │   └── used.wsp
│       │       └── utilization
│       │           ├── percent_used.wsp
│       │           ├── total.wsp
│       │           └── used.wsp
│       ├── inode_utilization
│       │   ├── gauge-total.wsp
│       │   ├── gauge-used.wsp
│       │   └── percent-percent_bytes.wsp
│       ├── status.wsp
│       └── utilization
│           ├── gauge-total.wsp
│           ├── gauge-used.wsp
│           └── percent-percent_bytes.wsp
├── rebalance_bytes.wsp
├── rebalance_failures.wsp
├── rebalance_files.wsp
└── rebalance_skipped.wsp
```
Volume
Size on disk: 46656545 bytes (~44.5 MB per volume) + (no of hosts * 12 MB per host) + (no of bricks * 98 MB per brick) + (no of devices * 36 MB per device)
Structure:
```
├── brick_count
│   ├── down.wsp
│   ├── total.wsp
│   └── up.wsp
├── geo_rep_session
│   ├── down.wsp
│   ├── partial.wsp
│   ├── total.wsp
│   └── up.wsp
├── nodes
│   ├── dhcp43-54_lab_eng_blr_redhat_com
│   │   ├── bricks
│   │   │   └── |root|gluster_bricks|vol1_b2
│   │   │       ├── connections_count.wsp
│   │   │       ├── device
│   │   │       │   └── vda
│   │   │       │       ├── disk_octets
│   │   │       │       │   ├── read.wsp
│   │   │       │       │   └── write.wsp
│   │   │       │       ├── disk_ops
│   │   │       │       │   ├── read.wsp
│   │   │       │       │   └── write.wsp
│   │   │       │       ├── disk_time
│   │   │       │       │   ├── read.wsp
│   │   │       │       │   └── write.wsp
│   │   │       │       ├── mount_utilization
│   │   │       │       │   ├── percent_used.wsp
│   │   │       │       │   ├── total.wsp
│   │   │       │       │   └── used.wsp
│   │   │       │       └── utilization
│   │   │       │           ├── percent_used.wsp
│   │   │       │           ├── total.wsp
│   │   │       │           └── used.wsp
│   │   │       ├── inode_utilization
│   │   │       │   ├── gauge-total.wsp
│   │   │       │   ├── gauge-used.wsp
│   │   │       │   └── percent-percent_bytes.wsp
│   │   │       ├── status.wsp
│   │   │       └── utilization
│   │   │           ├── gauge-total.wsp
│   │   │           ├── gauge-used.wsp
│   │   │           └── percent-percent_bytes.wsp
│   │   ├── rebalance_bytes.wsp
│   │   ├── rebalance_failures.wsp
│   │   ├── rebalance_files.wsp
│   │   └── rebalance_skipped.wsp
│   └── dhcp43-83_lab_eng_blr_redhat_com
│       ├── rebalance_bytes.wsp
│       ├── rebalance_failures.wsp
│       ├── rebalance_files.wsp
│       └── rebalance_skipped.wsp
├── pcnt_used.wsp
├── rebal_status.wsp
├── snap_count.wsp
├── state.wsp
├── status.wsp
├── subvol_count.wsp
├── usable_capacity.wsp
└── used_capacity.wsp
```
Cluster -> Node -> Brick -> Block Device
This tree contains all the cluster-specific information for Nodes, Bricks and Block Devices.
Block Device
Size on disk: 37325481 bytes (~36 MB)
Structure:
```
├── disk_octets
│   ├── read.wsp
│   └── write.wsp
├── disk_ops
│   ├── read.wsp
│   └── write.wsp
├── disk_time
│   ├── read.wsp
│   └── write.wsp
├── mount_utilization
│   ├── percent_used.wsp
│   ├── total.wsp
│   └── used.wsp
└── utilization
    ├── percent_used.wsp
    ├── total.wsp
    └── used.wsp
```
Brick - Without file operations
Size on disk: 40435965 bytes (~39 MB per brick) + (no of devices * 36 MB per device)
Structure:
```
├── device
│   └── vda
│       ├── disk_octets
│       │   ├── read.wsp
│       │   └── write.wsp
│       ├── disk_ops
│       │   ├── read.wsp
│       │   └── write.wsp
│       ├── disk_time
│       │   ├── read.wsp
│       │   └── write.wsp
│       ├── mount_utilization
│       │   ├── percent_used.wsp
│       │   ├── total.wsp
│       │   └── used.wsp
│       └── utilization
│           ├── percent_used.wsp
│           ├── total.wsp
│           └── used.wsp
├── entry_ops.wsp
├── inode_ops.wsp
├── inode_utilization
│   ├── gauge-total.wsp
│   ├── gauge-used.wsp
│   └── percent-percent_bytes.wsp
├── iops
│   ├── gauge-read.wsp
│   └── gauge-write.wsp
├── lock_ops.wsp
├── read_write_ops.wsp
├── status.wsp
└── utilization
    ├── gauge-total.wsp
    ├── gauge-used.wsp
    └── percent-percent_bytes.wsp
```
Brick - With file operations
Size on disk: 90203242 bytes (~86 MB per brick) + (no of devices * 36 MB per device)
Structure:
```
├── device
│   └── vda
│       ├── disk_octets
│       │   ├── read.wsp
│       │   └── write.wsp
│       ├── disk_ops
│       │   ├── read.wsp
│       │   └── write.wsp
│       ├── disk_time
│       │   ├── read.wsp
│       │   └── write.wsp
│       ├── mount_utilization
│       │   ├── percent_used.wsp
│       │   ├── total.wsp
│       │   └── used.wsp
│       └── utilization
│           ├── percent_used.wsp
│           ├── total.wsp
│           └── used.wsp
├── entry_ops.wsp
├── fop
│   ├── GETXATTR
│   │   ├── hits.wsp
│   │   ├── latencyAvg.wsp
│   │   ├── latencyMax.wsp
│   │   └── latencyMin.wsp
│   ├── LOOKUP
│   │   ├── hits.wsp
│   │   ├── latencyAvg.wsp
│   │   ├── latencyMax.wsp
│   │   └── latencyMin.wsp
│   ├── OPENDIR
│   │   ├── hits.wsp
│   │   ├── latencyAvg.wsp
│   │   ├── latencyMax.wsp
│   │   └── latencyMin.wsp
│   └── READDIR
│       ├── hits.wsp
│       ├── latencyAvg.wsp
│       ├── latencyMax.wsp
│       └── latencyMin.wsp
├── inode_ops.wsp
├── inode_utilization
│   ├── gauge-total.wsp
│   ├── gauge-used.wsp
│   └── percent-percent_bytes.wsp
├── iops
│   ├── gauge-read.wsp
│   └── gauge-write.wsp
├── lock_ops.wsp
├── read_write_ops.wsp
├── status.wsp
└── utilization
    ├── gauge-total.wsp
    ├── gauge-used.wsp
    └── percent-percent_bytes.wsp
```
Node
Size on disk: 401282895 bytes (~382 MB per host) + (no of LVM disks * 24 MB per disk) + (no of virtual disks * 30 MB per disk) + (no of bricks * 86 MB per brick) + (no of devices * 36 MB per device)
Structure:
```
.
├── aggregation-memory-sum
│   └── memory.wsp
├── aggregation-swap-sum
│   └── swap.wsp
├── brick_count
│   ├── down.wsp
│   ├── total.wsp
│   └── up.wsp
├── bricks
│   └── |root|bricks|v1
│       ├── device
│       │   └── vda
│       │       ├── disk_octets
│       │       │   ├── read.wsp
│       │       │   └── write.wsp
│       │       ├── disk_ops
│       │       │   ├── read.wsp
│       │       │   └── write.wsp
│       │       ├── disk_time
│       │       │   ├── read.wsp
│       │       │   └── write.wsp
│       │       ├── mount_utilization
│       │       │   ├── percent_used.wsp
│       │       │   ├── total.wsp
│       │       │   └── used.wsp
│       │       └── utilization
│       │           ├── percent_used.wsp
│       │           ├── total.wsp
│       │           └── used.wsp
│       ├── entry_ops.wsp
│       ├── inode_ops.wsp
│       ├── inode_utilization
│       │   ├── gauge-total.wsp
│       │   ├── gauge-used.wsp
│       │   └── percent-percent_bytes.wsp
│       ├── iops
│       │   ├── gauge-read.wsp
│       │   └── gauge-write.wsp
│       ├── lock_ops.wsp
│       ├── read_write_ops.wsp
│       ├── status.wsp
│       └── utilization
│           ├── gauge-total.wsp
│           ├── gauge-used.wsp
│           └── percent-percent_bytes.wsp
├── cpu
│   ├── percent-idle.wsp
│   ├── percent-interrupt.wsp
│   ├── percent-nice.wsp
│   ├── percent-softirq.wsp
│   ├── percent-steal.wsp
│   ├── percent-system.wsp
│   ├── percent-user.wsp
│   └── percent-wait.wsp
├── df-boot
│   ├── df_complex-free.wsp
│   ├── df_complex-reserved.wsp
│   ├── df_complex-used.wsp
│   ├── df_inodes-free.wsp
│   ├── df_inodes-reserved.wsp
│   ├── df_inodes-used.wsp
│   ├── percent_bytes-free.wsp
│   ├── percent_bytes-reserved.wsp
│   ├── percent_bytes-used.wsp
│   ├── percent_inodes-free.wsp
│   ├── percent_inodes-reserved.wsp
│   └── percent_inodes-used.wsp
├── df-dev
│   ├── df_complex-free.wsp
│   ├── df_complex-reserved.wsp
│   ├── df_complex-used.wsp
│   ├── df_inodes-free.wsp
│   ├── df_inodes-reserved.wsp
│   ├── df_inodes-used.wsp
│   ├── percent_bytes-free.wsp
│   ├── percent_bytes-reserved.wsp
│   ├── percent_bytes-used.wsp
│   ├── percent_inodes-free.wsp
│   ├── percent_inodes-reserved.wsp
│   └── percent_inodes-used.wsp
├── df-dev-shm
│   ├── df_complex-free.wsp
│   ├── df_complex-reserved.wsp
│   ├── df_complex-used.wsp
│   ├── df_inodes-free.wsp
│   ├── df_inodes-reserved.wsp
│   ├── df_inodes-used.wsp
│   ├── percent_bytes-free.wsp
│   ├── percent_bytes-reserved.wsp
│   ├── percent_bytes-used.wsp
│   ├── percent_inodes-free.wsp
│   ├── percent_inodes-reserved.wsp
│   └── percent_inodes-used.wsp
├── df-root
│   ├── df_complex-free.wsp
│   ├── df_complex-reserved.wsp
│   ├── df_complex-used.wsp
│   ├── df_inodes-free.wsp
│   ├── df_inodes-reserved.wsp
│   ├── df_inodes-used.wsp
│   ├── percent_bytes-free.wsp
│   ├── percent_bytes-reserved.wsp
│   ├── percent_bytes-used.wsp
│   ├── percent_inodes-free.wsp
│   ├── percent_inodes-reserved.wsp
│   └── percent_inodes-used.wsp
├── df-run
│   ├── df_complex-free.wsp
│   ├── df_complex-reserved.wsp
│   ├── df_complex-used.wsp
│   ├── df_inodes-free.wsp
│   ├── df_inodes-reserved.wsp
│   ├── df_inodes-used.wsp
│   ├── percent_bytes-free.wsp
│   ├── percent_bytes-reserved.wsp
│   ├── percent_bytes-used.wsp
│   ├── percent_inodes-free.wsp
│   ├── percent_inodes-reserved.wsp
│   └── percent_inodes-used.wsp
├── df-run-user-0
│   ├── df_complex-free.wsp
│   ├── df_complex-reserved.wsp
│   ├── df_complex-used.wsp
│   ├── df_inodes-free.wsp
│   ├── df_inodes-reserved.wsp
│   ├── df_inodes-used.wsp
│   ├── percent_bytes-free.wsp
│   ├── percent_bytes-reserved.wsp
│   ├── percent_bytes-used.wsp
│   ├── percent_inodes-free.wsp
│   ├── percent_inodes-reserved.wsp
│   └── percent_inodes-used.wsp
├── df-sys-fs-cgroup
│   ├── df_complex-free.wsp
│   ├── df_complex-reserved.wsp
│   ├── df_complex-used.wsp
│   ├── df_inodes-free.wsp
│   ├── df_inodes-reserved.wsp
│   ├── df_inodes-used.wsp
│   ├── percent_bytes-free.wsp
│   ├── percent_bytes-reserved.wsp
│   ├── percent_bytes-used.wsp
│   ├── percent_inodes-free.wsp
│   ├── percent_inodes-reserved.wsp
│   └── percent_inodes-used.wsp
├── disk-dm-0
│   ├── disk_io_time
│   │   ├── io_time.wsp
│   │   └── weighted_io_time.wsp
│   ├── disk_octets
│   │   ├── read.wsp
│   │   └── write.wsp
│   ├── disk_ops
│   │   ├── read.wsp
│   │   └── write.wsp
│   └── disk_time
│       ├── read.wsp
│       └── write.wsp
├── disk-vda
│   ├── disk_io_time
│   │   ├── io_time.wsp
│   │   └── weighted_io_time.wsp
│   ├── disk_merged
│   │   ├── read.wsp
│   │   └── write.wsp
│   ├── disk_octets
│   │   ├── read.wsp
│   │   └── write.wsp
│   ├── disk_ops
│   │   ├── read.wsp
│   │   └── write.wsp
│   └── disk_time
│       ├── read.wsp
│       └── write.wsp
├── interface-eth0
│   ├── if_dropped
│   │   ├── rx.wsp
│   │   └── tx.wsp
│   ├── if_errors
│   │   ├── rx.wsp
│   │   └── tx.wsp
│   ├── if_octets
│   │   ├── rx.wsp
│   │   └── tx.wsp
│   └── if_packets
│       ├── rx.wsp
│       └── tx.wsp
├── memory
│   ├── memory-buffered.wsp
│   ├── memory-cached.wsp
│   ├── memory-free.wsp
│   ├── memory-slab_recl.wsp
│   ├── memory-slab_unrecl.wsp
│   ├── memory-used.wsp
│   ├── percent-buffered.wsp
│   ├── percent-cached.wsp
│   ├── percent-free.wsp
│   ├── percent-slab_recl.wsp
│   ├── percent-slab_unrecl.wsp
│   └── percent-used.wsp
├── ping
│   ├── ping-10_70_42_151.wsp
│   ├── ping_droprate-10_70_42_151.wsp
│   └── ping_stddev-10_70_42_151.wsp
├── status.wsp
└── swap
    ├── percent-cached.wsp
    ├── percent-free.wsp
    ├── percent-used.wsp
    ├── swap-cached.wsp
    ├── swap-free.wsp
    ├── swap_io-in.wsp
    ├── swap_io-out.wsp
    └── swap-used.wsp
```
Single cluster (Approx utilization of a cluster)
Size on disk: 49767242 bytes (~48 MB per cluster) + (no of hosts * ~382 MB per host) + (no of LVM disks * 24 MB per disk) + (no of virtual disks * 30 MB per disk) + (no of bricks * 86 MB per brick) + (no of devices * 36 MB per device) + (no of volumes * ~44.5 MB per volume) + (no of hosts * 12 MB per host) + (no of bricks * 98 MB per brick) + (no of devices * 36 MB per device)
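As a worked example of the formula, the snippet below plugs in hypothetical counts loosely resembling the small configuration above (8 nodes, 1 LVM disk per node, 3 bricks per node, 1 device per brick, 6 volumes); the counts are illustrative, not a reference deployment:

```python
# Hypothetical entity counts for one small cluster.
hosts, lvm_disks, virtual_disks = 8, 8, 0
bricks, devices, volumes = 24, 24, 6

# Evaluate the per-cluster formula term by term (all figures in MB,
# 60s:180d retention).
total_mb = (48                          # per cluster
            + hosts * 382 + hosts * 12  # node tree + volume-tree host share
            + lvm_disks * 24
            + virtual_disks * 30
            + bricks * 86 + bricks * 98  # both brick copies
            + devices * 36 + devices * 36
            + volumes * 44.5)
print(f"{total_mb} MB (~{total_mb / 1024:.1f} GB) for 180 days")
```

The result is well under the 200 GB small-configuration recommendation above, which presumably leaves headroom for growth, the un-manage archive copies, and larger brick/device counts.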