julienlim opened 7 years ago
@sankarshanmukhopadhyay @brainfunked @r0h4n @nthomas-redhat @Tendrl/qe @Tendrl/tendrl_frontend @japplewhite @rghatvis@redhat.com @mcarrano
This dashboard proposal is ready for review. Note: API impact, module impact, etc. has to be filled out by someone else -- maybe @cloudbehl, @anmolbabu, or @anivargi.
Suggested Labels (for folks who have permissions to label the spec):
I have a question about panel numbers. IIANM, Grafana doesn't use any number for panels in its configuration. Moreover, I noticed that some panels have the same number as others. Is that intended, or just a coincidence? Is it even possible to have the same panel on more than one row?
@ltrilety I put the panel numbers in for specification purposes (so that if someone comments, they can reference a panel #); they are not meant to be implemented as panel #s. If some panels share the same panel number, that's a typo on my end. I'll fix it. Thanks.
Row 1, Panel 1 (Health): "show host status"? Valid volume states are up, down, up(partial), and up(degraded).
Panel 4 (Disks): no platform support for disk status as such; this won't be supported for now.
Panel 5 (Geo-Replication Sessions): valid states are up, down, and up(partial).
Panel 6 (Healing): aren't "n healing needed" and "n split brain" the same?
Panel 7 (Rebalance): the chart type is not specified.
Row 3, Panel 18 (IO Size): not MVP.
Panels 20 and 21 (LVM thin pool metadata / data usage): does it make sense to aggregate these stats at the volume level? They are LVM-level stats specific to bricks; what value does aggregating them at the volume level add?
Items specified in Rows 4, 5, 6, and 7 are not MVP.
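For concreteness, the volume states above could be derived from per-brick status roughly as follows (an illustrative sketch only, not the actual Tendrl computation; the function and its inputs are hypothetical):

```python
def volume_health(brick_states, redundancy_per_subvol):
    """Map per-subvolume brick states onto the gstatus-style volume states.

    brick_states: list of per-subvolume lists of "up"/"down" strings.
    redundancy_per_subvol: bricks a subvolume can lose (e.g. replica - 1)
    before some data becomes unavailable.
    """
    if all(s == "down" for subvol in brick_states for s in subvol):
        return "down"
    worst = "up"
    for subvol in brick_states:
        down = sum(1 for s in subvol if s == "down")
        if down == 0:
            continue
        if down > redundancy_per_subvol:
            return "up(partial)"   # some data is unreachable
        worst = "up(degraded)"     # redundancy reduced, data still reachable
    return worst

# replica-3 volume (redundancy 2), one brick down in the first subvolume
print(volume_health([["up", "up", "down"], ["up", "up", "up"]], 2))  # up(degraded)
```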
@nthomas-redhat @japplewhite @jjkabrown1 @Tendrl/qe @Tendrl/tendrl-core @anmolbabu @cloudbehl @anivargi
(1) Volume health: I read https://github.com/gluster/gstatus to see what up(partial) vs. up(degraded) mean. The "show host status" is a typo (from cut-n-paste); it should read "show volume status" (I've fixed it above). I've updated Panel 1 in the spec above with the statuses per gstatus. What about lost quorum (in the case of an arbiter volume), or is this not supported or not possible?
(2) Ack. No support for Disk status. Will mark as FUTURE in the spec above (so it gets noted as a placeholder for future consideration).
(3) Geo-Replication Sessions - I've updated the statuses in Panel 5 in the above spec.
(4) Healing: healing and split brain are not the same. Healing is something that typically happens automatically (it does not require user intervention but gives an indication of how "healthy" the files are), and healing can happen after a split brain. Anything caused by a split brain (without parameters/policy configured to trigger self-heal) will require user action to manually resolve.
(5) Rebalance -- I was waiting to check with Jeff on whether we're doing rebalancing in the Tendrl UI or in Grafana. Based on some recent conversations, I will assume the latter and update this soon.
(6) IO Size -- not MVP. Ack. Will mark as FUTURE. This typically goes hand-in-hand with IOPS in storage management/monitoring applications.
(7) Row 4 is not clearly called out as MVP. Ack.
(8) Rows 5, 6, and 7 all come from the volume storage profiling that we will be enabling/disabling during Import Cluster. It was called out in the Gluster Metrics discussion that we would collect this information. If we collect it, I was assuming we would visualize it.
@japplewhite @jjkabrown1 - please comment.
> (1) Volume health: [...] What about lost quorum (in the case of an arbiter volume), or is this not supported or not possible?
The get-state CLI provides the quorum status, and Tendrl syncs this into etcd. Quorum status is derived from brick status (get-state). Brick status is also used in the volume health computation and will be reflected there.
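For reference, `gluster get-state` writes a flat key/value dump that can be scraped before it lands in etcd. A minimal sketch (the odir/file arguments and the `*.quorum_status` key naming are assumptions about the dump format, not a verified schema):

```python
import re
import subprocess

def get_state_values(outdir="/tmp", filename="gstate"):
    """Run `gluster get-state` and parse its flat `key: value` dump.

    Key names (e.g. Volume1.quorum_status) are assumptions about the
    dump format; verify against the gluster version in use.
    """
    subprocess.run(["gluster", "get-state", "odir", outdir, "file", filename],
                   check=True)
    values = {}
    with open(f"{outdir}/{filename}") as f:
        for line in f:
            m = re.match(r"([\w.\-]+):\s*(.*)", line.strip())
            if m:
                values[m.group(1)] = m.group(2)
    return values

state = get_state_values()
print({k: v for k, v in state.items() if k.endswith("quorum_status")})
```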
> (4) Healing: healing and split brain are not the same. [...]
My whole point is that "n healing needed - total number (n) of entries that need healing based on healinfo" is not reported by heal info. gluster's heal info provides the below information:
- No. of entries healed
- No. of entries in split-brain
- No. of heal failed entries
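For context, those three counters appear with that wording in the output of `gluster volume heal <VOLNAME> statistics`, so a collector could scrape them along these lines (a sketch; the exact output formatting varies across gluster versions):

```python
import re
import subprocess

def heal_counters(volume):
    """Sum healed / split-brain / heal-failed counters across all crawl
    blocks reported by `gluster volume heal <vol> statistics`."""
    out = subprocess.run(["gluster", "volume", "heal", volume, "statistics"],
                         capture_output=True, text=True, check=True).stdout
    patterns = {
        "healed": r"No\. of entries healed:\s*(\d+)",
        "split_brain": r"No\. of entries in split-brain:\s*(\d+)",
        "heal_failed": r"No\. of heal failed entries:\s*(\d+)",
    }
    return {key: sum(int(n) for n in re.findall(pat, out))
            for key, pat in patterns.items()}

print(heal_counters("gv0"))  # e.g. {'healed': 4, 'split_brain': 0, 'heal_failed': 1}
```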
@nthomas-redhat @japplewhite @jjkabrown1 @Tendrl/qe @Tendrl/tendrl-core @anmolbabu @cloudbehl @anivargi @mcarrano
> (1) Volume health: [...] What about lost quorum (in the case of an arbiter volume), or is this not supported or not possible?
>
> The get-state CLI provides the quorum status, and Tendrl syncs this into etcd. Quorum status is derived from brick status (get-state). Brick status is also used in the volume health computation and will be reflected there.
I'll take this to mean quorum is either not applicable or should not be shown at the volume level; I've removed/updated the panel accordingly.
> (4) Healing: healing and split brain are not the same. [...]
>
> My whole point is that "n healing needed - total number (n) of entries that need healing based on healinfo" is not reported by heal info. heal info provides the below information:
> - No. of entries healed
> - No. of entries in split-brain
> - No. of heal failed entries
I meant for "n healing needed" (the total number (n) of entries that need healing) to equal the no. of heal failed entries. This is meant to indicate that action is required to investigate. I'll update it to make this clearer and also include entries that were healed:
Updated Panel 7: Rebalance
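If the rebalance panel is fed from the CLI, its per-node counters could be scraped from gluster's XML output along these lines (a sketch; the element names are from observed `--xml` output and should be verified per version):

```python
import subprocess
import xml.etree.ElementTree as ET

def rebalance_status(volume):
    """Scrape per-node rebalance progress from `gluster --xml`."""
    out = subprocess.run(
        ["gluster", "--xml", "volume", "rebalance", volume, "status"],
        capture_output=True, text=True, check=True).stdout
    root = ET.fromstring(out)
    return [{"node": n.findtext("nodeName"),
             "files": n.findtext("files"),
             "status": n.findtext("statusStr")}
            for n in root.iter("node")]

print(rebalance_status("gv0"))
```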
@nthomas-redhat @japplewhite @jjkabrown1 @Tendrl/qe @Tendrl/tendrl-core @anmolbabu @cloudbehl @anivargi @mcarrano
Here's a mockup of the Volume Dashboard:
@julienlim the design differs by one extra panel, Capacity Utilization Trend. Is this expected behaviour?
Noting that the IOPS Trend panel is not present yet. A BZ has been created to track this as well: https://bugzilla.redhat.com/show_bug.cgi?id=1514054.
Dashboard Spec - Volume Dashboard
Display a default dashboard for a Gluster volume present in Tendrl that provides at-a-glance information about a single Gluster volume, including health and status information, key performance indicators (e.g. IOPS, throughput, etc.), and alerts that can draw the Tendrl user's (e.g. Gluster Administrator's) attention to potential issues in the volume, bricks, and disks.
Problem description
A Gluster Administrator wants to be able to answer the following questions by looking at the volume dashboard:
Use Cases
Use cases in the form of user stories:
As a Gluster Administrator, I want to view at-a-glance information about my Gluster volume that includes health and status information, key performance indicators (e.g. IOPS, throughput, etc.), and alerts that can draw my attention to potential issues in the volume, brick, and disk.
As a Gluster Administrator, I want to compare 1 or more metrics (e.g. IOPS, CPU, Memory, Network Load) across bricks within the volume
Compare utilization (e.g. IOPS, capacity, etc.) across bricks within a volume
Look at performance by brick (within a volume) to help diagnose poor performance on one brick caused by RAID 6 disk failure/rebuild/degradation
Proposed change
Provide a pre-canned, default volume dashboard in Grafana (initially launchable from the Tendrl UI, and eventually embedded into the Tendrl UI) that shows the following metrics, rendered either as text or as a chart/graph depending on the type of metric being displayed:
The Dashboard is composed of individual Panels (dashboard widgets) arranged on a number of Rows.
Note: The cluster and volume name or unique identifier should be visible at all times, and the user should be able to switch to another volume.
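As a sketch of the "pre-canned, launchable" part: the dashboard can be defined as JSON with a templated `$volume` variable (covering the volume-switching note above) and pushed through Grafana's `/api/dashboards/db` endpoint. The datasource metric path and names below are placeholders, not what Tendrl actually publishes:

```python
import json
import urllib.request

def push_volume_dashboard(grafana_url, api_key, cluster, volumes):
    """Create/update a templated per-volume dashboard via the Grafana HTTP API."""
    dashboard = {
        "title": f"Gluster Volume Dashboard - {cluster}",
        "templating": {"list": [{          # the volume switcher noted above
            "name": "volume", "type": "custom",
            "options": [{"text": v, "value": v} for v in volumes],
            "current": {"text": volumes[0], "value": volumes[0]},
        }]},
        "rows": [{"panels": [{
            "id": 17, "title": "IOPS Trend", "type": "graph",
            # placeholder metric path; substitute the real series name
            "targets": [{"target": f"tendrl.{cluster}.volumes.$volume.iops"}],
        }]}],
    }
    req = urllib.request.Request(
        f"{grafana_url}/api/dashboards/db",
        data=json.dumps({"dashboard": dashboard, "overwrite": True}).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"})
    return urllib.request.urlopen(req).read()
```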
Row 1
Panel 1: Health
Panel 2: Subvolumes
Panel 3: Bricks
[FUTURE] Panel 4: Disks
Panel 5: Geo-Replication Sessions
Panel 6: Healing
Panel 7: Rebalance
Panel 8: Snapshots
Panel 9: Connections Trend
Row 2
Panel 10: Capacity Utilization
Panel 11: Capacity Available
Panel 12: Growth Rate
Panel 13: Time Remaining (Weeks)
Panel 14: Inodes Utilization
Panel 15: Inodes Available
Panel 16: Quotas???
Row 3
Panel 17: IOPS Trend
[FUTURE] Panel 18: IO Size
Panel 19: Throughput Trend
Panel 20: LVM thin pool metadata %
Panel 21: LVM thin pool data usage %
Row 4
Panel 22: Top Connections
Panel 23: Top Utilized Bricks
Panel 24: Top Busiest Bricks
Row 5 (part of volume storage profile - should this be its own Volume Storage Profile dashboard for performance reasons?)
Panel 25: Top File Operation (% Latency)
Panel 26: Reads and Writes by Block Size
Row 6 (part of volume storage profile - should this be its own Volume Storage Profile dashboard for performance reasons?)
Panel 27: File Operations for Locks Trend
Panel 28: File Operations for Read/Write Operations Trend
Row 7 (part of volume storage profile - should this be its own Volume Storage Profile dashboard for performance reasons?)
Panel 29: File Operations for Inode Operations Trend
Panel 30: File Operations for Entry Operations Trend
[1] There exist approximately 46 file operations (FOPs) that would need to be mapped into 4 categories for the data to be consumable for troubleshooting, i.e. to identify patterns:
List of [FOP Categories] to FOPs:
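The four categories presumably correspond to the trend panels in Rows 6 and 7 (Locks, Read/Write, Inode Operations, Entry Operations). As an illustration of the mapping's shape only; the FOP assignments below are examples, not the full 46-FOP list:

```python
# Illustrative subset only; the full ~46-FOP mapping is the list above (TBD).
FOP_CATEGORIES = {
    "INODELK": "Locks", "FINODELK": "Locks", "ENTRYLK": "Locks", "LK": "Locks",
    "READ": "Read/Write", "WRITE": "Read/Write", "FSYNC": "Read/Write",
    "STAT": "Inode Operations", "SETATTR": "Inode Operations",
    "TRUNCATE": "Inode Operations",
    "CREATE": "Entry Operations", "MKDIR": "Entry Operations",
    "UNLINK": "Entry Operations", "RENAME": "Entry Operations",
}

def categorize(fop):
    """Bucket a raw FOP name into one of the four dashboard categories."""
    return FOP_CATEGORIES.get(fop.upper(), "Other")

print(categorize("inodelk"))  # Locks
```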
Note: The dashboard layout for the panels, and the panels within the rows, may need to change based on implementation and actual visualization, especially when certain metrics need to be aligned together, whether vertically or horizontally.
Alternatives
Create similar dashboard using PatternFly (www.patternfly.org) or d3.js components to show similar information within the Tendrl UI.
Data model impact:
TBD
Impacted Modules:
TBD
Tendrl API impact:
TBD
Notifications/Monitoring impact:
TBD
Tendrl/common impact:
TBD
Tendrl/node_agent impact:
TBD
Sds integration impact:
TBD
Security impact:
TBD
Other end user impact:
Users will mostly interact with this feature via the Grafana UI. Access via the Grafana API and the Tendrl API is also possible, but would require API calls that provide similar information.
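For example, the same series a panel charts can be pulled straight from the Graphite datasource behind Grafana via its render API (the metric path is a placeholder):

```python
import json
import urllib.parse
import urllib.request

def fetch_series(graphite_url, target, window="-1h"):
    """Pull raw datapoints from the Graphite render API, i.e. the same
    data a Grafana panel would chart."""
    qs = urllib.parse.urlencode({"target": target, "from": window,
                                 "format": "json"})
    with urllib.request.urlopen(f"{graphite_url}/render?{qs}") as resp:
        return json.load(resp)

series = fetch_series("http://graphite.example.com",
                      "tendrl.mycluster.volumes.gv0.iops")  # placeholder path
for s in series:
    print(s["target"], s["datapoints"][-5:])  # last few [value, timestamp] pairs
```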
Performance impact:
TBD
Other deployer impact:
Plug-ins required by Grafana will need to be packaged and installed with tendrl-ansible.
This default volume dashboard will need to be automatically generated whenever a cluster is imported to be managed by Tendrl.
Developer impact:
TBD
Implementation:
TBD
Assignee(s):
Primary assignee: @cloudbehl
Other contributors: @anmolbabu, @anivargi, @julienlim, @japplewhite
Work Items:
TBD
Estimate:
TBD
Dependencies:
TBD
Testing:
Test whether the health, status, and metrics displayed for a given volume are correct and that the information stays up to date as failures or other changes occur on the volume.
Documentation impact:
Documentation should include information about what's being displayed, with explanation where it is not immediately obvious from looking at the dashboard. This may include, but not be limited to, what each metric refers to, its unit of measurement, and how to use or apply it when troubleshooting, e.g. healing / split-brain issues, loss of quorum, etc.
References and Related GitHub Links: