Tendrl / specifications


Dashboard Spec - Cluster Dashboard #222

Closed: julienlim closed 6 years ago

julienlim commented 7 years ago

Dashboard Spec - Cluster Dashboard

Display a default dashboard for a single Gluster cluster present in Tendrl. The dashboard provides at-a-glance information about a single Gluster trusted storage pool, including health and status information, key performance indicators (e.g. IOPS, throughput), and alerts that draw the Tendrl user's (e.g. Gluster Administrator's) attention to potential issues in the cluster, hosts, volumes, and bricks.

Problem description

A Gluster Administrator wants to be able to answer the following questions by looking at the cluster dashboard:

Use Cases

Use cases in the form of user stories:

Proposed change

Provide a pre-canned, default cluster dashboard in Grafana (initially launchable from the Tendrl UI, and eventually embedded in it) that shows the following metrics, rendered either as text or as a chart/graph depending on the type of metric being displayed:

The Dashboard is composed of individual Panels (dashboard widgets) arranged on a number of Rows.

Note: The cluster name/ID should be visible at all times, and the user should be able to switch to another cluster.

Row 1

Panel (Dashboard Widget) 1: Health - Cluster Health

Panel 2: Hosts

Panel 3: Volumes

Panel 4: Bricks

[FUTURE] Panel 5: Disks

Panel 6: Snapshots

Panel 7: Geo-replication Sessions

Panel 8: Connections Trend

Row 2

Panel 9: Capacity Utilization

Panel 10: Capacity Available

Panel 11: Growth Rate

Panel 12: Time Remaining (Weeks)

[FUTURE] Panel 13: Services Trend

Panel 14: IOPS Trend

Panel 15: IO Size

Panel 16: Network Throughput Trend
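Panels 9 through 12 are related by simple arithmetic. The sketch below shows one possible way to derive them. It assumes that growth rate is a linear estimate taken from two capacity samples and that time remaining is the available capacity divided by the weekly growth. The spec does not define these formulas, and every name here (`CapacitySample`, `weeks_remaining`, and so on) is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

SECONDS_PER_WEEK = 7 * 24 * 3600


@dataclass
class CapacitySample:
    """One capacity reading for the cluster (hypothetical structure)."""
    timestamp: float  # seconds since the epoch
    used_bytes: int
    total_bytes: int


def utilization_percent(sample: CapacitySample) -> float:
    """Panel 9: share of total capacity currently in use."""
    return 100.0 * sample.used_bytes / sample.total_bytes


def available_bytes(sample: CapacitySample) -> int:
    """Panel 10: capacity that is still free."""
    return sample.total_bytes - sample.used_bytes


def growth_bytes_per_week(older: CapacitySample, newer: CapacitySample) -> float:
    """Panel 11: assumed linear growth between two samples, scaled to one week."""
    elapsed = newer.timestamp - older.timestamp
    if elapsed <= 0:
        raise ValueError("newer sample must be later than older sample")
    return (newer.used_bytes - older.used_bytes) * SECONDS_PER_WEEK / elapsed


def weeks_remaining(older: CapacitySample, newer: CapacitySample) -> Optional[float]:
    """Panel 12: weeks until full at the current rate, or None if usage is not growing."""
    rate = growth_bytes_per_week(older, newer)
    if rate <= 0:
        return None
    return available_bytes(newer) / rate
```

A linear estimate is the simplest choice; an implementation could equally fit a trend over a longer window.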

Row 3

Panel 17: Top volumes by capacity utilization

Panel 18: Top bricks by capacity utilization
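The two "top by capacity utilization" panels in Row 3 amount to a ranking. A minimal sketch follows, assuming usage is available as a mapping from volume or brick name to used and total bytes; the function name, the input shape, and the default limit of 5 are illustrative assumptions, not part of the spec.

```python
from typing import Dict, List, Tuple


def top_by_utilization(
    usage: Dict[str, Tuple[int, int]], limit: int = 5
) -> List[Tuple[str, float]]:
    """Rank entries by percent utilization, highest first.

    usage maps a volume or brick name to (used_bytes, total_bytes).
    Entries with a total of zero are skipped to avoid dividing by zero.
    """
    ranked = [
        (name, 100.0 * used / total)
        for name, (used, total) in usage.items()
        if total > 0
    ]
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked[:limit]
```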

Row 4

Panel 19: CPU used by Host

Panel 20: Memory used by Host

Panel 21: Ping Latency by Host Trend

Note: The layout of the rows, and of the panels within each row, may need to change based on implementation and actual visualization, especially when certain metrics need to be aligned together vertically or horizontally.
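For reference, the row and panel layout listed above can be written down as a small data structure. This is only an illustrative sketch: the `Panel` type and `LAYOUT` mapping are hypothetical and are not the Grafana dashboard JSON format, although the panel numbers, titles, and [FUTURE] markers are taken from the list above.

```python
from typing import Dict, List, NamedTuple


class Panel(NamedTuple):
    number: int
    title: str
    future: bool = False  # True for panels marked [FUTURE] in the spec


LAYOUT: Dict[str, List[Panel]] = {
    "Row 1": [
        Panel(1, "Health - Cluster Health"),
        Panel(2, "Hosts"),
        Panel(3, "Volumes"),
        Panel(4, "Bricks"),
        Panel(5, "Disks", future=True),
        Panel(6, "Snapshots"),
        Panel(7, "Geo-replication Sessions"),
        Panel(8, "Connections Trend"),
    ],
    "Row 2": [
        Panel(9, "Capacity Utilization"),
        Panel(10, "Capacity Available"),
        Panel(11, "Growth Rate"),
        Panel(12, "Time Remaining (Weeks)"),
        Panel(13, "Services Trend", future=True),
        Panel(14, "IOPS Trend"),
        Panel(15, "IO Size"),
        Panel(16, "Network Throughput Trend"),
    ],
    "Row 3": [
        Panel(17, "Top volumes by capacity utilization"),
        Panel(18, "Top bricks by capacity utilization"),
    ],
    "Row 4": [
        Panel(19, "CPU used by Host"),
        Panel(20, "Memory used by Host"),
        Panel(21, "Ping Latency by Host Trend"),
    ],
}


def non_future_panels(layout: Dict[str, List[Panel]]) -> List[Panel]:
    """Return the panels that are not marked [FUTURE], in layout order."""
    return [p for panels in layout.values() for p in panels if not p.future]
```

Keeping the layout as data makes it straightforward to regroup panels if the final visualization calls for a different arrangement.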

Alternatives

Create a similar dashboard using PatternFly (www.patternfly.org) or d3.js components to show similar information within the Tendrl UI.

Data model impact:

TBD

Impacted Modules:

TBD

Tendrl API impact:

TBD

Notifications/Monitoring impact:

TBD

Tendrl/common impact:

TBD

Tendrl/node_agent impact:

TBD

Sds integration impact:

TBD

Security impact:

TBD

Other end user impact:

Users will mostly interact with this feature via the Grafana UI. Access via the Grafana API and the Tendrl API is also possible, but would require API calls to retrieve similar information.
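As a rough illustration of the API route, the sketch below builds (but does not send) a request for a dashboard definition. It assumes a Grafana version that exposes `GET /api/dashboards/uid/<uid>` and accepts an API token as a bearer token. The host name, dashboard UID, and token in the usage note are placeholders, and the helper name is hypothetical.

```python
import urllib.request


def build_dashboard_request(
    base_url: str, dashboard_uid: str, api_token: str
) -> urllib.request.Request:
    """Build a GET request for a dashboard definition.

    Assumes the Grafana endpoint /api/dashboards/uid/<uid>; older Grafana
    releases address dashboards differently, so adjust the path as needed.
    """
    url = f"{base_url.rstrip('/')}/api/dashboards/uid/{dashboard_uid}"
    headers = {
        "Authorization": f"Bearer {api_token}",
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers)
```

To use it, pass the result to `urllib.request.urlopen` and parse the response body with `json.load`, for example with `build_dashboard_request("http://grafana.example.com:3000", "cluster-dashboard", "TOKEN")`.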

Performance impact:

TBD

Other deployer impact:

Developer impact:

TBD

Implementation:

TBD

Assignee(s):

Primary assignee: @cloudbehl

Other contributors: @anmolbabu, @anivargi, @julienlim, @japplewhite

Work Items:

TBD

Estimate:

TBD

Dependencies:

TBD

Testing:

Test whether the health, status, and metrics displayed for a given cluster are correct, and that the information stays up to date as failures or other cluster changes are observed.
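One way to phrase such a check is sketched below: a displayed value passes if it matches the expected value from the cluster and is recent enough. The `Reading` type, the helper name, and the 60-second freshness threshold are assumptions made for illustration; the spec does not define a staleness limit.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    """A value shown on the dashboard plus when it was last updated (hypothetical)."""
    value: str
    timestamp: float  # seconds since the epoch


def is_correct_and_fresh(
    displayed: Reading,
    expected_value: str,
    now: float,
    max_age_seconds: float = 60.0,  # assumed threshold, not from the spec
) -> bool:
    """True if the displayed value matches the expected one and is not stale."""
    age = now - displayed.timestamp
    return displayed.value == expected_value and 0 <= age <= max_age_seconds
```

For example, after inducing a failure, a test could poll the dashboard's data source until `is_correct_and_fresh(reading, "Unhealthy", now)` holds or a timeout expires.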

Documentation impact:

Documentation should explain what is being displayed wherever it is not immediately obvious from looking at the dashboard. This may include, but is not limited to, what each metric refers to, its measurement unit, and how to apply it when troubleshooting problems such as healing or split-brain issues, loss of quorum, etc.

References and Related GitHub Links:

julienlim commented 7 years ago

@sankarshanmukhopadhyay @brainfunked @r0h4n @nthomas-redhat @Tendrl/qe @Tendrl/tendrl_frontend @japplewhite @rghatvis@redhat.com @mcarrano

This dashboard proposal is ready for review. Note: API impact, module impact, etc. have to be filled out by someone else -- maybe @cloudbehl, @anmolbabu, or @anivargi.

Suggested Labels (for folks who have permissions to label the spec):

nthomas-redhat commented 7 years ago

Row 1, Panel (Dashboard Widget) 1: Health - Cluster Health. Cluster status will have only two values, Healthy or Unhealthy. This is in line with what gstatus does, and we would like to stick with the same.

Panel 3: Volumes. Volumes have the states up(partial) and up(degraded) as well.

Panel 5: Disks. There is no platform support for disk status as such, so this won't be supported for now.

Panel 7: Geo-replication Sessions. What's the difference between active and up? What we are planning to support now is: up, down, up(partial).

Row 2, Panel 13: Services Trend. Can we get some clarity around this? Is it part of the MVP?

julienlim commented 7 years ago

@nthomas-redhat @japplewhite @jjkabrown1 @Tendrl/qe @Tendrl/tendrl-core @anmolbabu @cloudbehl @anivargi

I've addressed and updated Panels 1, 3, 5, and 7 per @nthomas-redhat's comments. They should align with https://github.com/gluster/gstatus.

For geo-rep, I was following the Gluster metrics document we had previously, but I have updated it to match what we now plan to support.

For Panel 13 (services trend), I raised this a few times in BLR, and I'm suggesting it to have parity with the old Console. This was the only one we didn't address. The use scenario is that there's no easy way today for Admins to know whether their services/daemons have died or are still ok, and this is a means of monitoring their health. I will defer to @japplewhite on whether this is part of the MVP.

julienlim commented 7 years ago

@nthomas-redhat @japplewhite @jjkabrown1 @Tendrl/qe @Tendrl/tendrl-core @anmolbabu @cloudbehl @anivargi @mcarrano asrivast@redhat.com

I've put a very rough mockup together to show what the cluster dashboard might look like:

[Image: grafana dashboard - cluster]

julienlim commented 6 years ago

Noting that geo-rep session status changes are planned per https://github.com/Tendrl/gluster-integration/issues/459.

julienlim commented 6 years ago

Updated the Geo-replication Sessions panel per the geo-rep session status changes.

@shtripat @nthomas-redhat @cloudbehl @Tendrl/tendrl-qe @mcarrano

r0h4n commented 6 years ago

Closing this one. Please file a new issue with relevant context if anything is missing.