julienlim commented 7 years ago

Dashboard Spec - Brick Dashboard

Display a default dashboard for a Gluster brick present in Tendrl that provides at-a-glance information about a single Gluster brick, including health and status information, key performance indicators (e.g. IOPS, throughput, etc.), and alerts that draw the Tendrl user's (e.g. Gluster Administrator's) attention to potential issues in the brick and its underlying disk(s).

Problem description

A Gluster Administrator wants to be able to answer the following questions by looking at the brick dashboard:

Is my brick up and running, is it healthy?
Is there a problem with my brick?
What’s actually wrong with my brick, and why is it slow?
Is my brick filling up too fast?
When will my brick run out of capacity?
If something is down / broken / failed (e.g. brick down, disk failure, etc.), where and what is the issue, and when did it happen?
What is the impact to the volume or processes (e.g. healing, rebalancing) when the brick is down or offline?
Has the number of clients (indicated via connections) increased, which may be the reason for the performance degradation that the clients / applications are observing?

Use Cases

Use cases in the form of user stories:

As a Gluster Administrator, I want to view at-a-glance information about my Gluster brick that includes health and status information, key performance indicators (e.g. IOPS, throughput, latency, etc.), and alerts that draw my attention to potential issues in the brick and underlying disks.
Look at performance by brick to help diagnose poor performance on one brick caused by RAID 6 disk failure, rebuild, or degradation.

Proposed change

Provide a pre-canned, default brick dashboard in Grafana (initially launchable from the Tendrl UI, and eventually embedded in the Tendrl UI) that shows the metrics below, rendered either as text or as a chart/graph depending on the type of metric:

The Dashboard is composed of individual Panels (dashboard widgets) arranged on a number of Rows.

Note: The cluster, host, and brick should be visible at all times, and the user should be able to switch to another host + brick combination.

Row 1

Panel 1: Health

show brick status, i.e. Up, Down, Unknown
the color of panel should be green when OK, red when down, yellow when unknown
chart type: Singlestat (see http://docs.grafana.org/features/panels/singlestat/ for further information)
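
The status-to-color rule for the Health panel can be sketched as follows. This is only an illustration under assumptions: the names `BrickStatus` and `health_panel_color` are hypothetical, and the spec does not say how the status value actually reaches Grafana. Any status string that is not recognizably Up or Down is treated as Unknown here.

```python
from enum import Enum


class BrickStatus(Enum):
    UP = "Up"
    DOWN = "Down"
    UNKNOWN = "Unknown"


# Panel colors from the spec: green when up, red when down, yellow when unknown.
STATUS_COLORS = {
    BrickStatus.UP: "green",
    BrickStatus.DOWN: "red",
    BrickStatus.UNKNOWN: "yellow",
}


def health_panel_color(raw_status: str) -> str:
    """Return the panel color for a raw brick status string.

    Hypothetical helper: anything other than Up or Down maps to Unknown.
    """
    normalized = raw_status.strip().lower()
    for status in (BrickStatus.UP, BrickStatus.DOWN):
        if normalized == status.value.lower():
            return STATUS_COLORS[status]
    return STATUS_COLORS[BrickStatus.UNKNOWN]
```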

Panel 2: Connections Trend

count (n) of client connections to the bricks in the volume over a period of time
chart type: Line Chart / Spark

[FUTURE] Panel 4: Disks

n total - total number (n) of disks in the brick
n up - count (n) of disks in the brick that are up
n down - count (n) of disks in the brick that are down
chart type: Stacked Card

Panel 4: Capacity Utilization

Disk space used for the brick
chart type: Gauge

Panel 5: Capacity Available

Disk space free for the brick
chart type: Singlestat
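
For reference, the used and free figures behind the two capacity panels could be derived on the storage node roughly as below. This is a sketch that assumes the brick is a mounted directory and reads it with Python's standard `os.statvfs`; the helper name `brick_capacity` is hypothetical, and how Tendrl actually collects these values is not specified here.

```python
import os
from typing import NamedTuple


class BrickCapacity(NamedTuple):
    total_bytes: int
    used_bytes: int
    free_bytes: int
    used_percent: float


def brick_capacity(brick_path: str) -> BrickCapacity:
    """Read capacity figures for the filesystem that holds brick_path."""
    st = os.statvfs(brick_path)
    total = st.f_blocks * st.f_frsize
    # f_bavail is the space available to unprivileged users.
    free = st.f_bavail * st.f_frsize
    used = (st.f_blocks - st.f_bfree) * st.f_frsize
    # Same convention as df: percent used = used / (used + available).
    denominator = used + free
    used_percent = 100.0 * used / denominator if denominator else 0.0
    return BrickCapacity(total, used, free, used_percent)
```

The Gauge panel would show `used_percent`, and the Singlestat panel would show `free_bytes` in a human-readable unit.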

Panel 6: Growth Rate

growth rate estimated from the first and last data points of the selected time period
chart type: Singlestat
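
A minimal sketch of the two-point estimate described for this panel, assuming capacity samples are available as (timestamp, used bytes) pairs. The function name and the choice of bytes per week as the unit are illustrative assumptions, not part of the spec.

```python
from datetime import datetime, timedelta
from typing import Sequence, Tuple

Sample = Tuple[datetime, float]  # (timestamp, used bytes)


def growth_rate_per_week(samples: Sequence[Sample]) -> float:
    """Estimate growth in bytes per week from the first and last samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    elapsed_weeks = (t1 - t0) / timedelta(weeks=1)
    if elapsed_weeks <= 0:
        raise ValueError("samples must be in increasing time order")
    return (used1 - used0) / elapsed_weeks
```

Because only the endpoints are used, a short spike or a cleanup in the middle of the period has no effect on the result. A negative value means usage shrank over the period.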

Panel 7: Time Remaining (Weeks)

based on the projected growth rate in Panel 6, provide estimated # of weeks remaining
chart type: Singlestat
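
The projection for this panel amounts to dividing the free capacity by the growth rate. A sketch under assumptions: the function name is hypothetical, both inputs use the same capacity unit, and returning `None` when usage is flat or shrinking is one possible convention (the spec does not say what to display in that case).

```python
from typing import Optional


def weeks_remaining(free_bytes: float, growth_bytes_per_week: float) -> Optional[float]:
    """Estimate the weeks until the brick is full at the current growth rate.

    Returns None when usage is not growing, since no fill date can be projected.
    """
    if growth_bytes_per_week <= 0:
        return None
    return free_bytes / growth_bytes_per_week
```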

Row 2

Panel 8: IOPS Trend

show the IOPS for the brick over a period of time
chart type: Line Chart / Spark

[FUTURE] Panel 9: IO Size

show IO Size
chart type: Singlestat

Panel 10: Inodes Utilization

Inodes used for the brick over a period of time
chart type: Line Chart / Spark

Panel 11: Inodes Available

Inodes free for the brick
chart type: Singlestat

Panel 12: LVM thin pool metadata %

LVM thin pool metadata %
infotip: Monitoring the utilization of LVM thin pool metadata and data usage is important to ensure they don't run out of space. If the data space is exhausted then, based on the configuration, I/O operations are either queued or failed. If metadata space is exhausted, you will observe I/O errors until the LVM pool is taken offline and a repair is performed to fix potential inconsistencies. Moreover, because the metadata transaction is aborted and the pool does caching, there might be uncommitted (to disk) I/O operations that were acknowledged to the upper storage layers (file system), so those layers will need to have checks/repairs performed as well.
chart type: Line Chart / Spark

Panel 13: LVM thin pool data usage %

LVM thin pool data usage %
infotip: Monitoring the utilization of LVM thin pool metadata and data usage is important to ensure they don't run out of space. If the data space is exhausted then, based on the configuration, I/O operations are either queued or failed. If metadata space is exhausted, you will observe I/O errors until the LVM pool is taken offline and a repair is performed to fix potential inconsistencies. Moreover, because the metadata transaction is aborted and the pool does caching, there might be uncommitted (to disk) I/O operations that were acknowledged to the upper storage layers (file system), so those layers will need to have checks/repairs performed as well.
chart type: Line Chart / Spark
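
One way the two thin pool percentages could be gathered is by parsing `lvs` report output. The sketch below assumes the report was produced with something like `lvs --noheadings --separator '|' -o lv_name,data_percent,metadata_percent`, and that percentages use a dot as the decimal separator. The parser name and the sample text are illustrative; the actual collection mechanism in Tendrl is not specified here.

```python
from typing import Dict, NamedTuple


class ThinPoolUsage(NamedTuple):
    data_percent: float
    metadata_percent: float


def parse_thin_pool_usage(lvs_output: str) -> Dict[str, ThinPoolUsage]:
    """Parse 'name|data%|metadata%' lines, keeping only LVs that report both values."""
    pools: Dict[str, ThinPoolUsage] = {}
    for line in lvs_output.splitlines():
        fields = [field.strip() for field in line.split("|")]
        # Skip blank lines and LVs that lack either percentage
        # (for example, regular LVs or thin volumes).
        if len(fields) != 3 or not fields[1] or not fields[2]:
            continue
        name, data, metadata = fields
        pools[name] = ThinPoolUsage(float(data), float(metadata))
    return pools


# Illustrative sample, not real output from any cluster.
sample = "  pool0|42.17|3.05\n  root||\n  thinvol|10.00|\n"
usage = parse_thin_pool_usage(sample)
```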

Note: The layout of the rows, and of the panels within them, may need to change based on implementation and actual visualization, especially when certain metrics need to be aligned together vertically or horizontally.

Alternatives

Create similar dashboard using PatternFly (www.patternfly.org) or d3.js components to show similar information within the Tendrl UI.

Data model impact:

TBD

Impacted Modules:

TBD

Tendrl API impact:

TBD

Notifications/Monitoring impact:

TBD

Tendrl/common impact:

TBD

Tendrl/node_agent impact:

TBD

Sds integration impact:

TBD

Security impact:

TBD

Other end user impact:

Users will mostly interact with this feature via the Grafana UI. Access via the Grafana API and Tendrl API is possible, but would require API calls to obtain similar information.

Performance impact:

TBD

Other deployer impact:

Plug-ins required by Grafana will need to be packaged and installed with tendrl-ansible.
This (default) brick dashboard will need to be automatically generated whenever a cluster is imported to be managed by Tendrl.

Developer impact:

TBD

Implementation:

TBD

Assignee(s):

Primary assignee: @cloudbehl

Other contributors: @anmolbabu, @anivargi, @julienlim, @japplewhite

Work Items:

TBD

Estimate:

TBD

Dependencies:

TBD

Testing:

Test whether the health, status, and metrics displayed for a given brick are correct and that the information is up to date as failures or other changes are observed on that brick.

Documentation impact:

Documentation should include information about what is being displayed, with explanations where this is not immediately obvious from looking at the dashboard. This may include, but is not limited to, what the metrics refer to, the measurement unit, and how to use or apply them to troubleshooting problems, e.g. healing / split brain issues, loss of quorum, etc.

References and Related GitHub Links:

Gluster Metrics (https://github.com/Tendrl/documentation/wiki/Metrics)
Gluster metrics (https://github.com/Tendrl/specifications/issues/188)
Initial onboarding experience for user accessing Tendrl UI (https://github.com/Tendrl/specifications/issues/200)
Drill-down navigation in grafana dashboard (https://github.com/Tendrl/specifications/issues/189)
Use Grafana for Tendrl monitoring (https://github.com/Tendrl/specifications/issues/168)
Build package scripts for tendrl-monitoring-integration (https://github.com/Tendrl/specifications/issues/178)

julienlim commented 7 years ago

@sankarshanmukhopadhyay @brainfunked @r0h4n @nthomas-redhat @Tendrl/qe @Tendrl/tendrl_frontend @japplewhite @rghatvis@redhat.com @mcarrano

This dashboard proposal is ready for review. Note: API impact, module impact, etc. have to be filled out by someone else -- maybe @cloudbehl, @anmolbabu, or @anivargi.

Suggested Labels (for folks who have permissions to label the spec):

FEATURE:Monitoring
INTERFACE:Dashboard
INTERFACE:GUI

nthomas-redhat commented 7 years ago

Row 1

Panel 4: Disks
No platform support for disk status as such. This won't be supported now.

Panel 6: Growth Rate
Panel 7: Time Remaining (Weeks)
Does this really make any sense to display at the brick level? MVP just talks about the projections at volume level only.

Row 2

Panel 9: IO Size
Not MVP

julienlim commented 7 years ago

@nthomas-redhat I've updated and marked Panel 4 (disks) and Panel 9 (IO size) as FUTURE.

For Panel 6 & 7 (Growth Rate and Time Remaining), it's a valid question whether or not it makes sense to project this for a brick, and I would say yes. Per chatting with Alok, there are several users who have single bricks on a single node within the Gluster cluster. While you can see this by host and volume, seeing it by brick can also be valuable (though admittedly redundant if you're doing it by host, except with the host, it includes the boot disk and other disks not used for Gluster volumes).

In addition, if we calculate growth and time remaining easily for the others, replicating here is trivial. So I think it should remain in the dashboard.

julienlim commented 7 years ago

@sankarshanmukhopadhyay @brainfunked @r0h4n @nthomas-redhat @Tendrl/qe @Tendrl/tendrl_frontend @japplewhite @rghatvis@redhat.com @mcarrano

Here's a rough mockup of the proposed brick dashboard:

[Image: grafana dashboard - brick]

julienlim commented 6 years ago

Noting some additional panels added -- see Bricks dashboard: Unclear what "Utilization" panel is showing.

r0h4n commented 6 years ago

Closing this one, please open new issue with relevant context if anything is missing
