Mock-up links:

Gluster dashboard Ceph dashboard

Observations:

Missing data-points:

1 Ceph dashboard:

Clusters Quorum lost counter
Storage profile information
Thresholds on cluster utilization and hence near full indication in clusters card.
Alerts for cluster entities, as the respective bridges don't support eventing yet.
Listing pools and rbds by "Busiest" and "Highest Latency"

2 Gluster dashboard:

Clusters Quorum lost counter
No. of files
Clients section
Bricks card completely (no utilization, IO and latency information)
Listing File shares by IO (busiest) and Latency
Service card -- only glusterd is available (and how do we aggregate these node-level service statuses into a main-dashboard-level service status?)

Approach to be taken for fetching data points in dashboard

1. Each card makes a separate api call to fetch its specific information.

For example: for the counters, the UI needs to invoke the respective listing APIs, parse the lists and form the counters. Likewise, for utilizations, the UI needs to fetch the utilizations from the listing API responses and total them to get the overall utilization. For the top 5 most used entities (pools and RBDs in the ceph dashboard, File Shares in the gluster dashboard), the listing API needs to support sorting of the response.

Advantage:

The different cards in the UI can then have independent refresh intervals, refreshing the less frequently changing data less often.
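
As a rough illustration of the client-side work approach 1 implies, the sketch below builds status counters, totals the utilization and picks the top 5 entries from a listing response. The response shape (`name`, `status`, `used`, `total` keys) and the function name are assumptions made for illustration, not the actual Tendrl API schema.

```python
from collections import Counter


def summarize_listing(items, top_n=5):
    """Build dashboard figures from a listing response.

    `items` is assumed to be a list of dicts with hypothetical keys
    `name`, `status`, `used` and `total` (same unit for both sizes).
    """
    status_counts = Counter(item["status"] for item in items)
    used = sum(item["used"] for item in items)
    total = sum(item["total"] for item in items)
    percent_used = round(100.0 * used / total, 2) if total else 0.0

    def fraction_used(item):
        return item["used"] / item["total"] if item["total"] else 0.0

    # Without server-side sorting, the caller has to sort the whole list.
    top = sorted(items, key=fraction_used, reverse=True)[:top_n]
    return {
        "counts": dict(status_counts),
        "used": used,
        "total": total,
        "percent_used": percent_used,
        "top": [item["name"] for item in top],
    }
```

If the listing API supported sorting and limiting, the `sorted(...)` step would move to the server and the UI would only need the first page of results.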

2. performance-monitoring exposes a single point api that provides all dashboard specific data(only point in time stats and counters) in one query.

Advantage:

The UI then needs to make only 2 queries: one for the utilization time series data and the other to the API exposed by performance-monitoring, which provides everything else.

Disadvantage:

Monitoring is an optional stack. If it is not installed, the dashboard is either blank or needs its own way of fetching whatever is possible without monitoring, which means approach 1 above still needs to be implemented.

@Tendrl/tendrl-core Please provide your suggestions

@anmolbabu Wouldn't the disadvantage in the second approach apply to the first as well? The real downside of the second approach would be the lack of ability to fetch different data sets at different intervals.

In any case, I don't think it would be a bad idea to implement unified API calls, per object, that return all the monitoring data available for a specific cluster, host or a cluster object. Does this sound feasible?

@brainfunked The disadvantage of the 2nd approach doesn't apply to the first, because the 2nd approach only aggregates data that is already in etcd whether or not the monitoring stack is present. The only thing missing without the monitoring stack and in case of approach 1 is the time series data (overall utilization trending graph); everything else is in etcd and only needs to be aggregated. I see the difference b/w the 2 approaches as the share of responsibility b/w UI and monitoring. So yes, approach 2 is feasible, but if the monitoring stack is not present, approach 2 effectively either becomes approach 1 or the dashboard is blank. So, please suggest which module should implement this unified API: performance-monitoring as in approach 2, or tendrl-api (which I think might be the best place, as it would then serve both cases, whether monitoring is enabled or not).

Ceph Cluster Dashboard data-points:

Mock up link:

Ceph cluster dashboard

Data-Points with their currently available sources or work that needs to be done to implement it are as under:

Utilization card:

This card contains the following details
- Cluster utilization:
- Available: This data is already made available in etcd @ /clusters/{cluster-id}/Utilization by ceph-integration. It is also made available as part of the GetClusterList api.
- Cluster Utilization by storage profile:
- Work involved:
- Implement the concept of storage profiles in backend that provides a way of grouping disks under different buckets
- OSD utilization needs to be implemented.
- The storage profile utilization for a cluster is then the sum of the utilizations of all OSDs under that storage profile (assumption needs validation).
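
The summation described above could look roughly like the sketch below. Neither storage profiles nor OSD utilization exist in the backend yet, so the input shape (`profile`, `used`, `total` keys) is entirely hypothetical.

```python
from collections import defaultdict


def utilization_by_profile(osds):
    """Sum OSD usage per storage profile.

    `osds` is assumed to be an iterable of dicts with hypothetical keys
    `profile`, `used` and `total` (same unit for both sizes).
    """
    totals = defaultdict(lambda: {"used": 0, "total": 0})
    for osd in osds:
        bucket = totals[osd["profile"]]
        bucket["used"] += osd["used"]
        bucket["total"] += osd["total"]

    result = {}
    for profile, sizes in totals.items():
        percent = 100.0 * sizes["used"] / sizes["total"] if sizes["total"] else 0.0
        result[profile] = {
            "used": sizes["used"],
            "total": sizes["total"],
            "percent_used": round(percent, 2),
        }
    return result
```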
Host status wise counts:

Details This card contains the no. of hosts, the no. of hosts that are down, and the no. of alerts on these hosts.
- Available:
- The nodes are already maintained in etcd @ '/nodes'. This is also exposed via GetNodeList api
- For the number of critical or minor alerts, we already have the critical and warning threshold breaches for node-specific resources like cpu, memory, swap and mount point utilizations.
- Work involved:
- The total no. of nodes is a count of the nodes listed above.
- Add a counter of the no. of alerts for a node.
- Host status needs to be implemented in the backend, probably as some kind of heartbeat between node-agents.
Monitors status wise counts:

Details
- Available:
- List of mons are maintained in etcd(/clusters/{cluster-id}/maps/mon_map/data/mons)
- List of mons out of quorum are maintained in etcd @ /clusters/{cluster-id}/maps/mon_status/data/outside_quorum.
- Work involved:
- Count number of mons in etcd.
- The count of mons that are down is then the count of mons out of quorum (assumption needs validation).
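
A minimal sketch of these two counts, assuming the data under mon_map/data/mons is a list of dicts with a `name` field and the data under mon_status/data/outside_quorum is a list of monitor names. Treating "outside quorum" as "down" is the unvalidated assumption noted above.

```python
def mon_counts(mons, outside_quorum):
    """Return total, down and up monitor counts.

    `mons` is assumed to be the list stored under mon_map/data/mons and
    `outside_quorum` the list of names under mon_status/data/outside_quorum.
    """
    names = {mon["name"] for mon in mons}
    # Only count names that are actually part of the monitor map.
    down = names & set(outside_quorum)
    return {"total": len(names), "down": len(down), "up": len(names) - len(down)}
```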
PGs status wise counts:

Details
- Available:
- PG count by status combinations(ex: active+remapped, active+clean, etc...) is stored in etcd @ /clusters/{cluster-id}/maps/pg_summary/data/all.
- PG count by pool is available at /clusters/{cluster-id}/Pools/{pool-id}/pg_num
- Work involved:
- Aggregate PG counts by pool across the cluster for getting pg total count for cluster dashboard.
- A way to analyse the count of PGs in error, warning(degraded) status from the counters by status combinations(ex: active+remapped, active+clean, etc...) needs to be known.
- @anivargi can you please confirm if this data is currently exposed via api
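
One possible way to fold the per-combination counters into error/warning/ok buckets is sketched below. The severity assigned to each PG state is a hypothetical mapping chosen for illustration; the real classification still needs to be agreed on.

```python
# Hypothetical severity buckets; not a confirmed mapping.
ERROR_STATES = {"down", "incomplete", "inconsistent", "stale"}
WARNING_STATES = {"degraded", "undersized", "remapped",
                  "recovering", "backfilling", "peering"}


def classify_pg_counts(counts_by_status):
    """Fold counts keyed by status combination into ok/warning/error.

    `counts_by_status` is assumed to map strings such as
    "active+remapped" to the number of PGs in that combination.
    """
    summary = {"ok": 0, "warning": 0, "error": 0}
    for combination, count in counts_by_status.items():
        states = set(combination.split("+"))
        if states & ERROR_STATES:
            summary["error"] += count      # worst state wins
        elif states & WARNING_STATES:
            summary["warning"] += count
        else:
            summary["ok"] += count
    summary["total"] = sum(counts_by_status.values())
    return summary
```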
Osds status wise counts:

Details
- Available:
- '/clusters/{cluster-id}/maps/osd_map/data' in etcd provides the osd statuses and also other details about the osds.
- Work involved:
- Counter of number of osds in above etcd path needs to be calculated.
- Counter of number of osds by status maintained at above path in etcd needs to be calculated.
- @anivargi Please confirm if this data is currently exposed via api
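
The two OSD counters could be derived as in the sketch below. It assumes the stored data mirrors the JSON OSD map, i.e. an `osds` list whose entries carry integer `up` and `in` flags; the layout actually stored in etcd may differ.

```python
def osd_counts(osd_map):
    """Count OSDs in total and by up/down and in/out status.

    Assumes `osd_map` has an `osds` list of dicts with `up` and `in`
    flags (1 or 0).
    """
    osds = osd_map.get("osds", [])
    up = sum(1 for osd in osds if osd.get("up"))
    inside = sum(1 for osd in osds if osd.get("in"))
    return {
        "total": len(osds),
        "up": up,
        "down": len(osds) - up,
        "in": inside,
        "out": len(osds) - inside,
    }
```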
Pools:

Details
- Available:
- Pool utilizations are made available as part of /clusters/{cluster-id}/Pools/{pool-id} in etcd and are also made available as part of the api to get cluster details
- Pool busiest(IOPS) is available as part of a collectd plugin @ https://github.com/rochaporto/collectd-ceph/blob/master/plugins/ceph_pool_plugin.py
- Work involved:
- Sort pools in the list by utilization and pick the top 5 among them to display in dashboard. @anivargi Please confirm if there is an api for this.
- The framework for configuring the ceph specific collectd plugin @ https://github.com/rochaporto/collectd-ceph/blob/master/plugins/ceph_pool_plugin.py needs to be built into the performance-monitoring application. The performance-monitoring application also needs to sync this data to a location in etcd at regular intervals.
- Source for pool latency needs to be found out. Note: https://github.com/rochaporto/collectd-ceph provides cluster latency and the approach they use for this can be found @ https://github.com/rochaporto/collectd-ceph/blob/master/plugins/ceph_latency_plugin.py#L22
RBDs:

Details
- Available:
- Rbd utilization is made available as part of GetClusterList api.
- Work involved:
- Source for RBD iops and latency information needs to be found out.
- Sort RBDs by utilization and pick the top 5 of them for display in dashboard.
System Performance:

Details
- Available:
- The cpu and memory utilization for each node is made available in etcd and the same is exposed via an api by the performance-monitoring module.
- The above mentioned api is also aliased to and exposed to external world by tendrl/api as part of GetNodeList api.
- Work involved:
- @anivargi Please confirm if the GetNodeList api is capable of listing nodes for the specified cluster or if there's an equivalent of this.
IO Trends:

Details
- Available:
- https://github.com/rochaporto/collectd-ceph provides cluster latency and the approach they use for this can be found @ https://github.com/rochaporto/collectd-ceph/blob/master/plugins/ceph_latency_plugin.py#L22 Note: This approach might need to be discussed..
- Collectd provides per disk iops.
- Work Involved:
- A way of getting iops @ cluster level needs to be discussed as whether it is
  - average/summation of iops of every disk of every node in the cluster
  - average/summation of iops of the disks backing the osds on every osd node in the cluster (if we decide to go with this approach, the backend needs to provide the osd to disk mapping, which I think is not there currently)
  - a measure provided by ceph readily(if this is the choice, source for such info needs to be found)
- If we decide to go with the latency plugin as mentioned in https://github.com/rochaporto/collectd-ceph/blob/master/plugins/ceph_latency_plugin.py, a way to configure this from performance-monitoring needs to be built.
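
To make the first two options concrete, the sketch below computes both the sum and the average of per-disk IOPS, optionally restricted to the disks that back OSDs. The `(node, disk)` keys and the `osd_disks` mapping are hypothetical; as noted above, the backend does not provide an osd to disk mapping today.

```python
def cluster_iops(per_disk_iops, osd_disks=None):
    """Aggregate per-disk IOPS to cluster level.

    `per_disk_iops` maps hypothetical (node, disk) tuples to an IOPS value.
    `osd_disks`, if given, is a set of (node, disk) tuples backing OSDs and
    restricts the aggregation to those disks.
    """
    values = [
        iops for key, iops in per_disk_iops.items()
        if osd_disks is None or key in osd_disks
    ]
    total = sum(values)
    average = total / len(values) if values else 0.0
    return {"sum": total, "average": average}
```

Returning both figures leaves the average-versus-summation decision to whoever renders the card.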
Throughput Trends:

Details This provides the throughputs of cluster/storage network and replication network and client access heartbeat network. Need more info on this and then we can evaluate the sources of information for this. Note: There is a slight mismatch of this as in https://redhat.invisionapp.com/share/589XIRJBW#/screens/213318455 and its details as in https://redhat.invisionapp.com/share/589XIRJBW#/screens/214068233

Note:

Apart from the points mentioned under "Work Involved", all these data then need to be exposed via api

Gluster Cluster Dashboard data-points:

Mock up link:

Gluster cluster dashboard

Data-Points with their currently available sources or work that needs to be done to implement it are as under:

Utilization card:

This card contains the following details
- Cluster utilization:
- Available: This data is already made available in etcd @ /clusters/{cluster-id}/Utilization by gluster-integration. It is also made available as part of the GetClusterList api.
Host status wise counts:

Details This card contains the no. of hosts, the no. of hosts that are down, and the no. of alerts on these hosts.
- Available:
- The nodes are already maintained in etcd @ '/nodes'. This is also exposed via GetNodeList api
- For the number of critical or minor alerts, we already have the critical and warning threshold breaches for node-specific resources like cpu, memory, swap and mount point utilizations.
- Work involved:
- The total no. of nodes is a count of the nodes listed above.
- Add a counter of the no. of alerts for a node.
- Host status needs to be implemented in the backend, probably as some kind of heartbeat between node-agents.
Files

Details
- Work involved:
- This is not there in the backend currently and the source of this information needs to be evaluated.
Services:

Details
- Available:
- /nodes/{node-id}/Service in etcd provides service details of glusterd, tendrl-gluster-integration, tendrl-node-agent, etcd, tendrl-apid
- Work involved:
- Services smbd, nfs and nfs-ganesha in the mock up are not monitored currently.
- A way to aggregate this per-node service status @ cluster level needs to be found, i.e., if these services are running on all nodes of the cluster, a green tick can be displayed as shown in the mock up. But if some of them are down on only some of the nodes, will it be counters indicating the number of nodes on which a particular service is up or down? This needs further evaluation.
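
One candidate aggregation rule is sketched below: a service is "ok" only if it runs on every node, "down" if it runs nowhere, and "partial" otherwise, with the up/down counters returned alongside. The boolean per-node encoding and the three state names are assumptions, not something the backend provides today.

```python
def aggregate_service(status_by_node):
    """Roll the per-node status of one service up to cluster level.

    `status_by_node` maps node id to True (running) or False (not running).
    """
    up = sum(1 for running in status_by_node.values() if running)
    down = len(status_by_node) - up
    if down == 0:
        state = "ok"        # green tick in the mock up
    elif up == 0:
        state = "down"
    else:
        state = "partial"   # show the up/down counters instead
    return {"state": state, "up": up, "down": down}
```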
Clients:

Details
- Work involved:
- As of now nothing related to this is available in backend.
File Shares:

Details
- Available:
- Volume utilizations are made available as part of /clusters/{cluster-id}/Volumes/{volume-id} in etcd and is exposed as part of GetClusterList
- Per node per disk iops are currently available in graphite from collectd.
- Work involved:
- Sort volumes in the list by utilization and pick the top 5 among them to display in dashboard.
- Source for volume latency needs to be found out.
- Source for volume iops needs to be found out.
Bricks:

Details
- Work involved:
- As of now nothing related to brick utilization, brick iops or brick latency is available. How to fetch/calculate this needs to be found out.
System Performance:

Details
- Available:
- The cpu and memory utilization for each node is made available in etcd and the same is exposed via an api by the performance-monitoring module.
- The above mentioned api is also aliased to and exposed to external world by tendrl/api as part of GetNodeList api.
- Work involved:
- @anivargi Please confirm if the GetNodeList api is capable of listing nodes for the specified cluster or if there's an equivalent of this.
IO Trends:

Details
- Work Involved:
- A way of getting iops @ cluster level needs to be discussed as whether it is
  - average/summation of iops of every disk of every node in the cluster
  - a measure provided by gluster readily(if this is the choice, source for such info needs to be found)
- Even the meaning and source of latency under "IO Trends" card of gluster cluster dashboard needs to be found out.
Throughput Trends:

Details This provides the throughputs of cluster/storage network and replication network and client access heartbeat network. Need more info on this and then we can evaluate the sources of information for this. Note: There is a slight mismatch of this as in https://redhat.invisionapp.com/share/589XIRJBW#/screens/213318455 and its details as in https://redhat.invisionapp.com/share/589XIRJBW#/screens/214068233

Note:

Apart from the points mentioned under "Work Involved", all these data then need to be exposed via api

@brainfunked @Tendrl/tendrl-core Please provide your inputs/suggestions on this

There is a problem with configuring the collectd plugins: configuring them on all nodes is overkill, and in ceph's case the commands for getting stats need to be executed only on mons. So an ideal approach would be to select one node from the group of eligible nodes (mons in ceph's case, all nodes in case of a gluster cluster), so that instead of all/some nodes pushing the same data to the time series db (graphite), only one node updates graphite. But the problem here is what happens if the currently configured node goes down, and how we configure some other node in such a case.
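
One way to handle the failover question is to make the choice a pure function of the eligible and currently alive nodes, so it can simply be re-evaluated whenever liveness changes. The sketch below is only an illustration: both inputs are hypothetical, since host liveness is not tracked in the backend yet.

```python
def pick_collector(candidates, alive):
    """Pick the node that should run the cluster-level collectd plugin.

    `candidates` are the eligible node ids (mons for ceph, all nodes for
    gluster) and `alive` the ids currently considered up.
    """
    eligible = sorted(set(candidates) & set(alive))
    # Sorting makes the choice deterministic, so every observer agrees.
    return eligible[0] if eligible else None
```

If the returned node differs from the one currently configured, the plugin would be configured on the new node; how the old node is deconfigured once it comes back is left open here.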

CEPH MAIN DASHBOARD Data-points

Mock up link:

Ceph main dashboard

Data-Points with their currently available sources or work that needs to be done to implement it are as under:

Utilization card:

This card contains the following details
- Cluster utilization:
- Available:
  - This data is already made available in etcd @ /clusters/{cluster-id}/Utilization by ceph-integration.
  - This is also made available as part of GetClusterList api.
  - PR https://github.com/Tendrl/performance-monitoring/pull/65 now adds cluster utilization time series graph
- Work Involved:
  - Aggregate the individual ceph cluster utilizations, maintain this aggregate in etcd and push it to graphite periodically.
Clusters card:

Details This card contains the following:
- No. of clusters down
- No. of major/critical alerts of all clusters.
- No. of minor active alerts of all clusters
- No. of clusters that have lost quorum
- Available
- Cluster status is maintained in etcd @ /clusters/{cluster-id}/GlobalDetails/status
- With https://github.com/Tendrl/performance-monitoring/pull/65 and https://github.com/Tendrl/alerting/pull/39 the cluster specific threshold breach alerts (cluster, osd, pool, volume utilization warning and critical threshold breach alerts) are maintained in etcd under '/alerting/clusters/{cluster-id}/'.
- Work involved:
- Counter of number of alerts under all ceph clusters needs to be found.
- Counter of number of clusters by status (total and down)
- No. of clusters out of quorum (for each cluster, the mons out of quorum are maintained in etcd @ /clusters/{cluster-id}/maps/mon_status/data/outside_quorum. Is a cluster treated as out of quorum when at least one mon is out of quorum, or only when it does not have even 3 mons up?)
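
The two interpretations in the question above give different counters, as the sketch below shows. The input shape (`mons` and `outside_quorum` lists per cluster) and the `min_up` parameter are assumptions for illustration; which rule is correct is still the open question.

```python
def clusters_out_of_quorum(clusters, min_up=None):
    """Count clusters treated as out of quorum.

    `clusters` maps cluster id to a dict with hypothetical keys `mons`
    (all monitor names) and `outside_quorum` (names outside quorum).
    min_up=None -> a cluster counts if any monitor is outside quorum.
    min_up=N    -> a cluster counts if fewer than N monitors are in quorum.
    """
    lost = 0
    for info in clusters.values():
        out = len(info["outside_quorum"])
        in_quorum = len(info["mons"]) - out
        if min_up is None:
            lost += 1 if out > 0 else 0
        else:
            lost += 1 if in_quorum < min_up else 0
    return lost
```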
Hosts Card:

Details This card contains the following
- No. of hosts down
- No. of major or critical alerts of all hosts
- No. of minor active alerts for all hosts
- Available
- Nodes are maintained as part of '/nodes' in etcd and are available as part of GetNodeList API.
- The critical and warning threshold breach alerts are maintained as part of '/alerting/nodes/{node-id}' for each node.
- Work involved:
- There is currently no way to monitor host status. Node-agents need to heartbeat to determine this, so the host down count is currently not available.
- For the critical and warning alert counters, the alerts under the above mentioned paths need to be counted for each host.
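
A minimal sketch of the alert counters, assuming the alerts read from '/alerting/nodes/{node-id}' have already been loaded into a dict of lists and that each alert carries a `severity` field with the values "critical" or "warning" (both are assumptions about the stored format).

```python
from collections import Counter


def count_alerts(alerts_by_node):
    """Count active alerts per severity across all hosts.

    `alerts_by_node` maps node id to a list of alert dicts, each with a
    hypothetical `severity` key.
    """
    counts = Counter()
    for alerts in alerts_by_node.values():
        counts.update(alert["severity"] for alert in alerts)
    return {"critical": counts["critical"], "warning": counts["warning"]}
```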
Monitors Card

Details This card contains the following:
- Number of mons down
- Number of critical and warning active alerts across all mons
- Available
- List of mons are maintained in etcd @ /clusters/{cluster-id}/maps/mon_map/data/mons
- List of mons out of quorum are maintained in etcd @ /clusters/{cluster-id}/maps/mon_status/data/outside_quorum
- List of critical and warning threshold breach alerts by node are maintained @ '/alerting/nodes/{node-id}' in etcd.
- Work involved:
- Count of alerts
- Count of mons and count of mons out of quorum (is the count of mons out of quorum the same as the count of mons down?)
PGs status wise counts:

Details
- Available:
- PG count by status combinations(ex: active+remapped, active+clean, etc...) is stored in etcd @ /clusters/{cluster-id}/maps/pg_summary/data/all.
- PG count by pool is available at /clusters/{cluster-id}/Pools/{pool-id}/pg_num
- Work involved:
- Aggregate PG counts by pool across clusters for getting pg total count for dashboard.
- A way to analyse the count of PGs in error, warning(degraded) status from the counters by status combinations(ex: active+remapped, active+clean, etc...) needs to be known.
Osds status wise counts:

Details
- Available:
- '/clusters/{cluster-id}/maps/osd_map/data' in etcd provides the osd statuses and also other details about the osds.
- Work involved:
- Counter of number of osds in above etcd path needs to be calculated.
- Counter of number of osds by status maintained at above path in etcd needs to be calculated.
Pools:

Details
- Available:
- Pool utilizations are made available as part of /clusters/{cluster-id}/Pools/{pool-id} in etcd and are also made available as part of the api to get cluster details
- Pool busiest(IOPS) is available as part of a collectd plugin @ https://github.com/rochaporto/collectd-ceph/blob/master/plugins/ceph_pool_plugin.py
- Work involved:
- Sort pools in the list by utilization and pick the top 5 among them to display in dashboard.
- The framework for configuring the ceph specific collectd plugin @ https://github.com/rochaporto/collectd-ceph/blob/master/plugins/ceph_pool_plugin.py needs to be built into the performance-monitoring application (the PR https://github.com/Tendrl/performance-monitoring/pull/69 does this). The performance-monitoring application also needs to sync this data to a location in etcd at regular intervals.
- Source for pool latency needs to be found out. Note: https://github.com/rochaporto/collectd-ceph provides cluster latency and the approach they use for this can be found @ https://github.com/rochaporto/collectd-ceph/blob/master/plugins/ceph_latency_plugin.py#L22
RBDs:

Details
- Available:
- Rbd utilization is made available as part of GetClusterList api.
- Work involved:
- Source for RBD iops and latency information needs to be found out.
- Sort RBDs by utilization and pick the top 5 of them for display in dashboard.

GLUSTER MAIN DASHBOARD Data-points

Mock up link:

Gluster main dashboard

Data-Points with their currently available sources or work that needs to be done to implement it are as under:

Clusters card:

Details This card contains the following:
- No. of clusters down
- No. of major/critical alerts of all clusters.
- No. of minor active alerts of all clusters
- No. of clusters that have lost quorum
- Available
- Cluster status is maintained in etcd @ /clusters/{cluster-id}/GlobalDetails/status
- With https://github.com/Tendrl/performance-monitoring/pull/65 and https://github.com/Tendrl/alerting/pull/39 the cluster specific threshold breach alerts (cluster and volume utilization warning and critical threshold breach alerts) are maintained in etcd under '/alerting/clusters/{cluster-id}/'.
- Work involved:
- Counter of number of alerts under all gluster clusters needs to be found.
- Counter of number of clusters by status (total and down)
- No. of clusters out of quorum (for each cluster, the peers out of the cluster are maintained in etcd @ /clusters/{cluster-id}/raw_map/data. Is a cluster treated as out of quorum when at least one peer is out, or only when it does not have even 3 peers up?)
Utilization card:

This card contains the following details
- Cluster utilization:
- Available:
  - This data is already made available in etcd @ /clusters/{cluster-id}/Utilization by gluster-integration.
  - This is also made available as part of GetClusterList api.
  - PR https://github.com/Tendrl/performance-monitoring/pull/65 now adds cluster utilization time series graph
- Work Involved:
  - Aggregate the individual gluster cluster utilizations, maintain this aggregate in etcd and push it to graphite periodically.
Hosts Card:

Details This card contains the following
- No. of hosts down
- No. of major or critical alerts of all hosts
- No. of minor active alerts for all hosts
- Available
- Nodes are maintained as part of '/nodes' in etcd and are available as part of GetNodeList API.
- The critical and warning threshold breach alerts are maintained as part of '/alerting/nodes/{node-id}' for each node.
- Work involved:
- There is currently no way to monitor host status. Node-agents need to heartbeat to determine this, so the host down count is currently not available.
- For the critical and warning alert counters, the alerts under the above mentioned paths need to be counted for each host.
Files

Details
- Work involved:
- This is not there in the backend currently. The source of this information needs to be evaluated, and gluster-integration needs to add it.
Services:

Details
- Available:
- /nodes/{node-id}/Service in etcd provides service details of glusterd, tendrl-gluster-integration, tendrl-node-agent, etcd, tendrl-apid
- Work involved:
- Services smbd, nfs and nfs-ganesha in the mock up are not monitored currently.
- A way to aggregate this per-node service status @ system level needs to be found, i.e., if these services are running on all nodes of the cluster, a green tick can be displayed as shown in the mock up. But if some of them are down on only some of the nodes, will it be counters indicating the number of nodes on which a particular service is up or down? This needs further evaluation.
Clients:

Details
- Work involved:
- As of now nothing related to this is available in backend. This needs to be added by gluster-integration
File Shares:

Details
- Available:
- Volume utilizations are made available as part of /clusters/{cluster-id}/Volumes/{volume-id} in etcd and is exposed as part of GetClusterList
- Per node per disk iops are currently available in graphite from collectd.
- Work involved:
- Sort volumes in the list by utilization and pick the top 5 among them to display in dashboard.
- Source for volume latency needs to be found out.
- Source for volume iops needs to be found out.
Bricks:

Details
- Work involved:
- As of now nothing related to brick utilization, brick iops or brick latency is available. How to fetch/calculate this needs to be found out.
IO Trends:

Details
- Work Involved:
- A way of getting iops @ system level needs to be discussed as whether it is
  - average/summation of iops of every disk of every node of every gluster cluster
  - a measure provided by gluster readily for a cluster(if this is the choice, source for such info needs to be found) and then how it is aggregated across all gluster clusters needs to be evaluated.
- Even the meaning and source of latency under "IO Trends" card of gluster cluster dashboard needs to be found out.
Throughput Trends:

Details This provides the throughputs of cluster/storage network and replication network and client access heartbeat network. Need more info on this and then we can evaluate the sources of information for this. Note: There is a slight mismatch of this as in https://redhat.invisionapp.com/share/589XIRJBW#/screens/213318455 and its details as in https://redhat.invisionapp.com/share/589XIRJBW#/screens/214068233

Gluster HOST Dashboard data-points:

Mock up link:

Gluster Host dashboard

Data-Points with their currently available sources or work that needs to be done to implement it are as under:

Summary card:

This card contains the following details:
- Name
- Status
- Cluster
- SELinux Mode
- Available:
  - Name and cluster details are readily available in etcd from /nodes/{node-id}/NodeContext/fqdn and /nodes/{node-id}/DetectedCluster/detected_cluster_id respectively. This might also be made available as part of GetNodeList api(Need confirmation).
- Work involved:
- Status of node -- node-agent heartbeat? It's not implemented in the backend yet.
- SELinux mode -- not available in the backend yet
Utilization card

This card contains:
- CPU utilization
- Memory Utilization
- Swap Utilization
- Storage utilization
- Available:
- Point in time data (utilizations of cpu, memory and storage) is available using http://:5000/monitoring/nodes/summary?node_ids=['']
- Graphs for utilizations can be obtained using http://:5000/monitoring/nodes///stats where resource_name is one of the following: cpu -> "cpu.percent-user", memory -> "memory.percent-used", swap -> "swap.percent-used"
- Work involved:
- Add storage utilization graph -- 1 day
- Sync swap utilization to node summary -- 0.5 day
Bricks card:
- Available:
- Bricks are maintained as part of /clusters/{cluster-id}/raw_map, which also provides the hostname of the node the brick is carved out of.
- Brick status is also available in same location
- Work involved:
- Counter of number of bricks node-wise.. As part of node-summary(1 day) - ???
- Brick status wise counts(per node).
Services card:
- Available:
- /nodes/{node-id}/Service provides node-wise services status like tendrl-gluster-integration, tendrl-ceph-integration, tendrl-apid, glusterd, ceph-mon, ceph-osd, tendrl-node-agent and etcd.
- Work involved:
- smbd, nfs and nfs-ganesha, which are shown in the mock-up, are not available
Osds status wise counts:

Details
- Available:
- '/clusters/{cluster-id}/maps/osd_map/data' in etcd provides the osd statuses and also other details about the osds.
- Work involved:
- Counter of number of osds in above etcd path needs to be calculated.
- Counter of number of osds by status maintained at above path in etcd needs to be calculated.
- @anivargi Please confirm if this data is currently exposed via api
System Performance:
- Available:
- IOPS graphs per disk per node are available in graphite
- Collectd provides a ping plugin which can be easily configured for latency
- Work involved:
- What is the iops graph @ node level: the average of the iops of all disks in the node? (If yes, 2 days) -- ???
Network Trends:

Details This provides the throughputs of cluster network and public network. These are not maintained as part of the backend currently. Need more info on this and then we can evaluate the sources of information for this. Note: There is a slight mismatch of this as in https://redhat.invisionapp.com/share/589XIRJBW#/screens/213318455 and its details as in https://redhat.invisionapp.com/share/589XIRJBW#/screens/214068233

Ceph HOST Dashboard data-points:

Mock up link:

Ceph Host dashboard

Data-Points with their currently available sources or work that needs to be done to implement it are as under:

Summary card:

This card contains the following details:
- Name
- Status
- Cluster
- Role
- SELinux Mode
- Available:
  - Name and cluster details are readily available in etcd from /nodes/{node-id}/NodeContext/fqdn and /nodes/{node-id}/DetectedCluster/detected_cluster_id respectively. This might also be made available as part of GetNodeList api(Need confirmation).
  - Role is also available in etcd.
- Work involved:
- Status of node -- node-agent heartbeat? It's not implemented in the backend yet.
- SELinux mode -- not available in the backend yet
Utilization card

This card contains:
- CPU utilization
- Memory Utilization
- Swap Utilization
- Storage utilization
- Available:
- Point in time data (utilizations of cpu, memory and storage) is available using http://:5000/monitoring/nodes/summary?node_ids=['']
- Graphs for utilizations can be obtained using http://:5000/monitoring/nodes///stats where resource_name is one of the following: cpu -> "cpu.percent-user", memory -> "memory.percent-used", swap -> "swap.percent-used"
- Work involved:
- Add storage utilization graph -- 1 day
- Sync swap utilization to node summary -- 0.5 day
OSDs card:
- Available:
- OSD details are maintained as part of /clusters/{cluster-id}/maps/osd_map/data, which also provides the hostname of the node the osd is carved out of.
- OSD status is also available in same location
- OSD utilization along with thresholding is also available
- Work involved:
- Counter of number of osds node-wise.. As part of node-summary - ???
- OSD status wise counts as part of node-summary???.
- Alert count for osd
Services card:
- Available:
- /nodes/{node-id}/Service provides node-wise services status like tendrl-gluster-integration, tendrl-ceph-integration, tendrl-apid, glusterd, ceph-mon, ceph-osd, tendrl-node-agent and etcd.
- Work involved:
- smbd, nfs and nfs-ganesha, which are shown in the mock-up, are not available
System Performance:
- Available:
- IOPS graphs per disk per node are available in graphite
- Collectd provides a ping plugin which can be easily configured for latency
- Work involved:
- What is the iops graph @ node level: the average of the iops of all disks in the node? (If yes, 2 days) -- ???
Network Trends:

Details This provides the throughputs of cluster network and public network. These are not maintained as part of the backend currently. Need more info on this and then we can evaluate the sources of information for this. Note: There is a slight mismatch of this as in https://redhat.invisionapp.com/share/589XIRJBW#/screens/213318455 and its details as in https://redhat.invisionapp.com/share/589XIRJBW#/screens/214068233

Note:

Apart from the points mentioned under "Work Involved", all these data then need to be exposed via api

Tendrl / specifications

Monitoring: Dashboard data-points #145
