Open dlmarion opened 1 year ago
Ideas from the discussion regarding this topic:
(2) allows us to keep some of the same functionality that the Monitor provides (REST endpoints, graphs, etc.) and would allow us to remove all of the special metric handling in the servers that support the monitor.
I wouldn't want to rely on InfluxDB as a dependency. But I think we could set up a lightweight aggregation service for Micrometer, enabled through the metrics configuration, that simply sends metrics to the monitor via a REST endpoint so the monitor can draw graphs from them. That is how the monitor log aggregation works today (in 2.1), and I think it works well. Alternatively, we could just get rid of the graphs on the monitor, but I think people would miss them, because they probably do add some value, albeit not the same level of value as a full-fledged metrics aggregation service.
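To make the push-to-monitor idea concrete, here is a minimal sketch using only JDK classes. It is not how the monitor works today. The `/metrics` path, the one-`name value`-pair-per-line payload, the `MAX_SAMPLES` bound, and the class and method names are all assumptions made for illustration. A real version would presumably hang off a Micrometer registry on the sending side.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: monitor-side receiver plus a server-side push helper. */
class MonitorMetricsSketch {

  // Assumed bound on how many recent samples are kept per metric for graphing.
  static final int MAX_SAMPLES = 100;

  // Metric name -> most recent values, oldest first.
  static final Map<String, Deque<Double>> SAMPLES = new ConcurrentHashMap<>();

  /** Records one sample, dropping the oldest once the buffer is full. */
  static void record(String name, double value) {
    Deque<Double> buf = SAMPLES.computeIfAbsent(name, k -> new ArrayDeque<>());
    synchronized (buf) {
      if (buf.size() == MAX_SAMPLES) {
        buf.removeFirst();
      }
      buf.addLast(value);
    }
  }

  /** Returns a snapshot of the buffered values for one metric. */
  static List<Double> recent(String name) {
    Deque<Double> buf = SAMPLES.get(name);
    if (buf == null) {
      return List.of();
    }
    synchronized (buf) {
      return List.copyOf(buf);
    }
  }

  /** Starts the assumed endpoint; the request body is one "name value" pair per line. */
  static HttpServer startMonitorEndpoint(int port) throws IOException {
    HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
    server.createContext("/metrics", exchange -> {
      String body = new String(exchange.getRequestBody().readAllBytes(), StandardCharsets.UTF_8);
      for (String line : body.split("\n")) {
        String[] parts = line.trim().split(" ");
        if (parts.length != 2) {
          continue; // skip blank or malformed lines
        }
        try {
          record(parts[0], Double.parseDouble(parts[1]));
        } catch (NumberFormatException e) {
          // skip values that are not numbers
        }
      }
      exchange.sendResponseHeaders(204, -1); // no response body
      exchange.close();
    });
    server.start();
    return server;
  }

  /** Pushes a payload from a server process to the monitor; returns the HTTP status code. */
  static int push(int port, String payload) throws IOException, InterruptedException {
    HttpRequest request = HttpRequest.newBuilder(URI.create("http://127.0.0.1:" + port + "/metrics"))
        .POST(HttpRequest.BodyPublishers.ofString(payload))
        .build();
    return HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.discarding())
        .statusCode();
  }
}
```

In this sketch a server process would call something like `push(port, "tserver.ingest.rate 12.5")` (a made-up metric name) on a timer, and the monitor's graph code would read `recent("tserver.ingest.rate")`. Keeping only a bounded window per metric is the trade-off that lets the monitor draw graphs without pulling in a full time-series store.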
The following comment is an attempt to summarize a discussion around this issue.
The manager currently has a custom process that periodically contacts all tservers via a thrift method call to obtain metrics and check whether each tserver is alive. In addition to removing the custom thrift metrics, the custom liveness check could also be removed and replaced with documentation on how to use metrics to check process health. Users could then kill unhealthy processes in many different system-dependent ways. This would likely be much more effective at keeping an Accumulo instance healthy.
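As a rough illustration of a metrics-based health check, the sketch below flags any server whose last metric report is older than a threshold. The map of last-report times, the threshold, and all names are assumptions. In practice this logic would more likely live in an external alerting rule than in Accumulo itself.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: derive liveness from how recently each server reported metrics. */
class MetricsHealthCheckSketch {

  /** Returns, in sorted order, the servers whose last report is older than maxAge at time now. */
  static List<String> staleServers(Map<String, Instant> lastReport, Instant now, Duration maxAge) {
    List<String> stale = new ArrayList<>();
    for (Map.Entry<String, Instant> entry : lastReport.entrySet()) {
      Duration age = Duration.between(entry.getValue(), now);
      if (age.compareTo(maxAge) > 0) {
        stale.add(entry.getKey());
      }
    }
    Collections.sort(stale);
    return stale;
  }
}
```

What to do with a stale server (restart it, page someone, fence it off) is left to whatever tooling the deployment already uses, which is the point of moving the check out of the manager.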
Old issue that touched on metrics for monitoring - issue #946
The Monitor sends an RPC request to the Manager (getManagerStats()) to get information about the cluster. This information is aggregated and maintained by the Manager as it receives heartbeat information from the TabletServers and performs its management functions. If we want multiple Managers, or want to simplify the Manager codebase, we could remove this functionality and just have the Monitor use the metrics.
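If the Monitor consumed metrics directly, the cluster-wide numbers that getManagerStats() provides today would have to be computed on the Monitor side. A minimal sketch of that aggregation follows. The `Sample` shape and the metric names in the usage note are invented for illustration, and summing is only appropriate for additive metrics such as counts.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: roll per-server metric samples up into cluster totals. */
class ClusterSummarySketch {

  /** One metric value as reported by a single server (assumed shape). */
  static final class Sample {
    final String server;
    final String metric;
    final double value;

    Sample(String server, String metric, double value) {
      this.server = server;
      this.metric = metric;
      this.value = value;
    }
  }

  /** Sums each metric across all servers; keys are metric names in sorted order. */
  static Map<String, Double> clusterTotals(List<Sample> samples) {
    Map<String, Double> totals = new TreeMap<>();
    for (Sample s : samples) {
      totals.merge(s.metric, s.value, Double::sum);
    }
    return totals;
  }
}
```

For example, samples of a made-up `entries` metric from two tservers would be merged into one `entries` total for the cluster page. Rates and averages would need a different reduction than a plain sum.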
The Accumulo Monitor server could display graphs that are generated from the metrics that we export via Micrometer. For example, the server processes could be configured to export their metrics to an InfluxDB server, and the Monitor could query that InfluxDB server to get the data for the graphs. If an InfluxDB server is not configured, a small in-memory one could be started when the Monitor is started.