Metrics: Eat our own dog food

apache / accumulo

Apache Accumulo

https://accumulo.apache.org

Apache License 2.0

Metrics: Eat our own dog food #3649

Open dlmarion opened 1 year ago

dlmarion commented 1 year ago

The Monitor sends an RPC request to the Manager (getManagerStats()) to get information about the cluster. The Manager aggregates and maintains this information as it receives heartbeat information from the TabletServers and performs its management functions. If we want multiple Managers, or want to simplify the Manager codebase, we could remove this functionality and have the Monitor use the metrics instead.

The Accumulo Monitor server could display graphs that are generated from the metrics that we export via Micrometer. For example, the server processes could be configured to export their metrics to an InfluxDB server, and the Monitor could query that InfluxDB server for the data needed to generate the graphs. If an InfluxDB server is not configured, a small in-memory one could be started when the Monitor starts.

dlmarion commented 1 year ago

Ideas from the discussion regarding this topic:

1. We could remove the graphs entirely from the Monitor, and users could use existing tools (e.g. Grafana).
2. We could modify the Monitor to contain a metrics sink and buffer about an hour of the metrics, considering that the graphs only show an hour of data.

Option (2) lets us keep some of the functionality that the Monitor provides today (REST endpoints, graphs, etc.) and would allow us to remove all of the special metric handling in the servers that supports the Monitor.
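To make the buffering idea concrete, here is a minimal sketch of a sink that keeps roughly one hour of samples per metric. It is not Accumulo code: the class `RollingMetricBuffer`, its methods, and the metric name used in the usage note are all hypothetical, and it assumes samples for a given metric arrive in time order.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names exist in Accumulo.
final class RollingMetricBuffer {

  // One timestamped measurement.
  static final class Sample {
    final Instant time;
    final double value;

    Sample(Instant time, double value) {
      this.time = time;
      this.value = value;
    }
  }

  private final Duration retention;
  private final Map<String, Deque<Sample>> series = new HashMap<>();

  RollingMetricBuffer(Duration retention) {
    this.retention = retention;
  }

  // Append a sample, then drop samples older than the retention window.
  // Assumes samples for one metric arrive in time order.
  synchronized void add(String metric, double value, Instant now) {
    Deque<Sample> samples = series.computeIfAbsent(metric, k -> new ArrayDeque<>());
    samples.addLast(new Sample(now, value));
    evict(samples, now);
  }

  // Return a copy of the samples still inside the window, oldest first.
  synchronized List<Sample> snapshot(String metric, Instant now) {
    Deque<Sample> samples = series.get(metric);
    if (samples == null) {
      return List.of();
    }
    evict(samples, now);
    return new ArrayList<>(samples);
  }

  private void evict(Deque<Sample> samples, Instant now) {
    Instant cutoff = now.minus(retention);
    while (!samples.isEmpty() && samples.peekFirst().time.isBefore(cutoff)) {
      samples.removeFirst();
    }
  }
}
```

With `new RollingMetricBuffer(Duration.ofHours(1))`, a graph endpoint could call `snapshot("some.metric", Instant.now())` and plot whatever comes back. Memory use is bounded by the reporting rate times the retention window.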

ctubbsii commented 1 year ago

I wouldn't want to rely on InfluxDB as a dependency. But I think we could provide a lightweight aggregation service for Micrometer, enabled through the metrics configuration, that simply sends metrics to the monitor via a REST endpoint so the monitor can draw graphs from them. That is how the monitor log aggregation works today (in 2.1), and I think it works well. Alternatively, we could just get rid of the graphs on the monitor, but I think people would miss them, because they probably do add some value, albeit not the same level of value as a full-fledged metrics aggregation service.
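The push-over-REST approach could look roughly like the sketch below, written with only JDK classes. Everything specific here is an assumption made for illustration: the class `MetricPushSketch`, the `/rest/metrics` path, and the JSON shape are hypothetical, and the Micrometer wiring that would call `push` is left out. Metric names are assumed to contain no characters that need JSON escaping.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical sketch; the endpoint path and JSON shape are assumptions.
final class MetricPushSketch {

  // Server-process side: POST one sample and return the HTTP status code.
  static int push(HttpClient client, URI endpoint, String name, double value, long epochMillis)
      throws IOException, InterruptedException {
    // Assumes the metric name needs no JSON escaping.
    String json = "{\"name\":\"" + name + "\",\"value\":" + value + ",\"time\":" + epochMillis + "}";
    HttpRequest request = HttpRequest.newBuilder(endpoint)
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(json))
        .build();
    return client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode();
  }

  // Monitor side: a stand-in endpoint that stores each posted body in the given list.
  // The list must be safe for concurrent use.
  static HttpServer startReceiver(List<String> received) throws IOException {
    // Port 0 asks the OS for any free port.
    HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
    server.createContext("/rest/metrics", exchange -> {
      byte[] body = exchange.getRequestBody().readAllBytes();
      received.add(new String(body, StandardCharsets.UTF_8));
      exchange.sendResponseHeaders(204, -1); // 204 No Content, no response body
      exchange.close();
    });
    server.start();
    return server;
  }
}
```

A real receiver would parse the body and feed a bounded buffer rather than an unbounded list, and a real sender would batch samples instead of issuing one request per measurement.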

keith-turner commented 1 year ago

The following comment is an attempt to summarize a discussion around this issue.

The manager currently has a custom process that periodically contacts all tservers via a thrift method call to obtain metrics and check whether each tserver is alive. In addition to removing the custom thrift metrics, the custom liveness check could also be removed and replaced with documentation on how to use metrics to check process health. Users could then kill unhealthy processes in many different system-dependent ways. This would likely be much more effective at keeping an Accumulo instance healthy.
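One simple way to derive health from metrics is to treat a process as suspect once it has been silent for too long. The sketch below illustrates that rule only; the discussion does not define what "unhealthy" means, so the silence threshold, the class `MetricLivenessSketch`, and the host:port identifiers are all assumptions.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; "silent for too long" is an assumed definition of unhealthy.
final class MetricLivenessSketch {

  // Most recent time each process reported any metric, keyed by an identifier
  // such as host:port. TreeMap keeps the output in a stable, sorted order.
  private final Map<String, Instant> lastSeen = new TreeMap<>();
  private final Duration maxSilence;

  MetricLivenessSketch(Duration maxSilence) {
    this.maxSilence = maxSilence;
  }

  // Record that a process reported a metric, keeping the latest timestamp.
  synchronized void reported(String process, Instant when) {
    lastSeen.merge(process, when, (a, b) -> a.isAfter(b) ? a : b);
  }

  // List every known process whose last report is older than maxSilence.
  synchronized List<String> unhealthy(Instant now) {
    List<String> result = new ArrayList<>();
    for (Map.Entry<String, Instant> entry : lastSeen.entrySet()) {
      if (Duration.between(entry.getValue(), now).compareTo(maxSilence) > 0) {
        result.add(entry.getKey());
      }
    }
    return result;
  }
}
```

An external supervisor could poll `unhealthy(Instant.now())` and restart or kill the listed processes using whatever mechanism the deployment provides. A process that never reported at all is not listed, so a real check would also need the expected set of processes.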

EdColeman commented 1 year ago

Old issue that touched on metrics for monitoring - issue #946