Apache Accumulo
https://accumulo.apache.org
Apache License 2.0

Modify MetricsProducer javadoc table #4815

Closed dlmarion closed 2 weeks ago

dlmarion commented 2 months ago

The javadoc table in MetricsProducer was added in 2.1 to help users map metrics from prior versions to the new ones. Micrometer replaced Hadoop Metrics2 in version 2.1.0, which changed the format of the metrics. Post 2.1, we can change the javadoc table so that it is no longer an aid for metric conversion and is instead a user guide to the metrics. The table could be modified to include a description of what each metric indicates, which properties might influence the behavior of the code being monitored, what conditions to look for, and what actions to take.

ctubbsii commented 2 months ago

Another thing to consider would be to auto-generate markdown for publishing to the docs on the website to document the available metrics on each release, like we do with system and client properties.

DomGarguilo commented 2 months ago

I am trying to think of what the best solution is here. For the number of metrics that we have, it seems like a table might get too big (you would have to scroll back up to check the column titles).

Another option would be to have just a "list" of the metrics each with its own header and have the info about each one under that. That way we could also break the metrics up into sections like general server metrics or compactor metrics like we do now with the comments. This might make things more readable.

Here is what that might look like as an example:

# General Server Metrics

## `metrics.server.idle`

- **Type:** Gauge
- **Description:** Indicates if the server is idle (1 = idle, 0 = not idle).
- **Related Properties:**
  - `accumulo.server.idle.timeout`: Influences how long the server waits before becoming idle. Longer timeouts lead to longer periods of non-idleness.
- **Conditions to Monitor:**
  - Extended periods of idleness during high usage might indicate issues.
- **Recommended Actions:**
  - Review system logs for potential issues or unexpected activity.

# Compactor Metrics

## `metrics.compactor.majc.stuck`

- **Type:** LongTaskTimer
- **Description:** Tracks the duration of major compaction tasks that get stuck.
- **Related Properties:**
  - `accumulo.compactor.max.running.tasks`: Influences how many compaction tasks can run concurrently. A higher value could increase the chance of compactions getting stuck under resource pressure.
- **Conditions to Monitor:**
  - Long task durations without completion may indicate resource contention, particularly with disk I/O.
- **Recommended Actions:**
  - Check disk usage and resource allocation. High load systems may require tuning.

## `metrics.compactor.entries.read`

- **Type:** FunctionCounter
- **Description:** Counts the number of entries read by all threads performing compactions.
- **Related Properties:**
  - `accumulo.compactor.threadpool.size`: Affects how quickly entries can be read. A larger thread pool can speed up the reading process but may consume more system resources.
- **Conditions to Monitor:**
  - Low read count during periods of expected high compaction activity.
- **Recommended Actions:**
  - Ensure that the compactor thread pool is properly sized for the workload.

# Fate Metrics

## Metric: `metrics.fate.ops`

- **Type:** Gauge
- **Description:** Tracks the number of current Fate operations in any state.
- **Related Properties:**
  - `accumulo.fate.max.transactions`: Limits the number of concurrent Fate operations. Higher limits allow for more transactions but may also increase the risk of contention or failure under high load.
- **Conditions to Monitor:**
  - High number of operations in progress could signal stuck or delayed transactions.
- **Recommended Actions:**
  - Investigate if operations are stuck or taking too long to complete.

I am not sure of the best way to structure the raw text for this list, though. It may turn out that keeping it in a table for development works better, with that table then rendered out to this list. Not too sure.

I am interested in hearing others' thoughts on the table vs. list argument though.

keith-turner commented 2 months ago

@DomGarguilo I like the sections-with-list example. I think this is better than a table because it allows more information per metric (like the recommended action and related properties sections). The table view encourages authors to be really terse.

DomGarguilo commented 2 months ago

> Another thing to consider would be to auto-generate markdown for publishing to the docs on the website to document the available metrics on each release, like we do with system and client properties.

This is a good idea. I modeled #4850 after the way Property uses an enum to store all the info that is rendered out into a markdown file.
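For readers following along, here is a rough sketch of the general idea: an enum holds the documentation for each metric, and a small renderer turns it into sectioned markdown. This is not the code from #4850. The class, enum, and field names are hypothetical, and the metric names and descriptions are simply copied from the illustrative list earlier in this thread.

```java
// Hypothetical sketch: enum-backed metric docs rendered to markdown.
// None of these names are taken from the Accumulo code base.
class MetricDocSketch {

  // Section headings, mirroring the groupings used in the example list.
  enum Category {
    GENERAL_SERVER("General Server Metrics"),
    COMPACTOR("Compactor Metrics"),
    FATE("Fate Metrics");

    final String title;

    Category(String title) {
      this.title = title;
    }
  }

  // One constant per metric; entries are illustrative only.
  enum Metric {
    SERVER_IDLE("metrics.server.idle", "Gauge",
        "Indicates if the server is idle (1 = idle, 0 = not idle).",
        Category.GENERAL_SERVER),
    COMPACTOR_MAJC_STUCK("metrics.compactor.majc.stuck", "LongTaskTimer",
        "Tracks the duration of major compaction tasks that get stuck.",
        Category.COMPACTOR),
    FATE_OPS("metrics.fate.ops", "Gauge",
        "Tracks the number of current Fate operations in any state.",
        Category.FATE);

    final String metricName;
    final String type;
    final String description;
    final Category category;

    Metric(String metricName, String type, String description, Category category) {
      this.metricName = metricName;
      this.type = type;
      this.description = description;
      this.category = category;
    }
  }

  // Renders one top-level heading per category and one sub-heading per metric.
  static String render() {
    StringBuilder sb = new StringBuilder();
    for (Category c : Category.values()) {
      sb.append("# ").append(c.title).append("\n\n");
      for (Metric m : Metric.values()) {
        if (m.category != c) {
          continue;
        }
        sb.append("## `").append(m.metricName).append("`\n\n");
        sb.append("- **Type:** ").append(m.type).append("\n");
        sb.append("- **Description:** ").append(m.description).append("\n\n");
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.print(render());
  }
}
```

Keeping the text in enum constants means the same data could feed both the javadoc and the generated website page, and adding fields for related properties or recommended actions would only require extending the constructor and the renderer.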

dlmarion commented 2 weeks ago

@DomGarguilo - can this be closed now?

DomGarguilo commented 2 weeks ago

Closed via #4850