Today MCM exposes metrics that have a few shortcomings:
Metrics do not follow the best practices and recommendations from Prometheus (refer to this and this). We need to revisit the metrics and the labels used on them.
Contextual information is missing from the metrics, which prevents correlating metrics captured across different mcm and mcm-provider functions and Provider API calls.
While we recommend revisiting all the metrics, we also have concrete improvements for two metrics that were recently introduced:
Provider API metrics:
APIRequestDuration:
For this metric we propose adding labels that capture the following:
The Provider API operation that is invoked. Today we use the service label to capture that, but we should consider renaming it.
The Driver operation under which the provider API is invoked.
The machine name for which this API request is made.
The MCM reconciliation ID or run ID of the machine reconciler. The idea is to introduce a unique identifier for every reconcile run and pass it around to correlate logs and metrics.
The MCM reconciliation flow name. We could merge this with the run ID by choosing a naming convention that includes both.
DriverAPIRequestDuration:
For this metric we propose adding labels that capture the following:
The Driver operation under which the provider API is invoked.
The machine name for which this API request is made.
The MCM reconciliation ID or run ID of the machine reconciler. The idea is to introduce a unique identifier for every reconcile run and pass it around to correlate logs and metrics.
The MCM reconciliation flow name. We could merge this with the run ID by choosing a naming convention that includes both.
How to categorize this issue?
/area control-plane /area monitoring /kind enhancement /priority 3
What would you like to be added:
Provider Implementations:-
Why is this needed:
This allows us to observe metrics at different levels: