cloudspannerecosystem / autoscaler

Automatically scale the capacity of your Spanner instances based on their utilization.
Apache License 2.0

Improve Observability: add support for Metrics and Tracing #137

Closed: afarbos closed this issue 8 months ago

afarbos commented 1 year ago

To get proper monitoring of the autoscaler, the tool should expose metrics and traces for its operations.

I would suggest using an open-source library like OpenTelemetry for this.
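For context, a minimal sketch of what OpenTelemetry counter instrumentation could look like in the Node.js runtime (written here in TypeScript; the exporter choice, metric name, and attributes are illustrative assumptions, not the project's actual instrumentation):

```typescript
// Sketch only: assumes @opentelemetry/api and @opentelemetry/sdk-metrics.
// Exact provider setup varies across SDK versions.
import { metrics } from '@opentelemetry/api';
import {
  MeterProvider,
  PeriodicExportingMetricReader,
  ConsoleMetricExporter,
} from '@opentelemetry/sdk-metrics';

// Register a meter provider that exports metrics on a fixed interval.
const meterProvider = new MeterProvider({
  readers: [
    new PeriodicExportingMetricReader({
      exporter: new ConsoleMetricExporter(), // swap for a backend-specific exporter
      exportIntervalMillis: 60_000,
    }),
  ],
});
metrics.setGlobalMeterProvider(meterProvider);

// Record one scaling event with a few descriptive attributes (names are hypothetical).
const meter = metrics.getMeter('spanner-autoscaler');
const scalingEvents = meter.createCounter('scaler.scaling-events', {
  description: 'Number of scaling events requested',
});
scalingEvents.add(1, { spanner_instance_id: 'my-instance', scaling_method: 'LINEAR' });
```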

henrybell commented 1 year ago

Thanks @afarbos -- I agree that OpenTelemetry would potentially be a good fit for this use case, given its flexibility and its integrations with multiple backend providers. I will need to do some investigation upfront to make sure we can introduce this in a way that works across both the Cloud Functions and GKE runtime environments. We would need to be mindful that this has to work with short-lived executions (especially given #141) and within the CF runtime, and that it must not negatively impact existing users who may not want to adopt this capability, so there are some unknowns here. I'll come back to this issue with any questions. Thanks!

henrybell commented 1 year ago

My colleague @nielm has started adding instrumentation for the scaler component via the draft PR #143, with the plan being to instrument the poller in a similar way. @afarbos, if you have any specific ideas for anything else you'd like to see implemented, please add your thoughts here and we can discuss -- thanks!

afarbos commented 1 year ago

I think it looks like a good start overall for the pure metrics side of things, but it seems some key tags that I specified in the issue are missing from my POV:

henrybell commented 1 year ago

Thanks @afarbos! The instrumentation of the poller is planned, and the other metrics you called out here look sensible. I will defer to @nielm for any additional comments on the implementation as it continues, and any potential challenges. Once we have a framework for this instrumentation in place and the initial PR is complete (following some testing), would you be keen to contribute any additional instrumentation?

nielm commented 11 months ago

Updated PR #143 to add poller metrics and to include more information about these metrics and their attributes:

- Added attributes for scaling method and direction.
- Added a reason attribute for scaling denied (SAME_SIZE, MAX_SIZE, WITHIN_COOLDOWN).
- Added overall success/failure metrics for the event processing that catch any unexpected errors.

Per-instance scaling or polling failure metrics cover unexpected failures (as opposed to Denied, which is expected), so the underlying causes need to be analyzed from the logs (see the sketch below).
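Building on the earlier sketch (and reusing its `meter`), these counters and attributes might look roughly as follows; the names are assumptions based on the description above, and the real identifiers live in PR #143:

```typescript
// Sketch only: counter and attribute names are hypothetical.
const scalingSuccess = meter.createCounter('scaler.scaling-success');
const scalingDenied = meter.createCounter('scaler.scaling-denied');
const scalingFailed = meter.createCounter('scaler.scaling-failed');

// A successful scale-up using the LINEAR scaling method.
scalingSuccess.add(1, { scaling_method: 'LINEAR', scaling_direction: 'SCALE_UP' });

// A denied request, with the denial reason recorded as an attribute.
scalingDenied.add(1, {
  scaling_method: 'LINEAR',
  scaling_direction: 'SCALE_UP',
  scaling_denied_reason: 'WITHIN_COOLDOWN',
});

// An unexpected failure; the root cause still has to come from the logs.
scalingFailed.add(1, { scaling_method: 'LINEAR', scaling_direction: 'SCALE_UP' });
```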

nielm commented 10 months ago

Hey @afarbos, we have not submitted this yet because, during testing, we ran into some issues with OpenTelemetry when using short-lived processes, which made these counters unreliable:

The combination of these issues means that the first time a counter is used, it will fail to be submitted to GCP monitoring.

In a long-running process, these issues are not significant - any metrics that fail to be submitted will be re-submitted at the next scheduled time...

However, in short-lived processes such as those used by the autoscaler, this means that the counters are unreliable: the process cannot know when the counters have been submitted successfully, because this information is hidden by OpenTelemetry, and when the process exits, the metrics information is lost.
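To make the failure mode concrete, here is a sketch of the flush-on-exit pattern a short-lived invocation would need (reusing the provider and counters from the earlier sketches); the key point from the discussion above is that neither call reports whether the backend actually accepted the data points:

```typescript
// Sketch only: the short-lived-process pattern described above.
async function handleScalingEvent(): Promise<void> {
  scalingSuccess.add(1, { scaling_method: 'LINEAR', scaling_direction: 'SCALE_UP' });

  // forceFlush() asks the SDK to export any pending metrics, but (per the
  // discussion above) it does not tell the caller whether the export succeeded,
  // so the process cannot decide to retry before exiting.
  await meterProvider.forceFlush();

  // shutdown() flushes once more and releases resources; if the very first
  // write of a new metric is rejected by the backend, that data point is lost
  // when the process exits.
  await meterProvider.shutdown();
}
```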

The conclusion is that we cannot use OpenTelemetry to record metrics in the Autoscaler.

We have a workaround, which is to use the Google Cloud Monitoring APIs directly, without the OpenTelemetry wrappers. This way we can pre-create all the metric descriptors, detect when a submission fails, and keep the process alive, retrying until the submission succeeds.
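For illustration, a sketch of writing a data point directly with the Cloud Monitoring client library (@google-cloud/monitoring); the metric type and labels below are hypothetical, and error handling/retry is left to the caller:

```typescript
// Sketch only: direct Cloud Monitoring write, bypassing OpenTelemetry.
import { MetricServiceClient } from '@google-cloud/monitoring';

const monitoringClient = new MetricServiceClient();

async function writeScalingCounter(projectId: string, value: number): Promise<void> {
  // Metric descriptors can be pre-created with createMetricDescriptor() so the
  // first write does not race descriptor auto-creation.
  await monitoringClient.createTimeSeries({
    name: monitoringClient.projectPath(projectId),
    timeSeries: [{
      metric: {
        type: 'custom.googleapis.com/spanner_autoscaler/scaling_success', // hypothetical
        labels: { scaling_method: 'LINEAR', scaling_direction: 'SCALE_UP' },
      },
      resource: { type: 'global', labels: { project_id: projectId } },
      points: [{
        interval: { endTime: { seconds: Math.floor(Date.now() / 1000) } },
        value: { int64Value: value },
      }],
    }],
  });
  // If createTimeSeries rejects, the caller sees the error and can retry before
  // the process exits, which is the control OpenTelemetry does not expose.
}
```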

Is this solution - to write to Cloud Monitoring directly - acceptable for you, given that you suggested using OpenTelemetry?

afarbos commented 10 months ago

I suggested OpenTelemetry because it is open source and vendor-neutral. We do not use Cloud Monitoring; if you wrote to it directly, we would need to build a metric forwarder, making it harder to integrate.

Is this a first-time-use issue only? I am not sure I follow whether those errors are caused by misconfiguration or by something else. I would expect the OTel client implementation to block/wait until all the metrics are exported before shutting down.

nielm commented 10 months ago

> Is this a first-time-use issue only? I am not sure I follow whether those errors are caused by misconfiguration or by something else.

They are not a misconfiguration - it's a combination of how OpenTelemetry sends metrics and how Cloud Monitoring handles metrics it has not seen before.

> I would expect the OTel client implementation to block/wait until all the metrics are exported before shutting down.

Sadly no, it does not. You can force a flush, but there is no way to tell if it succeeded, and no way to tell if there are metrics waiting to be sent.

nielm commented 8 months ago

Metrics added in #143