cloudspannerecosystem / autoscaler

Automatically scale the capacity of your Spanner instances based on their utilization.
Apache License 2.0

Improve Observability: add support for Metrics and Tracing #137

Closed: afarbos closed this issue 8 months ago

afarbos commented 1 year ago

To get proper monitoring of the autoscaler, the tool should expose metrics and traces for its operations.

I would suggest using an open-source library like OpenTelemetry for this.
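For context, a minimal sketch of what OpenTelemetry counter instrumentation could look like in the Node.js runtime (written here in TypeScript; the exporter choice, metric name, and attributes are illustrative assumptions, not the project's actual instrumentation):

```typescript
// Sketch only: assumes @opentelemetry/api and @opentelemetry/sdk-metrics.
// Exact provider setup varies across SDK versions.
import { metrics } from '@opentelemetry/api';
import {
  MeterProvider,
  PeriodicExportingMetricReader,
  ConsoleMetricExporter,
} from '@opentelemetry/sdk-metrics';

// Register a meter provider that exports metrics on a fixed interval.
const meterProvider = new MeterProvider({
  readers: [
    new PeriodicExportingMetricReader({
      exporter: new ConsoleMetricExporter(), // swap for a backend-specific exporter
      exportIntervalMillis: 60_000,
    }),
  ],
});
metrics.setGlobalMeterProvider(meterProvider);

// Record one scaling event with a few descriptive attributes (names are hypothetical).
const meter = metrics.getMeter('spanner-autoscaler');
const scalingEvents = meter.createCounter('scaler.scaling-events', {
  description: 'Number of scaling events requested',
});
scalingEvents.add(1, { spanner_instance_id: 'my-instance', scaling_method: 'LINEAR' });
```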

henrybell commented 1 year ago

Thanks @afarbos -- I agree that OpenTelemetry would potentially be a good fit for this use case, given its flexibility and its integrations with multiple backend providers. I will need to do some investigation upfront to make sure we can introduce this in a way that works across both the Cloud Functions and GKE runtime environments. We would need to be mindful that this has to work with short-lived executions (especially given #141) and within the CF runtime, and that it must not negatively impact existing users who may not want to adopt this capability, so there are some unknowns here. I'll come back to this issue with any questions. Thanks!

henrybell commented 1 year ago

My colleague @nielm has started adding instrumentation for the scaler component via the draft PR #143, with the plan being to instrument the poller in a similar way. @afarbos, if you have any specific ideas for anything else you'd like to see implemented, please add your thoughts here and we can discuss -- thanks!

afarbos commented 1 year ago

I think it looks like a good start overall for the pure metrics side of things, but it seems some key tags that I specified in the issue are missing from my POV:

henrybell commented 1 year ago

Thanks @afarbos! The instrumentation of the poller is planned, and the other metrics you called out here look sensible. I will defer to @nielm for any additional comments on the implementation as it continues, and any potential challenges. Once we have a framework for this instrumentation in place and the initial PR is complete (following some testing), would you be keen to contribute any additional instrumentation?

nielm commented 11 months ago

Updated PR #143 to add poller metrics and to include more information about these metrics and their attributes:

- Added attributes for scaling method and direction.
- Added a reason attribute for scaling denied (SAME_SIZE, MAX_SIZE, WITHIN_COOLDOWN).
- Added overall success/failure metrics for the event processing that catch any unexpected errors.

Per-instance scaling or polling failure metrics cover unexpected failures (as opposed to Denied, which is expected), so the underlying causes need to be analyzed from the logs (see the sketch below).
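Building on the earlier sketch (and reusing its `meter`), these counters and attributes might look roughly as follows; the names are assumptions based on the description above, and the real identifiers live in PR #143:

```typescript
// Sketch only: counter and attribute names are hypothetical.
const scalingSuccess = meter.createCounter('scaler.scaling-success');
const scalingDenied = meter.createCounter('scaler.scaling-denied');
const scalingFailed = meter.createCounter('scaler.scaling-failed');

// A successful scale-up using the LINEAR scaling method.
scalingSuccess.add(1, { scaling_method: 'LINEAR', scaling_direction: 'SCALE_UP' });

// A denied request, with the denial reason recorded as an attribute.
scalingDenied.add(1, {
  scaling_method: 'LINEAR',
  scaling_direction: 'SCALE_UP',
  scaling_denied_reason: 'WITHIN_COOLDOWN',
});

// An unexpected failure; the root cause still has to come from the logs.
scalingFailed.add(1, { scaling_method: 'LINEAR', scaling_direction: 'SCALE_UP' });
```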

nielm commented 10 months ago

Hey @afarbos, we have not submitted this yet because, during testing, we ran into some issues with OpenTelemetry when using short-lived processes, which made these counters unreliable:

The combination of these issues means that the first time a counter is used, it will fail to be submitted to GCP monitoring.

In a long-running process, these issues are not significant - any metrics that fail to be submitted will be re-submitted at the next scheduled time...

However, in short-lived processes such as those used by the autoscaler, this means that the counters are unreliable: the process cannot know when the counters have been submitted successfully, because this information is hidden by OpenTelemetry, and when the process exits, the metrics information is lost.
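To make the failure mode concrete, here is a sketch of the flush-on-exit pattern a short-lived invocation would need (reusing the provider and counters from the earlier sketches); the key point from the discussion above is that neither call reports whether the backend actually accepted the data points:

```typescript
// Sketch only: the short-lived-process pattern described above.
async function handleScalingEvent(): Promise<void> {
  scalingSuccess.add(1, { scaling_method: 'LINEAR', scaling_direction: 'SCALE_UP' });

  // forceFlush() asks the SDK to export any pending metrics, but (per the
  // discussion above) it does not tell the caller whether the export succeeded,
  // so the process cannot decide to retry before exiting.
  await meterProvider.forceFlush();

  // shutdown() flushes once more and releases resources; if the very first
  // write of a new metric is rejected by the backend, that data point is lost
  // when the process exits.
  await meterProvider.shutdown();
}
```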

The conclusion is that we cannot use OpenTelemetry to record metrics in the Autoscaler.

We have a workaround, which is to use the Google Cloud Monitoring APIs directly, without the OpenTelemetry wrappers. This way we can pre-create all the metric descriptors, detect when a submission fails, and keep the process alive, retrying until the submission succeeds.
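For illustration, a sketch of writing a data point directly with the Cloud Monitoring client library (@google-cloud/monitoring); the metric type and labels below are hypothetical, and error handling/retry is left to the caller:

```typescript
// Sketch only: direct Cloud Monitoring write, bypassing OpenTelemetry.
import { MetricServiceClient } from '@google-cloud/monitoring';

const monitoringClient = new MetricServiceClient();

async function writeScalingCounter(projectId: string, value: number): Promise<void> {
  // Metric descriptors can be pre-created with createMetricDescriptor() so the
  // first write does not race descriptor auto-creation.
  await monitoringClient.createTimeSeries({
    name: monitoringClient.projectPath(projectId),
    timeSeries: [{
      metric: {
        type: 'custom.googleapis.com/spanner_autoscaler/scaling_success', // hypothetical
        labels: { scaling_method: 'LINEAR', scaling_direction: 'SCALE_UP' },
      },
      resource: { type: 'global', labels: { project_id: projectId } },
      points: [{
        interval: { endTime: { seconds: Math.floor(Date.now() / 1000) } },
        value: { int64Value: value },
      }],
    }],
  });
  // If createTimeSeries rejects, the caller sees the error and can retry before
  // the process exits, which is the control OpenTelemetry does not expose.
}
```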

Is this solution - to write to Cloud Monitoring directly - acceptable for you, given that you suggested using OpenTelemetry?

afarbos commented 10 months ago

I suggested OpenTelemetry because it is open source and vendor-neutral. We do not use Cloud Monitoring; if you wrote to it directly, we would need to build a metric forwarder, making it harder to integrate.

Is this a first-time-use issue only? I am not sure I follow whether those errors are caused by misconfiguration or by something else. I would expect the OTel client implementation to block/wait until all the metrics are exported before shutting down.

nielm commented 10 months ago

> Is this a first-time-use issue only? I am not sure I follow whether those errors are caused by misconfiguration or by something else.

They are not a misconfiguration - it's a combination of how OpenTelemetry sends metrics and how Cloud Monitoring handles metrics it has not seen before.

> I would expect the OTel client implementation to block/wait until all the metrics are exported before shutting down.

Sadly no, it does not. You can force a flush, but there is no way to tell if it succeeded, and no way to tell if there are metrics waiting to be sent.

nielm commented 8 months ago

Metrics added in #143