Thanks @afarbos -- I agree that OpenTelemetry would potentially be a good fit for this use case, given its flexibility and integration with multiple backend providers. I will need to do some investigation upfront to make sure that we can introduce this in a way that works across both the Cloud Functions and GKE runtime environments. We would need to be mindful that this has to work with short-lived executions (especially given #141) and within the CF runtime, and that it does not negatively impact existing users who may not want to adopt this capability, so there are some unknowns here. I'll come back to this issue with any questions. Thanks!
My colleague @nielm has started adding instrumentation via the draft PR #143 for the scaler component, with the plan being to instrument the poller in a similar way. @afarbos, if you have any specific ideas for anything else you'd like to see implemented, it would be great if you could update here with your thoughts and we can discuss -- thanks!
I think it looks like a good start overall for the pure metrics side of things, but from my POV it seems some key tags that I specified in the issue are missing:
Thanks @afarbos! The instrumentation of the poller is planned, and the other metrics you called out here look sensible. I will defer to @nielm for any additional comments on the implementation as it continues, and on any potential challenges. Once we have a framework for this instrumentation in place and the initial PR is complete (following some testing), would you be keen to contribute any additional instrumentation?
Updated PR #143 to add poller metrics and to add more information about these metrics and their attributes.
Added attributes for scaling method and direction.
Added reason attribute for scaling denied (SAME_SIZE, MAX_SIZE, WITHIN_COOLDOWN).
Added overall success/failure metrics for the event processing that catch any unexpected errors.
Per-instance scaling or polling failure metrics cover unexpected failures (as opposed to Denied, which is expected), so the details need to be analyzed from logging.
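For illustration, here is a minimal sketch of what counters carrying these attributes could look like with the OpenTelemetry metrics API for Node; the meter, counter, and attribute names below are assumptions for the example, not necessarily the ones used in PR #143:

```typescript
import {metrics} from '@opentelemetry/api';

// Hypothetical meter and instrument names -- the actual names live in PR #143.
const meter = metrics.getMeter('spanner-autoscaler');

const scalingSuccess = meter.createCounter('scaler/scaling-success', {
  description: 'Number of scaling events that were applied',
});
const scalingDenied = meter.createCounter('scaler/scaling-denied', {
  description: 'Number of scaling events that were denied',
});

// Scaling method and direction are recorded as attributes on every data point.
scalingSuccess.add(1, {
  scaling_method: 'LINEAR',
  scaling_direction: 'SCALE_UP',
});

// Denied events additionally carry the denial reason.
scalingDenied.add(1, {
  scaling_method: 'LINEAR',
  scaling_direction: 'SCALE_UP',
  scaling_denied_reason: 'WITHIN_COOLDOWN',
});
```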
Hey @afarbos, we have not submitted this yet because during testing we ran into some issues with OpenTelemetry when using short-lived processes, which made these counters unreliable:
The combination of these issues means that the first time a counter is used, it will fail to be submitted to GCP monitoring.
In a long-running process, these issues are not significant - any metrics that fail to be submitted will be re-submitted at the next scheduled time...
However, in short-lived processes such as those used in the autoscaler, the counters are unreliable: the process cannot know when the counters have been submitted successfully, because this information is hidden by OpenTelemetry, and when the process exits, the metric information is lost.
The conclusion is that we cannot use OpenTelemetry to record metrics in the Autoscaler.
We have a workaround, which is to use the Google Cloud Monitoring APIs directly, without the OpenTelemetry wrappers. This way we can pre-create all the metric descriptors, see when a submission fails, and keep the process alive, retrying until the submission succeeds.
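For context, a rough sketch of that workaround using the @google-cloud/monitoring client directly; the metric type, labels, resource, and retry policy here are illustrative assumptions, not the actual code in PR #143:

```typescript
import {MetricServiceClient} from '@google-cloud/monitoring';

const client = new MetricServiceClient();

// Hypothetical counter write: the metric type and labels are placeholders.
async function writeScalingSuccess(projectId: string): Promise<void> {
  const request = {
    name: client.projectPath(projectId),
    timeSeries: [
      {
        metric: {
          type: 'custom.googleapis.com/scaler/scaling-success',
          labels: {scaling_method: 'LINEAR', scaling_direction: 'SCALE_UP'},
        },
        resource: {type: 'global', labels: {project_id: projectId}},
        points: [
          {
            interval: {endTime: {seconds: Math.floor(Date.now() / 1000)}},
            value: {int64Value: 1},
          },
        ],
      },
    ],
  };

  // Because the process is short-lived, keep it alive and retry until the
  // write is acknowledged (or we give up), instead of relying on a
  // background exporter that may never get to run.
  for (let attempt = 1; attempt <= 5; attempt++) {
    try {
      await client.createTimeSeries(request);
      return;
    } catch (err) {
      if (attempt === 5) throw err;
      await new Promise((resolve) => setTimeout(resolve, 1000 * attempt));
    }
  }
}
```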
Is this solution - to write to Cloud Monitoring directly - acceptable for you, given that you suggested using OpenTelemetry?
I suggested OpenTelemetry because it's open source and vendor neutral. We do not use Cloud Monitoring; if you wrote to it directly, we would need to build a metric forwarder, which would make it harder to integrate.
Is this a first-time-use issue only? I am not sure I follow whether those errors are just misconfiguration-based or something else. I would expect the OTel client implementation to block/wait until all the metrics are exported before shutting down.
Is this a first-time-use issue only? I am not sure I follow whether those errors are just misconfiguration-based or something else.
They are not a misconfiguration - it's a combination of how OpenTelemetry sends metrics and how Cloud Monitoring handles metrics it has not seen before.
I would expect the OTel client implementation to block/wait until all the metrics are exported before shutting down.
Sadly no, it does not. You can force a flush, but there is no way to tell if it succeeded, and no way to tell if there are metrics waiting to be sent.
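To make that concrete, here is a minimal sketch, assuming the Node SDK (@opentelemetry/sdk-metrics 1.x) with a PeriodicExportingMetricReader and the Cloud Monitoring exporter; the meter and counter names are placeholders:

```typescript
import {MeterProvider, PeriodicExportingMetricReader} from '@opentelemetry/sdk-metrics';
import {MetricExporter} from '@google-cloud/opentelemetry-cloud-monitoring-exporter';

const meterProvider = new MeterProvider();
meterProvider.addMetricReader(
  new PeriodicExportingMetricReader({
    exporter: new MetricExporter(),
    // The export interval is far longer than a single Cloud Functions run.
    exportIntervalMillis: 60_000,
  })
);

const counter = meterProvider
  .getMeter('spanner-autoscaler') // hypothetical meter name
  .createCounter('poller/polling-success');

export async function handleEvent(): Promise<void> {
  counter.add(1);

  // forceFlush() resolves even if the export failed (errors only reach the
  // global error handler), so the process has no way to know whether the
  // data point was accepted before it exits -- and once it exits, any
  // pending metric data is gone.
  await meterProvider.forceFlush();
}
```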
Metrics added in #143
To get proper monitoring of the autoscaler, the tool should
I would suggest using an open source library like OpenTelemetry for this.