aws / eks-pod-identity-agent

Apache License 2.0

Add metrics with, at least, success rate #13

Open abhishekmukherg opened 3 months ago

abhishekmukherg commented 3 months ago

Hi! We're interested in onboarding Pod Identity for our clusters. As we plan our installation, we feel a lack of observability into the agent, which may affect our ability to operate the system at scale. If I'm reading the code right, the only signals we can get as consumers of the agent are the /healthz and /readyz endpoints (both of which lead to the same probe).

Given the criticality of the system as we onboard it, it would be valuable for us to get one further level of detail. The best case, I think, would be the ability to get a success rate per running agent (since, if I understand the code, it's largely an HTTP service).

One thing we could implement would be a simple Prometheus/OpenMetrics endpoint that exposes just simple counts of 2xx/3xx/4xx/5xx responses (per the default Go Prometheus client). That would give us the lion's share of what we need out of the observability story. It could go deeper into other facets, but... baby steps ;). If we had some confidence that the base metrics could be integrated upstream, it's possible we could take on the work to implement it.

Alternatively, these metrics could go to CloudWatch or something similar, but that's more of a new area for me, so I don't know what that would look like.

abhishekmukherg commented 3 months ago

One open question that I don't have the opportunity to look up right this moment, but that may need to be solved, is whether the monitoring endpoints can be exposed on a wide enough interface/port to actually be scraped by Prometheus instances.

prateekgogia commented 2 months ago

> One open question that I don't have the opportunity to look up right this moment, but that may need to be solved, is whether the monitoring endpoints can be exposed on a wide enough interface/port to actually be scraped by Prometheus instances.

We should be able to scrape metrics from this agent through APIServer -> kubelet -> pod identity agent.

abhishekmukherg commented 2 months ago

Excellent, thank you for the response. We'll keep this ticket updated as we approach this. It's looking like we'll be able to pick up the work around the Sept-Oct timeframe.

pkruk commented 1 month ago

Hi :) I'm also missing this one :) I was wondering if exposing a Prometheus endpoint is OK for you? I prepared a WIP here :) If that direction is good for you, I could implement the rest of the logic :)