Hey @SleepyBrett, thank you for opening this. We recently identified and fixed this issue: the problem was a timeout that was too short. The fix (https://github.com/DataDog/integrations-core/pull/1399) will be embedded in the next version of the agent (6.2), which will be released within two weeks.
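In the meantime, if you need a workaround before 6.2, the check timeout can normally be raised in the check's own configuration. A minimal sketch, assuming the kubernetes_state check honors the `prometheus_timeout` option of our Prometheus-based checks (the option name and the service URL are assumptions to verify against your agent version):

```sh
# Hedged sketch: drop a custom kubernetes_state config into the agent's conf.d.
# prometheus_timeout and the kube_state_url below are assumptions, not
# confirmed settings for agent 6.1.x; adjust both to your cluster.
mkdir -p /etc/datadog-agent/conf.d/kubernetes_state.d
cat <<'EOF' > /etc/datadog-agent/conf.d/kubernetes_state.d/conf.yaml
init_config:

instances:
  - kube_state_url: http://kube-state-metrics.kube-system.svc:8080/metrics
    prometheus_timeout: 30  # seconds; assumed option name
EOF
```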
Two weeks seems like a long time to go without any kube state metrics.
We have a six-week release cycle for version 6 of the agent. I apologise if that causes any issues on your end.
As we are currently in the QA phase of the release, you can temporarily use the release candidate, `datadog/agent:6.2.0-rc.1`, which embeds this fix, or the even more recent `datadog/agent-dev:6-2-0-rc-2`.
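For example, if you deployed through the helm chart, you can pin the DaemonSet to the RC image (a sketch; `image.repository` and `image.tag` are the usual stable/datadog chart keys, but verify them against your chart version):

```sh
# Sketch: point the agent DaemonSet at the release candidate image.
# image.repository / image.tag are assumed stable/datadog chart values;
# "datadog" is a placeholder release name.
helm upgrade datadog stable/datadog \
  --set image.repository=datadog/agent \
  --set image.tag=6.2.0-rc.1
```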
I hope that can be a solution for you.
Respectfully, bug fixes aren't features. I shouldn't have to swallow untested pre-release features to get a bugfix.
Closing this as it's fixed in 6.2.0, which was released a few days ago.
Output of the info page (if this is a bug)
Describe what happened:
I run a moderately large cluster of 46 nodes (37 true workers, m4.10xl, roughly 2900 pods). I've installed your agent (6.1.4) as a DaemonSet using your stable helm chart. I removed all resource limits (assuming at first that the problem was CPU throttling), which got it scraping metrics some of the time, but I'm still seeing large numbers of timeout errors, suggesting we're on the edge.
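For reference, the install was along these lines (release name and values are illustrative, not the exact command I ran):

```sh
# Illustrative helm v2 install of the agent DaemonSet from the stable chart.
# datadog.apiKey is the chart's API key value; <API_KEY> is a placeholder.
helm install --name datadog stable/datadog \
  --set datadog.apiKey=<API_KEY>
```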
When I exec into another container and curl the endpoint, it responds within about a second. curl timing log:
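(The command below shows the shape of that timing check; it is illustrative rather than the exact invocation, and the actual log output is omitted here.)

```sh
# Illustrative: time each phase of a request to the KSM metrics endpoint.
# The URL is a placeholder for the in-cluster kube-state-metrics address.
curl -o /dev/null -s \
  -w 'lookup: %{time_namelookup}s  connect: %{time_connect}s  total: %{time_total}s\n' \
  http://kube-state-metrics.kube-system.svc:8080/metrics
```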
Describe what you expected: I expect kube-state-metrics (KSM) metrics to find their way to Datadog; on my largest cluster this has been problematic. I expect there may be an env variable or config parameter to tweak this timeout, but I can't find it in the documentation.
Steps to reproduce the issue:
Additional environment details (Operating System, Cloud provider, etc): Kubernetes 1.9.6 on AWS, approx. 2900 pods on m4.10xl nodes
full log from dd-agent on the same node as ksm: