Open · andor44 opened this issue 5 years ago
While searching through the issue tracker I bumped into https://github.com/DataDog/datadog-agent/issues/2020, which could be related. The odd thing is that the cluster where event collection works is actually the one with the higher event volume (though not necessarily the more stressed one), so I'd expect its events endpoint to be the busier and slower of the two.
That 100ms timeout seems awfully low though. Is there no way to override it without a config drop-in (e.g. via an env var)? 😞
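For what it's worth, if that timeout is exposed as a regular Agent config key and bound to an environment variable (I believe the key is `kubernetes_event_collection_timeout`, defaulting to 100ms, but both the name and the env binding are assumptions on my part), the usual `DD_<UPPER_CASED_KEY>` mapping would let it be raised without a drop-in, e.g. directly in the DaemonSet spec:

```yaml
# Sketch only: assumes the timeout is exposed as the config key
# kubernetes_event_collection_timeout and that the Agent binds it to the
# standard DD_-prefixed environment variable.
containers:
  - name: agent
    env:
      - name: DD_KUBERNETES_EVENT_COLLECTION_TIMEOUT
        value: "2000"  # assumed to be milliseconds (default 100)
```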
Hey @andor44 - Sorry to hear that you are having this issue.
We are working on improving Kubernetes event collection as one of our top-priority tasks. I am marking this issue so we can track it internally. We will keep you posted as soon as we make progress.
Best, .C
~I tried enabling the cluster agent. At first it also didn't collect events but after raising CPU limits event collection seems to work with the cluster agent. This leads me to believe the issue I'm experiencing is related to #2020 still.~
~I think it might be worth adding a section about the low request timeout for event collection to the readme and/or to the troubleshooting section.~
EDIT: never mind, scratch all that, I was too quick on the trigger. It exhibits the same behavior: event collection works for a few minutes and then it stops.
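For anyone who wants to reproduce the cluster-agent attempt anyway, it amounted to roughly the following in the helm values, plus the raised CPU limit (key names are my recollection of the stable/datadog chart at the time, so double-check them against the chart's README):

```yaml
# Rough sketch of the values tried above; key names are assumptions based on
# the stable/datadog chart of the era.
clusterAgent:
  enabled: true
  resources:
    requests:
      cpu: 200m
    limits:
      cpu: 500m   # raised from the chart default, which seemed to starve event collection
```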
Understood; as it uses the same code path, I would expect the same behaviour. Thanks for sharing, we will start addressing it this week.
Best, .C
Just chiming in that we're also experiencing this issue.
@mshade would you mind opening a ticket with our solutions team? We'd like to gather more details to confirm that. Thank you, and sorry for the headache.
Best, .C
Thanks @CharlyF -- we do have a ticket open, but I hadn't seen the github issue, so I just wanted to ping on it as well.
I have a similar problem with missing kubernetes events, although the agent does not seem to completely stop sending events, but rather just skips the bulk of them. I have opened case 225225.
In case anybody finds this helpful, here is what Datadog support had to say:
> We determined that there was a rate limit applied to Kubernetes events for your account back in 2019. Specifically, it was limiting 95% of the Kubernetes traffic. This was a standard practice at the time, due to excessively high traffic and billing concerns for customers. Back then, Kubernetes events were considered custom events, and therefore customers were charged for them. Now that Kubernetes events are no longer considered custom events, my colleagues will remove the limit.
...so if anybody else encounters this, open a support ticket and ask for this rate limit to be removed for your account.
Output of the info page (if this is a bug)
Describe what happened: Kubernetes event collection breaks shortly after an agent acquires the leader lock. If I kill the agent that holds the leader lock, another one acquires it and collects events for a short while (minutes), but then it also stops reporting k8s events. I can seemingly repeat this any number of times.
Weirdly enough we have another cluster where event collection seems to work fine.
You can see in the agent output above that it collected 377 events and then nothing on later runs.
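A quick way to re-check after the leader changes (the configmap name depends on the release and is an assumption here; `agent status` is the Agent 6 equivalent of the info page):

```sh
# See which agent pod currently holds the leader lock; the configmap name
# "datadog-leader-election" is an assumption and may differ per release.
kubectl get configmap datadog-leader-election -o yaml

# Then inspect event collection from the leader pod's status/info page.
kubectl exec -it <leader-agent-pod> -- agent status
```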
Describe what you expected: K8s event collection to work reliably
Steps to reproduce the issue: Datadog deployed with the official helm chart (values used: here) on k8s 1.12
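For completeness, the deployment itself was nothing special, roughly the following (Helm 2 syntax of the era; the release name and values file are placeholders):

```sh
# Helm 2 syntax; "datadog" and values.yaml are placeholders.
helm install --name datadog -f values.yaml stable/datadog
```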
Additional environment details (Operating System, Cloud provider, etc): CoreOS Container Linux (latest stable), on-premises