Sysdig - troubleshoot issue missing kube_ metrics

BCDevOps / developer-experience

This repository is used to track all work for the BCGov Platform Services Team (This includes work for: 1. Platform Experience, 2. Developer Experience 3. Platform Operations/OCP 3)

Apache License 2.0


Sysdig - troubleshoot issue missing kube_ metrics #3044

Closed ShellyXueHan closed 1 year ago

ShellyXueHan commented 2 years ago

Describe the issue

Since the Silver OCP upgrade started, Sysdig has not been getting any kube_ metrics. Investigate what happened.

What is the Value/Impact?

What is the plan? How will this get completed? What are the key components of this task?

Identify any dependencies Internal, external, who to contact in case of absence, person to refer to for further help.

Definition of done

[x] find root cause
[x] document
[x] prevent it from happening again

ShellyXueHan commented 2 years ago

Update:

The issue went away once the cluster upgrade was done. We couldn't find any specific log information about the issue.

Sysdig support thinks that the cluster upgrade operation may have put the promscrape endpoint into an unhealthy state which was interfering with the reporting of these metrics.

We are leaving the case open for another few days to see if the issue pops up again.

ShellyXueHan commented 2 years ago

Update:

The metrics are not missing, but they are still broken. We are still working with Sysdig support to get more information on this.

ShellyXueHan commented 2 years ago

Update:

Tested the latest version in klab, which seemed to fix the issue, but the upgrade didn't work in Silver. Still investigating!

ShellyXueHan commented 2 years ago

Update:

It looks like there is considerable slowness in getting the metrics over the network. We are still looking into more logs and system output at the moment.

ShellyXueHan commented 1 year ago

Update:

We are seeing a high number of syscalls from the sdn container. We will try the following to resolve it:

upgrade Sysdig to 12.11.0, which has a CPU utilization improvement - #3547
if that doesn't help, we will need to filter out some syscalls from the Sysdig agent pod

ShellyXueHan commented 1 year ago

Updates:

What we've found is that Sysdig drops a very high number of syscalls when the agent pods get very busy. This affects how metrics data is passed over to Sysdig cloud and results in some metrics going missing. To resolve the issue, we upgraded to Sysdig 12.12.0, which contains performance improvements. We are now seeing a much lower drop rate, and the sample ratio is around 1 again. Sysdig support has set up monitoring on the dropped syscalls, and we should be notified at an early stage if this happens again.

ShellyXueHan commented 1 year ago

Updates:

The dropped syscalls are currently sitting between 10,000 and 200,000 (average) for Silver. Note that while this may sound high, within this range it is possible that the drops are neither impacting the environment nor causing lost metrics.

The Sysdig support team investigated and confirmed that while there are still "some" dropped syscalls within this range, they did not see missing data or gaps in any graphs. This can be considered a baseline given how busy the Silver cluster is.

Alerts are set up for dropped syscalls above 1M, which is the level we were at previously when this was causing problems.
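For anyone reading the numbers above, here is a minimal sketch of how the baseline range and the alert level relate. It is only an illustration, not how Sysdig support configured the alert. The function name, the constant names, and the intermediate "watch" band between the baseline and the alert level are assumptions made for this example; only the 200,000 and 1M figures come from the comments above.

```python
# Illustrative only: names and the "watch" band are hypothetical.
# The numeric levels are the ones quoted in this issue for Silver.
BASELINE_HIGH = 200_000        # upper end of the observed baseline (average)
ALERT_THRESHOLD = 1_000_000    # alerts fire for dropped syscalls above 1M


def classify_dropped_syscalls(avg_dropped: int) -> str:
    """Classify an average dropped-syscall count against the levels above."""
    if avg_dropped < 0:
        raise ValueError("dropped syscall count cannot be negative")
    if avg_dropped > ALERT_THRESHOLD:
        # Level at which metrics were previously going missing.
        return "alert"
    if avg_dropped > BASELINE_HIGH:
        # Above the observed baseline but below the alert level (assumed band).
        return "watch"
    # At or below the upper end of the baseline: no gaps were seen in graphs.
    return "baseline"


if __name__ == "__main__":
    for sample in (10_000, 150_000, 500_000, 1_500_000):
        print(sample, classify_dropped_syscalls(sample))
```

Anything at or below 200,000 is treated as baseline here, including values under 10,000, since fewer drops are not a concern.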