bcgov / DITP-DevOps

Digital Identity and Trust Program Team's DevOps Documentation Repository
Apache License 2.0

Investigate missing data on sysdig dashboards #24

Closed: WadeBarnes closed this issue 1 year ago

WadeBarnes commented 1 year ago

The overwhelming majority of the downtime alerts we receive on our dts-sysdig-alerts channel in Rocket.Chat are due to a minute or so of missing data for the monitored containers.

Please investigate.

1) We need to understand and document why this occurs. I believe we have previously received an explanation for it.
2) Understand and document whether there is anything that can be done to resolve the issue. For example:

WadeBarnes commented 1 year ago

Based on the call with Dustin today, there is still work to be done on this ticket. Also, please document the findings that have been shared so far.

rajpalc7 commented 1 year ago

Based on the call with Dustin on Friday (Dec 9th, 2022), we found out that the downtime alert is not a good way to determine when a pod is down, because that alert produces results over 100% of the entire timeline instead of just a 3-minute window.

Dustin is going to help us set up a new alert that will fire when any pod is down for more than a minute.
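
To illustrate the idea behind the new alert, here is a rough Python sketch of the "down for more than a minute" rule. It does not use Sysdig or its API and is not the actual alert definition; the function name `sustained_gaps`, the sample timestamps, and the one-minute default are assumptions made up for illustration. It only shows how short gaps in reported data could be ignored while longer ones are flagged.

```python
from datetime import datetime, timedelta


def sustained_gaps(sample_times, threshold=timedelta(minutes=1)):
    """Return (last_seen, next_seen) pairs where the silence between two
    consecutive samples is longer than `threshold`.

    Hypothetical helper: `sample_times` stands in for the timestamps at
    which a container reported metrics.
    """
    ordered = sorted(sample_times)
    gaps = []
    for prev, curr in zip(ordered, ordered[1:]):
        # Brief gaps (at or under the threshold) are treated as missing data,
        # not downtime.
        if curr - prev > threshold:
            gaps.append((prev, curr))
    return gaps


# Made-up sample timestamps: one 30-second gap and one 90-second gap.
start = datetime(2022, 12, 9, 12, 0, 0)
offsets = [0, 10, 20, 50, 140, 150]
samples = [start + timedelta(seconds=s) for s in offsets]

for last_seen, next_seen in sustained_gaps(samples):
    print(f"no data from {last_seen} to {next_seen}")
```

With these made-up samples, only the 90-second gap is reported; the 30-second gap is skipped because it is under the threshold.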

rajpalc7 commented 1 year ago

All the unwanted Sysdig alerts are now disabled, and new alerts have been created in their place. We are no longer receiving the flood of downtime alerts we used to get on our dts-sysdig-alerts channel in Rocket.Chat.