bcgov / DITP-DevOps

Digital Identity and Trust Program Team's DevOps Documentation Repository
Apache License 2.0

Add Sysdig Container Alerts to all monitored namespace and all environments #48

Closed WadeBarnes closed 1 year ago

WadeBarnes commented 1 year ago

Using the example Alert Dustin created for us (https://app.sysdigcloud.com/#/alerts/rules?alertId=14505300&direction=asc&sortBy=name). Add similar alerts to all monitored namespaces and all environments.

We want to be alerted when the containers are in any of the following states for too long:

Also have a look at some of the other alerts in the Alerts > Library > Kubernetes section. Ones of interest are:

WadeBarnes commented 1 year ago

@rajpalc7, in the few namespaces that define some of these alerts, I'm noticing that in some cases the alerts are defined more than once for a given namespace, for example twice for the dev namespace. Is it possible to define the alerts once so they cover dev, test, and prod? This would be way easier to maintain.
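To illustrate the idea of one alert covering dev, test, and prod, here is a minimal sketch that builds a single scope expression from a project's license plate. The label name `kube_namespace_name` and the `in (...)` form are assumptions for illustration only; the helper names are hypothetical, and the exact scope syntax should be checked against what the Sysdig alert editor accepts.

```python
def namespaces_for(license_plate, environments=("dev", "test", "prod")):
    """Return the namespace names for one project, e.g. a99fd4-dev."""
    return [f"{license_plate}-{env}" for env in environments]


def build_scope(license_plate, label="kube_namespace_name"):
    """Build one 'label in (...)' scope string covering every environment.

    The label name and the 'in (...)' syntax are assumptions, not a
    confirmed Sysdig API; adjust them to match your Sysdig version.
    """
    quoted = ", ".join(f'"{ns}"' for ns in namespaces_for(license_plate))
    return f"{label} in ({quoted})"


if __name__ == "__main__":
    print(build_scope("a99fd4"))
```

With a scope like this, a single alert definition would apply to all three environments instead of being duplicated per namespace.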

esune commented 1 year ago

@rajpalc7 could you please add a comment with the current status for this issue, and what is preventing it from being completed? Thank you!

rajpalc7 commented 1 year ago

@esune - I am researching which alerts we need for our project, so that an alert is triggered and sent to our Sysdig channels whenever pods, nodes, or performance are affected. So far 6 alerts have been set up and tested successfully, and I am working on setting up two more. Once that's completed for a99fd4-teams, I will be able to clone these alerts for other projects and mark this ticket as completed.

rajpalc7 commented 1 year ago

Sorry @WadeBarnes and @esune for not giving you enough updates and context on this ticket. For the past few days I have been on calls with Sysdig support engineers, working 10:30-12am at night, and so far they have not figured out why Sysdig is not able to trigger two alerts: the container creating alert and the pod terminated error alert. They have told me they will need some time to reproduce this error in their lab and get back to me.

image

rajpalc7 commented 1 year ago

They also sent me an article, which was not relevant to us as our cluster is on 4.11. I have attached a few screenshots to explain the ongoing conversation with the Sysdig support engineers.

image

rajpalc7 commented 1 year ago

I also tried to discuss this issue with Shelly from the platform team, but she asked me to contact Dustin (Sysdig). Dustin opened a support ticket and referred me to engineers working in India and Singapore. Currently I am also working at night and coordinating with Singapore engineer Giri and Thomas to resolve this issue ASAP. I was hoping this issue would be resolved this week, but unfortunately it keeps dragging on.

WadeBarnes commented 1 year ago

Thanks for the update Raj, this helps a LOT. It's good to keep your progress up to date on the tickets so everyone is aware of what's going on.

rajpalc7 commented 1 year ago

Sounds good Wade, I will try to keep our tickets up to date regularly.

rajpalc7 commented 1 year ago

Followed up with Sysdig Support today, waiting for their response

esune commented 1 year ago

Blocked on waiting for a response from Sysdig to resolve the issue

rajpalc7 commented 1 year ago

Chatted with the Sysdig engineer yesterday; the issue is fixed in prod now. I will do some more testing on this in our environment today.

rajpalc7 commented 1 year ago

All the required alerts have been set up successfully now.

WadeBarnes commented 1 year ago

@rajpalc7, Moving this back to in progress. Following the platform updates we have several pods that were left in a non-running state.

For example:
a99fd4-dev - backup-postgres-18-scp28 - CreateContainerConfigError
a99fd4-test - backup-postgres-12-hd2pj - CreateContainerConfigError
a99fd4-prod - backup-postgres-26-6qsct - CreateContainerConfigError

Looking at the SysDig alerts for a99fd4 it appears the alerts may already be defined (not confirmed), but they are disabled, so we won't get any notifications when they are triggered.

image

Please review all alerts in all environments and ensure the necessary alerts are defined, enabled, and have notifications configured.
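As a cross-check that does not depend on Sysdig, pods stuck in a waiting state can be listed from the JSON that `oc get pods -o json` (or `kubectl get pods -o json`) prints. The following is a minimal sketch: the set of "stuck" reasons is an illustrative assumption, the function name is hypothetical, and the sample data is hand-written in the shape of a pod list rather than taken from a real cluster.

```python
# Waiting reasons treated as "stuck". This list is an assumption for
# illustration and is not exhaustive.
STUCK_REASONS = {
    "CreateContainerConfigError",
    "CrashLoopBackOff",
    "ImagePullBackOff",
    "ErrImagePull",
}


def stuck_containers(pod_list):
    """Return (namespace, pod, reason) for each container waiting on a stuck reason."""
    found = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for cs in statuses:
            waiting = cs.get("state", {}).get("waiting")
            if waiting and waiting.get("reason") in STUCK_REASONS:
                found.append((meta.get("namespace"), meta.get("name"), waiting["reason"]))
    return found


# Hand-written sample in the shape of a pod list; not real cluster output.
SAMPLE = {
    "items": [
        {
            "metadata": {"namespace": "a99fd4-dev", "name": "backup-postgres-18-scp28"},
            "status": {"containerStatuses": [
                {"state": {"waiting": {"reason": "CreateContainerConfigError"}}}
            ]},
        },
        {
            "metadata": {"namespace": "a99fd4-dev", "name": "healthy-pod"},
            "status": {"containerStatuses": [
                {"state": {"running": {"startedAt": "2023-01-01T00:00:00Z"}}}
            ]},
        },
    ]
}

if __name__ == "__main__":
    for namespace, pod, reason in stuck_containers(SAMPLE):
        print(namespace, pod, reason)
```

In practice the input would come from something like `json.load(sys.stdin)` fed by the `oc` command, run once per namespace.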

WadeBarnes commented 1 year ago

There are NO alerts defined for the Trust Over IP environments e79518. We had the same issue there, pods were stuck in a non-running state.

image

WadeBarnes commented 1 year ago

No container alerts are defined for the Shared Service environment 4a9599.

image

WadeBarnes commented 1 year ago

There are NO alerts defined for the Monitoring Services environment ca7f8f: image

rajpalc7 commented 1 year ago

Sounds good Wade. I thought you might test the alerts in the a99fd4 namespace, and if you are happy with them I can add all those alerts to the other projects.

Those alerts were disabled because they were creating too much noise after testing. I will enable them now.

rajpalc7 commented 1 year ago

I have added alerts to all the projects now. It looks like they are already triggering a few alerts.

WadeBarnes commented 1 year ago

Thanks @rajpalc7. Some of the alerts were triggering because the wait time (the duration the condition must hold before the alert fires) was set to zero, which is too low and causes false positives. I've adjusted the wait time and will monitor the alerts.
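To show why a zero wait time produces false positives, here is a small sketch of duration-based alerting. It is an illustration of the general idea only, not how Sysdig evaluates alerts internally; the function name, the sample data, and the 300-second threshold are all hypothetical.

```python
def should_fire(samples, wait_seconds):
    """Return True if the bad state held continuously for at least wait_seconds.

    samples is a time-ordered list of (timestamp_seconds, is_bad) tuples.
    """
    bad_since = None
    for timestamp, is_bad in samples:
        if not is_bad:
            bad_since = None  # condition cleared, reset the timer
            continue
        if bad_since is None:
            bad_since = timestamp  # start of a new bad streak
        if timestamp - bad_since >= wait_seconds:
            return True
    return False


# A pod that is briefly not running during a normal restart.
brief_blip = [(0, False), (60, True), (120, False), (180, False)]

# A pod that stays stuck for ten minutes.
stuck = [(t, True) for t in range(0, 601, 60)]

if __name__ == "__main__":
    print(should_fire(brief_blip, 0))    # fires on the single bad sample
    print(should_fire(brief_blip, 300))  # ignores the short blip
    print(should_fire(stuck, 300))       # still fires for a real problem
```

With a wait time of zero, any single bad sample fires the alert, including short-lived states during a routine restart. A non-zero wait filters those out while still catching pods that remain stuck.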

WadeBarnes commented 1 year ago

Calling this done

rajpalc7 commented 1 year ago

Thanks for adjusting the wait time Wade.