Closed: WadeBarnes closed this issue 1 year ago
@rajpalc7, In the few namespaces that define some of these alerts, I'm noticing that in some cases the alerts are defined more than once for a given namespace, for example twice for the dev namespace. Is it possible to define the alerts once so they cover dev, test, and prod? This would be way easier to maintain.
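For reference, defining one alert that spans all three environments would mean widening the alert's scope instead of duplicating the rule per namespace. A minimal sketch in Python of building such a definition (the field names and scope syntax here are illustrative, not the confirmed Sysdig API schema):

```python
# Sketch: one alert definition whose scope spans the dev, test, and prod
# namespaces of a project, instead of three duplicated alerts.
# Field names and the scope expression are illustrative assumptions,
# not the exact Sysdig alert schema.

def alert_for_project(project: str) -> dict:
    """Build a single alert covering the -dev, -test, and -prod namespaces."""
    namespaces = [f"{project}-{env}" for env in ("dev", "test", "prod")]
    scope = 'kubernetes.namespace.name in ("%s")' % '","'.join(namespaces)
    return {
        "name": f"{project}: pod not running",
        "scope": scope,
        "enabled": True,
    }

alert = alert_for_project("a99fd4")
print(alert["scope"])
# kubernetes.namespace.name in ("a99fd4-dev","a99fd4-test","a99fd4-prod")
```

With a scope like this, a change to the alert only has to be made once per project rather than once per environment.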
@rajpalc7 could you please add a comment with the current status for this issue, and what is preventing it from being completed? Thank you!
@esune - I am researching the alerts we require for our project, so that whenever pods, nodes, or performance are affected we get an alert in the Sysdig channels. So far 6 alerts have been set up and tested successfully, and I am working on setting up two more. Once that's completed for a99fd4-teams, I will be able to clone these alerts for the other projects and mark this ticket as completed.
Sorry @WadeBarnes and @esune for not giving you enough updates and context on this ticket. For the past few days I have been on calls with Sysdig support engineers, working 10:30 pm to 12 am at night, and so far they have not figured out why Sysdig is unable to trigger two alerts: the container creating alert and the pod terminated error. They have told me they will need some time to reproduce this error in their lab and get back to me.
They also sent me an article, which was not relevant to us since our cluster is on 4.11. I have attached a few screenshots to explain the ongoing conversation with the Sysdig support engineers.
I have also tried to discuss this issue with Shelly from the platform team, but she asked me to contact Dustin (Sysdig engineer). Dustin has opened a support ticket and referred me to engineers working in India and Singapore. I am currently working nights as well, coordinating with the Singapore engineers Giri and Thomas to resolve this issue ASAP. I was hoping the issue would be resolved this week, but unfortunately it keeps dragging on.
Thanks for the update Raj, this helps a LOT. It's good to keep your progress up to date on the tickets so everyone is aware of what's going on.
Sounds good Wade, I will try to keep our tickets up to date regularly.
Followed up with Sysdig Support today, waiting for their response
Blocked on waiting for a response from Sysdig to resolve the issue
Chatted with a Sysdig engineer yesterday; the issue is fixed in prod now. I will do some more testing in our environment today.
All the required alerts have been set-up successfully now.
@rajpalc7, Moving this back to in progress. Following the platform updates we have several pods that were left in a non-running state.
For example:
- a99fd4-dev - backup-postgres-18-scp28 - CreateContainerConfigError
- a99fd4-test - backup-postgres-12-hd2pj - CreateContainerConfigError
- a99fd4-prod - backup-postgres-26-6qsct - CreateContainerConfigError
Looking at the SysDig alerts for a99fd4, it appears the alerts may already be defined (not confirmed), but they are disabled, so we won't get any notifications when they are triggered.
Please review all alerts in all environments and ensure the necessary alerts are defined, enabled, and have notifications configured.
There are NO alerts defined for the Trust Over IP environments e79518. We had the same issue there; pods were stuck in a non-running state.
No container alerts defined in the Shared Service environment 4a9599.
There are NO alerts defined for the Monitoring Services environment ca7f8f:
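The condition these alerts need to catch is pods sitting in a non-running container state. A minimal sketch of that check in Python, using the same shape as the `containerStatuses` block from `oc get pods -o json` (the sample data below is illustrative):

```python
# Sketch: flag pods stuck in a bad container waiting state, mirroring the
# condition the Sysdig alerts should detect. The input mimics the
# containerStatuses structure from `oc get pods -o json`; the sample pods
# are illustrative, not real cluster data.

BAD_REASONS = {"CreateContainerConfigError", "CreateContainerError",
               "ImagePullBackOff", "CrashLoopBackOff"}

def stuck_pods(pods: list[dict]) -> list[tuple[str, str]]:
    """Return (pod name, waiting reason) for pods in a bad waiting state."""
    found = []
    for pod in pods:
        for cs in pod.get("containerStatuses", []):
            reason = cs.get("state", {}).get("waiting", {}).get("reason")
            if reason in BAD_REASONS:
                found.append((pod["name"], reason))
    return found

pods = [
    {"name": "backup-postgres-18-scp28",
     "containerStatuses": [{"state": {"waiting":
         {"reason": "CreateContainerConfigError"}}}]},
    {"name": "healthy-pod",
     "containerStatuses": [{"state": {"running": {}}}]},
]
print(stuck_pods(pods))
# [('backup-postgres-18-scp28', 'CreateContainerConfigError')]
```

The same set of waiting reasons should appear in each environment's alert so that dev, test, and prod all report stuck pods consistently.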
Sounds good Wade. I thought you might test the alerts in the a99fd4 namespace, and if you are happy with them I can add all those alerts to the other projects.
Those alerts were disabled because they were creating too much noise after testing. I will enable them now.
I have added alerts to all the projects now. It looks like that is already triggering a few alerts.
Thanks @rajpalc7. Some of the alerts were triggering because the wait time (the "for the duration of" time) was set to zero, which is too low and causes false positives. I've adjusted the wait time and will monitor the alerts.
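The effect of the wait time can be sketched as a simple debounce: the alert should only fire once the condition has been continuously true for the configured duration, so short transient blips don't page anyone. A minimal illustration in Python (the class and timings are illustrative, not Sysdig's implementation):

```python
# Sketch: an alert with a non-zero wait time fires only after the condition
# has persisted for that long; with wait_time=0 every transient blip fires,
# producing false positives. This is an illustrative model, not Sysdig code.

class DebouncedAlert:
    """Fire only after the condition has been continuously true for wait_time seconds."""

    def __init__(self, wait_time: float):
        self.wait_time = wait_time
        self.since = None  # when the condition first became true

    def update(self, condition: bool, now: float) -> bool:
        if not condition:
            self.since = None  # condition cleared; reset the timer
            return False
        if self.since is None:
            self.since = now
        return (now - self.since) >= self.wait_time

alert = DebouncedAlert(wait_time=600)       # 10-minute wait time
print(alert.update(True, now=0))            # False: condition just started
print(alert.update(True, now=300))          # False: only 5 minutes so far
print(alert.update(True, now=600))          # True: persisted 10 minutes
print(alert.update(False, now=700))         # False: condition cleared
```

With `wait_time=0`, the first `update(True, ...)` call would return True immediately, which is exactly the noisy behaviour described above.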
Calling this done.
Thanks for adjusting the wait time Wade.
Using the example Alert Dustin created for us (https://app.sysdigcloud.com/#/alerts/rules?alertId=14505300&direction=asc&sortBy=name). Add similar alerts to all monitored namespaces and all environments.
We want to be alerted when the containers are in any of the following states for too long:
Also have a look at some of the other alerts in the Alerts > Library > Kubernetes section. Alerts of interest include: