yaelberger-commits closed 10 months ago
Hey team! Please add your planning poker estimate with ZenHub @andrewleith @jimleroyer @jzbahrai @sastels
Duplicate of #272 so closing 272
@patheard Can you let us know if this is covered by the logs we send to CCCS or if we still need to do more to tackle this? Thanks
My understanding of this issue is that it would be more for our internal detection of issues, giving us a chance to fine-tune the alerts to what we're interested in catching. Off the top of my head, it would be things like:
Happy to chat more and brainstorm on what abnormal behaviour looks like and how we could start detecting it.
+1 to what Pat mentioned. I'd start with those and then update your user story template to include a line related to:
Added the two bullet points above to the user story template, as per Mohamed's suggestion.
Description
1- Assess what we currently have in the logs and have the SRE team write queries on their end to trigger certain alarms.
2- Identify gaps in what we'd like to have alarms on, implement them on the Notify side, and then have the SRE team create new alarms.
Starting with a soft step (i.e. step 1) would better align the two teams on the overall technical and business requirements. For example, it's not clear what format they expect, or whether we'd need to massage existing logs (and which new ones to create). Can they also take structured events, on top of unstructured data such as logs?
First, talk to SRE, trigger alarms, and work our way through that. Do not follow the Ideas section until after!
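To make the structured vs. unstructured question concrete when talking to SRE, here is a minimal sketch of the same event in both shapes. The field names (`event`, `user_id`, `timestamp`) and the `admin_login` action are hypothetical, not Notify's actual log schema or anything SRE has asked for.

```python
import json
import logging
from datetime import datetime, timezone
from typing import Optional


def unstructured_line(user_id: str, action: str) -> str:
    """Free-text log line: readable, but querying it needs pattern matching."""
    return f"User {user_id} performed {action}"


def structured_event(user_id: str, action: str, now: Optional[datetime] = None) -> str:
    """The same information as a JSON event with named fields (hypothetical schema)."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return json.dumps(
        {"event": action, "user_id": user_id, "timestamp": timestamp},
        sort_keys=True,
    )


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(message)s")
    logging.info(unstructured_line("1234", "admin_login"))
    logging.info(structured_event("1234", "admin_login"))
```

With the structured form, a query can filter on a field such as `event` directly instead of parsing message text, which is one thing worth confirming with SRE before we change any logs.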
Ideas
Anomaly detection of metrics:
- Admin usage
- Other use cases that Pat mentioned
Pipe to CloudWatch and then see what works.
Can this detect the difference between our services?
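One way to think about the per-service question: keep a separate baseline for each service, so a value that is normal for a high-volume service can still be flagged for a low-volume one. The sketch below uses a plain z-score check as a stand-in; it is not how CloudWatch does anomaly detection, and the service names, numbers, and threshold of 3 are made up for illustration.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, z_threshold=3.0):
    """Flag `latest` if it is more than `z_threshold` standard deviations from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu  # flat baseline: any change is unusual
    return abs(latest - mu) / sigma > z_threshold


def anomalous_services(history_by_service, latest_by_service, z_threshold=3.0):
    """Return the sorted names of services whose latest value deviates from their own baseline."""
    return sorted(
        service
        for service, latest in latest_by_service.items()
        if is_anomalous(history_by_service.get(service, []), latest, z_threshold)
    )


if __name__ == "__main__":
    # Hypothetical hourly send counts per service.
    history = {"service-a": [100, 110, 90, 105, 95], "service-b": [5, 6, 4, 5, 5]}
    latest = {"service-a": 108, "service-b": 60}
    print(anomalous_services(history, latest))
```

Here `service-b` is flagged because 60 is far outside its own history, while `service-a` at 108 is not.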
Brainstorm ideas for metric detection:
What are the limits? If we trigger alarms at over 80%, what is that 80% of?
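The "80% of what" question matters because a percentage threshold only means something once the limit is pinned down. A rough sketch, assuming a hypothetical per-service limit (for example a daily sending limit) as the denominator:

```python
def utilization(current, limit):
    """Fraction of the limit currently used, e.g. 850 of 1000 gives 0.85."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return current / limit


def should_alarm(current, limit, threshold=0.80):
    """True when usage is strictly above `threshold` (80% by default) of the limit."""
    return utilization(current, limit) > threshold


if __name__ == "__main__":
    # Same absolute usage, different limits (both limits are made-up numbers).
    print(should_alarm(850, 1000))
    print(should_alarm(850, 10000))
```

The same 850 sends trips the alarm against a limit of 1000 but not against 10000, so each metric we pick needs its limit agreed on before the threshold is set.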
Acceptance Criteria (Definition of done)
Pick one metric and do the below:
QA Steps