Produce AWS security findings of abnormal behaviour and pipe them to the AWS Landing Zone security hub for the SRE team to detect and respond to

yaelberger-commits commented 2 years ago

Description

1. Assess what we currently have in the logs and have the SRE team write queries on their end to trigger certain alarms.
2. Identify gaps in what we'd like to have alarms on, implement them on the Notify side, and then have the SRE team create new alarms.

Starting with a soft step (i.e. step 1) would better align the two teams on the overall technical and business requirements. For example, it's not clear what format they expect, or whether we'd need to massage existing logs (and which new ones to create). Can they also take structured events, on top of unstructured data such as logs?

First, talk to SRE and trigger alarms and work our way through that. Do not follow the ideas section until after!

Ideas

Anomaly detection of metrics
- Admin usage
- Other use cases that Pat mentioned

Pipe to CloudWatch and then see what works

Can this detect the difference between our services?
- If service changes its sending pattern
- Dimensions in metrics
Brainstorm ideas for metric detection:
- API usage
- Service limits?
- Failed login attempts (# of times, per user name)
- Number of times the user tries to reset their password
- Validation errors - the wrong API key being used? Maybe send it to Sentinel
What are the limits? If we trigger alarms at over 80%, what is that 80% measured against?
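To make the "failed login attempts per user name" idea and the 80% question concrete, here is a minimal sketch of a sliding-window counter that warns at a fraction of a limit and alarms at the limit itself. The class name, the limit of 5, the 10-minute window and the 0.8 ratio are all hypothetical placeholders; the thread has not settled on any of them.

```python
from collections import defaultdict, deque

# Hypothetical values for illustration only; real limits are still to be defined.
MAX_FAILED_LOGINS = 5   # failures allowed per user within the window
WINDOW_SECONDS = 600    # sliding window length (10 minutes)
WARN_RATIO = 0.8        # warn once 80% of the limit is reached


class FailedLoginTracker:
    """Counts failed logins per user name inside a sliding time window."""

    def __init__(self, limit=MAX_FAILED_LOGINS, window=WINDOW_SECONDS,
                 warn_ratio=WARN_RATIO):
        self.limit = limit
        self.window = window
        self.warn_ratio = warn_ratio
        self._events = defaultdict(deque)  # user name -> failure timestamps

    def record_failure(self, username, timestamp):
        """Record one failed login and return 'ok', 'warn' or 'alarm'."""
        events = self._events[username]
        events.append(timestamp)
        # Drop failures that have fallen out of the sliding window.
        while events and events[0] <= timestamp - self.window:
            events.popleft()
        count = len(events)
        if count >= self.limit:
            return "alarm"
        if count >= self.warn_ratio * self.limit:
            return "warn"
        return "ok"
```

With this shape, "80%" means 80% of the per-user limit inside the window, which is only one possible reading of the question above. The same pattern would apply to password reset attempts.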

Acceptance Criteria (Definition of done)

Pick one metric and do the below:

Define metrics for the above use cases
Define the limit per metric
Send metric results to AWS Landing Zone
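As a rough sketch of the last step, assuming the Landing Zone destination is AWS Security Hub and that it accepts custom findings in the AWS Security Finding Format (ASFF), a metric breach could be shaped like this. The function name, the `notify/...` identifiers, the severity and the finding type are hypothetical choices, and the format the SRE team actually expects still needs to be confirmed with them.

```python
from datetime import datetime, timezone


def build_finding(metric_name, value, limit, account_id, region, now=None):
    """Build an ASFF-style finding dict for a metric that exceeded its limit."""
    now = now or datetime.now(timezone.utc)
    timestamp = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "SchemaVersion": "2018-10-08",
        "Id": f"notify/{metric_name}/{timestamp}",  # hypothetical ID scheme
        # Default product ARN used for custom findings in one's own account.
        "ProductArn": (
            f"arn:aws:securityhub:{region}:{account_id}:"
            f"product/{account_id}/default"
        ),
        "GeneratorId": f"notify-{metric_name}",  # hypothetical generator name
        "AwsAccountId": account_id,
        "Types": ["Unusual Behaviors/Application"],
        "CreatedAt": timestamp,
        "UpdatedAt": timestamp,
        "Severity": {"Label": "MEDIUM"},  # placeholder severity
        "Title": f"Notify metric '{metric_name}' exceeded its limit",
        "Description": f"Observed {value}, limit is {limit}.",
        "Resources": [{"Type": "Other", "Id": f"notify/{metric_name}"}],
    }


# Sending is left as a comment because it needs AWS credentials:
#   import boto3
#   boto3.client("securityhub").batch_import_findings(Findings=[finding])
```

Keeping the finding builder separate from the send call means the payload can be reviewed with the SRE team before anything is wired up.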

QA Steps

[ ] Tested in a realistic production scenario

yaelberger-commits commented 2 years ago

Hey team! Please add your planning poker estimate with ZenHub @andrewleith @jimleroyer @jzbahrai @sastels

yaelberger-commits commented 2 years ago

Duplicate of #272 so closing 272

yaelberger-commits commented 2 years ago

@patheard Can you let us know if this is covered by the logs we send to CCCS or if we still need to do more to tackle this? Thanks

patheard commented 2 years ago

My understanding of this issue is that it would be more for our internal detection of issues, giving us a chance to fine tune the alerts on what we're interested in catching. Off the top of my head, it would be things like:

the Notify Admin API key suddenly becoming very active;
a large number of new services or user accounts being created; or
a service that starts deviating from its normal send pattern and sending in the middle of the night or to a much larger distribution list.
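One simple way to flag a service deviating from its normal send pattern is a z-score against that service's own history for the same hour of day. This is only a sketch of that idea, not a proposal for the final detector; the function name and the threshold of 3 standard deviations are assumptions.

```python
from statistics import mean, stdev


def is_anomalous(history, current, threshold=3.0):
    """Return True if `current` deviates strongly from past values.

    `history` holds past send counts for the same service and the same
    hour of day; `threshold` is the number of standard deviations allowed.
    """
    if len(history) < 2:
        return False  # not enough data to judge
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat history (e.g. never sends at 3 a.m.):
        # any change at all is treated as a deviation.
        return current != mu
    return abs(current - mu) / sigma > threshold


# A service that normally sends about 100 messages in this hour.
usual = [100, 110, 90, 105, 95]
print(is_anomalous(usual, 104))     # False
print(is_anomalous(usual, 5000))    # True
# A service that has never sent in the middle of the night.
print(is_anomalous([0, 0, 0], 50))  # True
```

Comparing per service and per hour of day covers both examples above: a sudden night-time send and a much larger distribution list each show up as a count far outside that service's usual range.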

Happy to chat more and brainstorm on what abnormal behaviour looks like and how we could start detecting it.

mohdnr commented 2 years ago

+1 to what Pat mentioned. I'd start with those and then update your user story template to include a line related to:

Generate appropriate log messages so that executions of this feature can be tracked
Can misuse of this feature cause harm? If yes, create an alert
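For the first line, a structured (JSON) log message is one option that would also make it easier for queries and alarms to pick events out later. A minimal sketch using the standard library; the logger name, the `audit_event` helper and the field names are hypothetical and not an existing Notify convention.

```python
import json
import logging

logger = logging.getLogger("notify.audit")  # hypothetical logger name


def audit_event(action, actor, **details):
    """Log one feature execution as a single JSON line and return that line."""
    event = {
        "event_type": "audit",
        "action": action,    # what was done, e.g. "api_key_created"
        "actor": actor,      # who did it, e.g. a user or service ID
        "details": details,  # any extra context for the alert query
    }
    line = json.dumps(event, sort_keys=True)
    logger.info(line)
    return line


# Example: track the creation of an API key for a service.
audit_event("api_key_created", "user-123", service_id="svc-1")
```

One JSON object per line keeps the log queryable by field (action, actor) without having to parse free text.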

yaelberger-commits commented 2 years ago

Added the two bullet points above to the user story template, as per Mohamed's suggestion.
