ashishrajsrivastava / az-400-shared

We will share AZ-400 training session content here
MIT License

Develop a Site Reliability Engineering (SRE) strategy (5-10%) #2

Open ashishrajsrivastava opened 4 years ago

ashishrajsrivastava commented 4 years ago

Develop an actionable alerting strategy

• identify and recommend metrics on which to base alerts
• implement alerts using appropriate metrics
• implement alerts based on appropriate log messages
• implement alerts based on application health checks
• analyze combinations of metrics
• develop communication mechanism to notify users of degraded systems
• implement alerts for self-healing activities (e.g. scaling, failovers)
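
As an illustration of "implement alerts using appropriate metrics" and "analyze combinations of metrics", here is a minimal Python sketch. It assumes metric samples arrive as plain lists of numbers; the names `AlertRule`, `should_alert`, and `should_alert_combined` are hypothetical and not part of any Azure API. The idea is to alert only on a sustained breach, and only when two signals agree, which tends to make alerts more actionable than a single-sample threshold.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class AlertRule:
    """Hypothetical rule: fire when a metric stays above a threshold."""
    threshold: float      # e.g. 0.05 for a 5% error rate
    min_consecutive: int  # number of consecutive breaching samples required


def should_alert(samples: Sequence[float], rule: AlertRule) -> bool:
    """Return True if the most recent `min_consecutive` samples all exceed the threshold."""
    if rule.min_consecutive <= 0 or len(samples) < rule.min_consecutive:
        return False
    recent = samples[-rule.min_consecutive:]
    return all(value > rule.threshold for value in recent)


def should_alert_combined(
    error_rates: Sequence[float],
    latencies_ms: Sequence[float],
    error_rule: AlertRule,
    latency_rule: AlertRule,
) -> bool:
    """Combine two metrics: alert only when both are in sustained breach."""
    return should_alert(error_rates, error_rule) and should_alert(latencies_ms, latency_rule)
```

For example, `should_alert([0.01, 0.06, 0.07, 0.08], AlertRule(0.05, 3))` fires because the last three samples all exceed 5%, while a single spike followed by recovery does not.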

Design a failure prediction strategy

• analyze behavior of system with regards to load and failure conditions
• calculate when a system will fail under various conditions
• measure baseline metrics for system
• recommend the appropriate tools for a failure prediction strategy
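
One simple way to approach "calculate when a system will fail" is to fit a straight line to baseline measurements and extrapolate to a limit. The sketch below assumes a resource (for example disk usage in percent) grows roughly linearly over time; real systems often do not, so treat this as a first approximation. The function names `linear_fit` and `hours_until_limit` are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def linear_fit(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def hours_until_limit(
    hours: Sequence[float],
    usage_pct: Sequence[float],
    limit_pct: float = 100.0,
) -> Optional[float]:
    """Estimate hours from the last sample until usage reaches limit_pct.

    Returns None when the trend is flat or decreasing (no predicted failure).
    """
    slope, intercept = linear_fit(hours, usage_pct)
    if slope <= 0:
        return None
    crossing_time = (limit_pct - intercept) / slope
    return max(0.0, crossing_time - hours[-1])
```

With samples of 40%, 45%, 50%, 55% taken at hours 0 to 3, the fitted growth is 5 percentage points per hour, so the estimate is 9 hours until 100%.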

Design and implement a health check

• analyze system dependencies to determine which dependency should be included in health check
• calculate healthy response timeouts based on SLO for the service
• design approach for partial health situations
• integrate health check with compute environment
• implement different types of health checks (liveness, startup, shutdown)
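
The dependency, timeout, and partial-health points above can be sketched together in Python. This is an illustrative sketch, not a reference implementation: the `Health` states, the `timeout_from_slo` heuristic (give each probe a fraction of the SLO latency), and the `check_health` function are all assumptions. Note that the code measures how long each probe took after it returns; it does not interrupt a probe that hangs.

```python
import time
from enum import Enum
from typing import Callable, Dict, Set


class Health(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"    # partial health: only non-critical dependencies failing
    UNHEALTHY = "unhealthy"  # at least one critical dependency failing


def timeout_from_slo(slo_latency_s: float, fraction: float = 0.5) -> float:
    """Assumed heuristic: allow a probe a fraction of the SLO latency budget."""
    return slo_latency_s * fraction


def check_health(
    checks: Dict[str, Callable[[], bool]],
    critical: Set[str],
    timeout_s: float,
) -> Health:
    """Run each dependency probe and aggregate the results into one status."""
    status = Health.HEALTHY
    for name, probe in checks.items():
        started = time.monotonic()
        try:
            ok = probe()
        except Exception:
            ok = False  # a probe that raises counts as a failure
        elapsed = time.monotonic() - started
        if ok and elapsed <= timeout_s:
            continue
        if name in critical:
            return Health.UNHEALTHY
        status = Health.DEGRADED
    return status
```

A possible usage: `check_health({"db": db_probe, "cache": cache_probe}, critical={"db"}, timeout_s=timeout_from_slo(2.0))`, where `db_probe` and `cache_probe` are hypothetical callables. A compute environment that polls an HTTP endpoint could then map `UNHEALTHY` to a failing status code and decide separately how to treat `DEGRADED`.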

ashishrajsrivastava commented 4 years ago

@TheAzureGuy007 Please assign a label to this issue indicating how many days it will take to cover these topics. One day's session will be 3 hours, so if you think the content can be covered in 6 hours, assign a 2days label, and so on. Feel free to create labels according to your time expectation for the module.