Open ashishrajsrivastava opened 4 years ago
@TheAzureGuy007 Please assign a label to this issue indicating how many days it will take to cover these issues. One day session will be 3hours. So if think the content is covered in 6 hours then you will assign 2days label and so on. Feel free to create labels according to your time expectation for the module.
Develop an actionable alerting strategy
• identify and recommend metrics on which to base alerts • implement alerts using appropriate metrics • implement alerts based on appropriate log messages • implement alerts based on application health checks • analyze combinations of metrics • develop communication mechanism to notify users of degraded systems • implement alerts for self-healing activities (e.g. scaling, failovers)
Design a failure prediction strategy
• analyze behavior of system with regards to load and failure conditions • calculate when a system will fail under various conditions • measure baseline metrics for system • recommend the appropriate tools for a failure prediction strategy
Design and implement a health check
• analyze system dependencies to determine which dependency should be included in health check • calculate healthy response timeouts based on SLO for the service • design approach for partial health situations • integrate health check with compute environment • implement different types of health checks (liveness, startup, shutdown)