freeCodeCamp / open-api

freeCodeCamp's open-api Initiative
BSD 3-Clause "New" or "Revised" License

Set up SNS notifications for critical alarms #109

Open ojongerius opened 6 years ago

ojongerius commented 6 years ago

Definition of done: critical alerts create a phone call to team members.

This should be possible by having critical alarms fire to separate SNS topics that have a Twilio webhook as a subscriber.
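A minimal sketch of that wiring, assuming Python with boto3. The topic name, webhook URL, alarm name, and function name are placeholders rather than values from this project, and the Lambda `Errors` metric is just one example of a critical signal. An HTTPS subscriber has to confirm the subscription before SNS delivers to it, and whether a Twilio webhook can do that directly is an assumption that would need checking.

```python
# Sketch only: all names, URLs, and alarm settings here are placeholders.

def critical_topic_subscription(topic_arn, webhook_url):
    """Arguments for SNS `subscribe`: deliver alarm messages to an HTTPS endpoint."""
    if not webhook_url.startswith("https://"):
        raise ValueError("expected an https:// endpoint")
    return {"TopicArn": topic_arn, "Protocol": "https", "Endpoint": webhook_url}


def critical_alarm(name, function_name, topic_arn):
    """Arguments for CloudWatch `put_metric_alarm`: fire on any Lambda error."""
    return {
        "AlarmName": name,
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [topic_arn],  # critical alarms notify their own topic
    }


def apply(topic_name, webhook_url, alarm_name, function_name):
    """Create the topic, subscription, and alarm. Needs boto3 and AWS credentials."""
    import boto3  # imported here so the helpers above work without it

    sns = boto3.client("sns")
    cloudwatch = boto3.client("cloudwatch")
    topic_arn = sns.create_topic(Name=topic_name)["TopicArn"]
    sns.subscribe(**critical_topic_subscription(topic_arn, webhook_url))
    cloudwatch.put_metric_alarm(**critical_alarm(alarm_name, function_name, topic_arn))
    return topic_arn
```

Keeping critical alarms on their own topic means the phone-call subscriber never sees lower-severity notifications.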

I've seen people create Lambdas that connect to Twilio when alarms fire, but that kind of defeats the purpose: we want to know when Lambdas are on 🔥

Warning: this will be less sophisticated than services like PagerDuty, VictorOps, etc. Schedules and escalations are well out of scope for this issue.

/cc @freeCodeCamp/open-api did I miss anything? Any concerns? Is this a blocker for our first release?

QuincyLarson commented 6 years ago

@ojongerius I just set up UptimeRobot which has SMS notifications without the need for Twilio. It polls all our services once a minute and if any of them are down, it will email us and also it can send an SMS notification. It's easy to configure and I've already set it up for me and Stuart to get texts.

Here's our new status page: https://status.freecodecamp.org

What do you think of this service? Do you think it can be a replacement for PagerDuty, etc.? Will there still be significant benefit to configuring CloudWatch and Twilio?

ojongerius commented 6 years ago

@QuincyLarson I can think of scenarios where your casual polling will succeed, but the service is impaired for other types of requests. Having said that, I've caught many issues with simple scheduled end-to-end tests that would have gone under the radar of specific metric monitors and unit tests.

I would not see it as a replacement, but a great addition 💯
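As a rough illustration of a scheduled end-to-end check that looks past "did the server answer", here is a minimal standard-library Python sketch. The URL, the expected key, and the assumption that the endpoint returns a JSON object are hypothetical, not details of this API.

```python
import json
import urllib.request


def fetch(url, timeout=10):
    """Return (status_code, body_text) for a GET request."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


def check_endpoint(url, expected_key, fetcher=fetch):
    """Return None when healthy, else a short description of the failure."""
    try:
        status, body = fetcher(url)
    except OSError as exc:  # covers URLError, HTTPError, and timeouts
        return f"request failed: {exc}"
    if status != 200:
        return f"unexpected status {status}"
    try:
        payload = json.loads(body)
    except ValueError:
        return "response is not valid JSON"
    if not isinstance(payload, dict) or expected_key not in payload:
        return f"response is missing {expected_key!r}"
    return None
```

Run from a scheduler once a minute, a non-`None` result would be what triggers the alert. The `fetcher` argument exists only so the check can be exercised without a network.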

Re: https://status.freecodecamp.org, is it down? It's timing out for me at the moment:

▶ wget https://status.freecodecamp.org/
--2018-05-11 11:20:04--  https://status.freecodecamp.org/
Resolving status.freecodecamp.org (status.freecodecamp.org)... 69.162.67.140
Connecting to status.freecodecamp.org (status.freecodecamp.org)|69.162.67.140|:443... failed: Operation timed out.
Retrying.

--2018-05-11 11:21:21--  (try: 2)  https://status.freecodecamp.org/
Connecting to status.freecodecamp.org (status.freecodecamp.org)|69.162.67.140|:443...

QuincyLarson commented 6 years ago

@ojongerius Yes - I agree that there are plenty of corner cases that justify us having a more robust solution.

Not sure why you weren't able to hit the status page, but it's up now:

FreeCodeCamp➜~» wget https://status.freecodecamp.org/
--2018-05-12 17:46:30--  https://status.freecodecamp.org/
Resolving status.freecodecamp.org... 69.162.67.141
Connecting to status.freecodecamp.org|69.162.67.141|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13053 (13K) [text/html]
Saving to: 'index.html'

index.html                              100%[===============================================================================>]  12.75K  --.-KB/s   in 0.04s  

2018-05-12 17:46:31 (320 KB/s) - 'index.html' saved [13053/13053]

ojongerius commented 6 years ago

Just noticed that SNS has supported sending SMS directly since 2016.

QuincyLarson commented 6 years ago

@ojongerius Awesome - so it doesn't require Twilio integration? We could use it for messaging when we have outages?

ojongerius commented 6 years ago

That's right. Unless AWS is down... So there is still a strong use case for external monitoring that includes alerting.
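For reference, subscribing team phone numbers straight to an SNS topic could look roughly like this, assuming Python with boto3. The topic ARN and phone numbers are placeholders, and the E.164 check is an extra sanity step added for this sketch, not something SNS requires from the caller.

```python
import re

# E.164: a "+" followed by up to 15 digits, the first of which is not 0.
E164 = re.compile(r"^\+[1-9]\d{1,14}$")


def sms_subscriptions(topic_arn, phone_numbers):
    """Arguments for one SNS `subscribe` call per phone number."""
    subscriptions = []
    for number in phone_numbers:
        if not E164.match(number):
            raise ValueError(f"not an E.164 phone number: {number!r}")
        subscriptions.append(
            {"TopicArn": topic_arn, "Protocol": "sms", "Endpoint": number}
        )
    return subscriptions


def subscribe_team(topic_arn, phone_numbers):
    """Subscribe each number to the topic. Needs boto3 and AWS credentials."""
    import boto3  # imported here so the helper above works without it

    sns = boto3.client("sns")
    for params in sms_subscriptions(topic_arn, phone_numbers):
        sns.subscribe(**params)
```

Any CloudWatch alarm that lists the topic in its alarm actions would then text every subscribed number.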

QuincyLarson commented 6 years ago

@ojongerius Yes - but if AWS goes down there isn't a lot we can do anyway. It's gone down what - 4 or 5 times in 10 years?