gbif / portal16

GBIF.org website
https://www.gbif.org
Apache License 2.0

Idea: System health <-> Statuspage #1060

Open dnoesgaard opened 5 years ago

dnoesgaard commented 5 years ago

Inspired by Datacite, Contentful and many others, I've been playing around with the Statuspage.io service. Unlike our own health status service, Statuspage is mainly meant as a communications tool. However, it does allow for automated updating, as well as planned maintenance, uptime statistics and notifications to social media and email subscribers.

Do you think it would make sense to somehow combine our automated monitoring with a Statuspage?

(quick intro to Statuspage: https://help.statuspage.io/knowledge_base/topics/statuspage-user-guide)

MortenHofft commented 5 years ago

As far as I remember, that is the one Contentful use. I believe I looked at it back when I implemented it. Honestly, I cannot remember the reasons not to do more with it. Perhaps you could pop down and explain it at some point? It sounds like you have already implemented something.

MortenHofft commented 5 years ago

I'd be more than happy to deprecate the one we have. It is a hack for lack of a Nagios API. So if we get better monitoring with an API and a tool to monitor, then I'm happy. The current status page also cannot show if the site is down, so obviously it isn't ideal.

dnoesgaard commented 5 years ago

I'd like to explore this one a bit more, as we've been granted a free "open source" license for Statuspage. Whether we need another service to do the actual monitoring, I'll leave up to you to consider. However, for the purpose of tracking downtime history, notifications and system performance more transparently, it would be really cool if we could use Statuspage as the system health page for GBIF.

Some basic info

Statuspage allows for simple API calls to update the status of "components", e.g.

curl https://api.statuspage.io/v1/pages/tpr241s1tthg/components/h4knn82kpnx4.json -H "Authorization: OAuth xyz" -X PATCH -d "component[status]=major_outage"
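As a rough sketch of how automated monitoring could drive that call, the snippet below maps a Nagios-style plugin exit code (0 OK, 1 WARNING, 2 CRITICAL) to a Statuspage component status and then sends the same PATCH request as above. The function names, the `STATUSPAGE_TOKEN` variable and the mapping itself are assumptions for illustration, not something we have in place.

```shell
# Sketch only: the function names and the exit-code mapping are hypothetical.

# Map a Nagios-style exit code to a Statuspage component status.
# Unknown codes return non-zero so the caller can leave the status untouched.
status_for_exit_code() {
  case "$1" in
    0) echo "operational" ;;
    1) echo "degraded_performance" ;;
    2) echo "major_outage" ;;
    *) return 1 ;;
  esac
}

# PATCH one component, using the endpoint and header from the example above.
# Arguments: page id, component id, status. Token is read from STATUSPAGE_TOKEN.
update_component() {
  curl --silent --fail -X PATCH \
    "https://api.statuspage.io/v1/pages/$1/components/$2.json" \
    -H "Authorization: OAuth ${STATUSPAGE_TOKEN}" \
    -d "component[status]=$3"
}

# Example call (placeholders, not run here):
#   status=$(status_for_exit_code 2) && update_component "$PAGE_ID" "$COMPONENT_ID" "$status"
```

Returning an error for unknown exit codes is a deliberate choice here, so a flaky check would not flip the public status by accident.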

Every outage can be accompanied by an "incident" by which we tell people what we know about a certain outage. This will trigger email notifications to subscribers.
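Creating such an incident could look roughly like the sketch below. I'm assuming an `incidents.json` endpoint under the same page, with `incident[name]`, `incident[status]` and `incident[body]` fields; those names should be checked against the Statuspage API reference before relying on them. The `create_incident` function and `STATUSPAGE_TOKEN` variable are hypothetical.

```shell
# Sketch only: endpoint and field names are assumptions to verify in the API docs.

# Open a new incident in the "investigating" state.
# Arguments: page id, incident title, message shown to subscribers.
create_incident() {
  curl --silent --fail -X POST \
    "https://api.statuspage.io/v1/pages/$1/incidents.json" \
    -H "Authorization: OAuth ${STATUSPAGE_TOKEN}" \
    --data-urlencode "incident[name]=$2" \
    --data-urlencode "incident[status]=investigating" \
    --data-urlencode "incident[body]=$3"
}

# Example call (placeholders, not run here):
#   create_incident "$PAGE_ID" "Occurrence search unavailable" "We are looking into it."
```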

Statuspage can also be used to track system metrics, e.g. downloads, ingestion, load level, etc. Once a metric has been defined, we can call their API to tell them about current "load", e.g.

curl https://api.statuspage.io/v1/pages/tpr241s1tthg/metrics/zzzxrkmyhmxn/data.json -H "Authorization: OAuth xyz" -d "data[timestamp]=1550670225" -d "data[value]=n"

(to make stats sensible, they require an update at least every 5 mins)
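To meet that 5-minute cadence, the metric call could be wrapped in a small function and run from cron. This is only a sketch: `submit_metric`, `STATUSPAGE_TOKEN`, the script path and the way the value is obtained are all placeholders.

```shell
# Sketch only: names and the cron path below are hypothetical.

# Send one data point for a metric, using the endpoint from the example above.
# Arguments: page id, metric id, value, optional Unix timestamp (defaults to now).
submit_metric() {
  timestamp="${4:-$(date +%s)}"
  curl --silent --fail -X POST \
    "https://api.statuspage.io/v1/pages/$1/metrics/$2/data.json" \
    -H "Authorization: OAuth ${STATUSPAGE_TOKEN}" \
    -d "data[timestamp]=${timestamp}" \
    -d "data[value]=$3"
}

# Example crontab entry (placeholder path) to push a value every 5 minutes:
#   */5 * * * * /path/to/push-metric.sh
```

Passing the timestamp explicitly is optional but makes it possible to backfill a missed data point.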

We might have other monitoring systems in place that we can consider hooking up to Statuspage as well.

Anyway, happy to talk more about this. Perhaps @MattBlissett might also have an opinion?