gbif / gbif-web

Status page #599

Open MortenHofft opened 2 weeks ago

MortenHofft commented 2 weeks ago

Both gbif.org and hosted portals need a way to display the status of our services. Currently hosted portal users report more outages than gbif.org users, presumably because the hosted portals give no indication that we have an outage.

Similarly, when pages fail, we could tell the user that we are in fact having infrastructure issues at the moment.

Once again, I think we should explore the option of exposing this status via an external status page service such as statuspage.io.

mpitblado commented 5 days ago

Hi @MortenHofft

As I just put up a status page for the museum and the hosted portal is included in it, I would be interested to see how this develops. Currently, I have a simple ping check to the hosted portal homepage, and I also check https://www.gbif.org/api/health to see if any of the services are reporting CRITICAL. This is displayed on our status page as an "upstream service" that may impact functionality of the portal.
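For anyone wanting to replicate the CRITICAL check, here is a minimal Python sketch. The response layout of the health endpoint is not documented in this thread, so the code assumes only that the response is JSON and that a failing service carries the string CRITICAL somewhere as a value. The `sample` payload and its field names are hypothetical.

```python
import json
from typing import Any, Iterator


def iter_strings(node: Any) -> Iterator[str]:
    """Yield every string value found anywhere in a decoded JSON document."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def reports_critical(payload: str) -> bool:
    """Return True if any string value in the JSON payload equals CRITICAL."""
    return any(s == "CRITICAL" for s in iter_strings(json.loads(payload)))


# Hypothetical payload: the field names are assumptions, not the real schema.
sample = (
    '{"services": ['
    '{"name": "registry", "severity": "CRITICAL"},'
    '{"name": "occurrence", "severity": "OPERATIONAL"}]}'
)
print(reports_critical(sample))
```

Scanning all values rather than one named field keeps the check working if the layout changes, which seems prudent for an endpoint that is not a stable API.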

A few weeks ago, there was an issue with the registry service that caused the dataset summary pages to not load properly. On our end, this could have been detected by monitoring for a keyword on one of the dataset pages ("occurrence", for example). What actually happened is that I posted a quick explanation and referred users to gbif.org/health.

If there was a hosted portal specific status page, how would this differ from what is available from the current system health page? Just curious as I can perhaps improve some checks!

I also see that the issue referenced in this issue is about us! I was not aware that the user submitted the report, but I did create the status page over the past week to address some emails that mentioned the outage. I have linked our status page at the bottom of our portal, and begun to communicate to our users where they can find it and what to expect from it etc.

MortenHofft commented 5 days ago

I would like to remove this internal endpoint https://www.gbif.org/api/health, as it isn't part of any stable API. You are welcome to use it of course, and I will try to remember to inform you before removing it, but be aware that it is not intended as a stable endpoint.

I must admit I haven't really decided what to do, but I imagine it would do the same as what you see on gbif.org/health. How we would then use that on hosted portals, I'm not sure. It could be a notification or a badge like the one you have nicely done. We also ought to show the status better when a page fails: instead of just showing an error, I would rather show a message stating that we are offline, or at least that users should expect unstable services.

How would you like a status page to be integrated on hosted portals?

mpitblado commented 5 days ago

Possible ideas:

Setting up an endpoint per service that, on a GET request, either returns an HTTP status code or returns a body containing a keyword.

GET https://api.gbif.org/hp-status/datasets
200 OK

or

GET https://api.gbif.org/hp-status/datasets
OPERATIONAL (and returns html/json/xml page with just that word)
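A monitor could treat both response styles above with one rule. The sketch below is an assumption about how a consumer might interpret such an endpoint. The `hp-status` URL is only the proposal from this comment and does not exist, so the function takes an already-fetched status code and body instead of making a request.

```python
from typing import Optional


def is_operational(
    status_code: int, body: str, expected_keyword: Optional[str] = None
) -> bool:
    """Decide whether a status response indicates a healthy service.

    With no keyword, any 2xx status code counts as healthy. With a keyword,
    the response must be 2xx and the body must contain that keyword.
    """
    if not 200 <= status_code < 300:
        return False
    if expected_keyword is None:
        return True
    return expected_keyword in body


# Status-code style: the code alone decides.
print(is_operational(200, ""))
# Keyword style: the body must contain the agreed word.
print(is_operational(200, '{"status": "OPERATIONAL"}', "OPERATIONAL"))
print(is_operational(200, "DEGRADED", "OPERATIONAL"))
```

A substring match is deliberately loose so that the same check works for HTML, JSON, or XML bodies. The keyword should therefore be chosen so that it never appears in an unhealthy response.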

For the hosted portal, perhaps the following services are relevant

The service providing the API endpoint should mimic the actions that the hosted portals perform. For example, if the dataset summary page has to make 4 requests to some service and load 10 React components, the check should also try to perform those same actions and receive a non-error code for each.
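As a rough illustration of such a check, the sketch below runs a set of named steps and reports OPERATIONAL only if every step returns a non-error code. The step names and the lambdas standing in for real requests are hypothetical. A real check would issue the same requests the page makes.

```python
from typing import Callable, Dict, List, Tuple


def run_synthetic_check(
    steps: Dict[str, Callable[[], int]]
) -> Tuple[str, List[str]]:
    """Run every step and return (overall status, names of failed steps).

    A step fails if it raises or returns an HTTP status code of 400 or above.
    """
    failed: List[str] = []
    for name, step in steps.items():
        try:
            code = step()
        except Exception:
            failed.append(name)
            continue
        if code >= 400:
            failed.append(name)
    return ("OPERATIONAL" if not failed else "CRITICAL", failed)


# Hypothetical steps standing in for the requests a dataset page makes.
dataset_page_steps = {
    "dataset metadata": lambda: 200,
    "occurrence count": lambda: 200,
    "registry lookup": lambda: 503,
}
print(run_synthetic_check(dataset_page_steps))
```

Returning the failed step names alongside the overall status lets a status page say which part of the dataset page is affected instead of only that something is wrong.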

Then individual countries/nodes can set up monitors to ping those endpoints, and the idea below could also pull from them.

Replacing the error message with a component that pulls in information from an endpoint

Currently the error message is purely technical; however, it may be possible to load an "error/status component" in the event that the initial component fails. This is similar to how 404 pages often give the user some brief information and then display simple navigation to get them back on track. The risk is that if the React components and the status component share common infrastructure, a failure to deliver one will likely mean the error message fails to be delivered as well. It doesn't have to be a complex React component; it just has to dynamically load its values by calling an API endpoint.

Having a monitor for the web server itself would not be beneficial in this case: if the entire web server is down, the user never gets to see this error message.

I included the sentence about refreshing because it currently resolves the most common error, which is components not loading (see https://github.com/gbif/gbif-web/issues/594 and https://github.com/gbif/gbif-web/issues/429).

Flow: Attempt to load component - (if error) -> Attempt to load status component - (if error) -> Technical error message.
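The flow above can be sketched as a simple fallback chain. This is a language-neutral illustration in Python rather than React code, and all loader names are hypothetical. In the portal, the equivalent would presumably live in an error boundary around the component.

```python
from typing import Callable


def render_with_fallback(
    load_component: Callable[[], str],
    load_status_component: Callable[[], str],
    technical_message: str,
) -> str:
    """Try the main component, then the status component, then give up.

    Each loader returns the content to display or raises on failure.
    """
    for loader in (load_component, load_status_component):
        try:
            return loader()
        except Exception:
            continue
    return technical_message


def broken_component() -> str:
    # Stands in for a component that fails to load.
    raise RuntimeError("component failed to load")


def status_component() -> str:
    # Stands in for a small component that reads a status endpoint.
    return "We are experiencing issues. Try refreshing the page."


print(render_with_fallback(broken_component, status_component, "Technical error"))
print(render_with_fallback(broken_component, broken_component, "Technical error"))
```

The technical message is passed as a plain string so that the last step cannot itself fail, which addresses the shared-infrastructure risk mentioned earlier.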

image

Other thoughts