CCI-Tools / cate

ESA CCI Toolbox (Cate)
MIT License

Reflect availability and health of ODP service #789

Open forman opened 6 years ago

forman commented 6 years ago

Expected behavior

Cate Desktop should "know" whether the CCI ODP service is available and should clearly display its health status to users.

Actual behavior

Users receive error messages when downloading and accessing data (usually connection time-out errors). To users it appears as if Cate was not working correctly.

Steps to reproduce the problem

Download or access ODP data sources while the ODP services are down.

Specifications

Cate 1.0 - 2.0.dev20

forman commented 6 years ago

Added label "external" because resolution requires new ODP web service.

forman commented 6 years ago

Here is some example JSON that could be returned by the health service:

{
  "services": {
    "CSW": {
      "status": "OK"
    },
    "WCS": {
      "status": "OK"
    },
    "ESGF": {
      "status": "OK"
    },
    "OPENDAP": {
      "status": "SLOW",
      "reason": "..."
    },
    "HTTP": {
      "status": "OK"
    },
    "FTP": {
      "status": "DOWN",
      "reason": "..."
    }
  },
  "announcements": [
    {
      "published": "2018-12-06T10:20:13",
      "status": "DOWNTIME",
      "services":  ["CSW"],
      "period":  ["2019-01-01", "2019-01-03"],
      "title": "Catalogue Service Downtime",
      "description": "The ODP CSW will be down from 2019-01-01 to 2019-01-03 for maintenance reasons."
    },
    {
      "published": "2018-11-23T14:06:31",
      "period":  ["2019-02-10", "2019-02-12"],
      "services":  ["OPENDAP", "CSW", "WCS", "ESGF"],
      "status": "LOWBANDWIDTH",
      "title": "Service Migration",
      "description": "All ODP services will be moved to new infrastructure. From 2019-02-10 to 2019-02-12 you may observe low bandwidth."
    }
  ]
}
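
As a rough illustration of how a client might consume a document like the one above, here is a minimal Python sketch. It assumes only the shape shown in the example (a "services" mapping with a "status" per service, and announcement entries with "period" and "title"). The function names `summarize_health` and `announcements_on` are hypothetical, and treating the end of a period as inclusive is an assumption.

```python
from datetime import date


def summarize_health(doc: dict) -> dict:
    """Group service names by their reported status, e.g. {"OK": [...], "DOWN": [...]}."""
    by_status = {}
    for name, info in doc.get("services", {}).items():
        # A service without a "status" entry is reported as UNKNOWN (assumption).
        by_status.setdefault(info.get("status", "UNKNOWN"), []).append(name)
    return by_status


def announcements_on(announcements: list, day: date) -> list:
    """Return the titles of announcements whose period covers the given day."""
    titles = []
    for item in announcements:
        start, end = (date.fromisoformat(d) for d in item["period"])
        if start <= day <= end:  # end date treated as inclusive (assumption)
            titles.append(item["title"])
    return titles


# Trimmed-down copy of the example document.
SAMPLE = {
    "services": {
        "CSW": {"status": "OK"},
        "OPENDAP": {"status": "SLOW", "reason": "..."},
        "FTP": {"status": "DOWN", "reason": "..."},
    }
}
```

A GUI could use the first function to decide which services to flag, and the second to show upcoming or ongoing downtimes next to the affected data sources.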

cpaulcox commented 5 years ago

Is the services section meant to be populated as a result of polling the origin servers? If so, then:

forman commented 5 years ago

Our aim is to have a RESTful meta-service API that we can call from the CCI Toolbox. Again, we don't care how this is implemented on the server side. Timeouts on the client can have various causes; we want to know what the status is on the server side.
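
To make the client side of this concrete, here is a minimal sketch of how the toolbox might query such a meta-service. No such endpoint exists yet, so `HEALTH_URL` is a placeholder, and `fetch_health` and its `opener` parameter are hypothetical names. Only the Python standard library is used.

```python
import json
import urllib.request

# Placeholder URL: the real endpoint would have to be provided by the ODP team.
HEALTH_URL = "https://example.org/odp/health"


def fetch_health(url=HEALTH_URL, timeout=5.0, opener=urllib.request.urlopen):
    """Return (reachable, payload).

    reachable is False when the health service itself could not be queried,
    in which case payload only carries a human-readable reason.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return True, json.load(response)
    except (OSError, ValueError) as error:
        # OSError covers URLError, connection errors and timeouts;
        # ValueError covers a response body that is not valid JSON.
        return False, {"reason": str(error)}
```

The short timeout is deliberate: if even the status endpoint does not answer, the client can report that quickly instead of letting a data download run into its own time-out. The `opener` parameter exists only so the function can be exercised without network access.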

forman commented 5 years ago

For example, we just received a mail from Alison saying:

Just to let you know that there was an issue with the ESGF update that we deployed yesterday, and to fix it, the OPeNDAP (and other ESGF access e.g. HTTP, WMS) will need to be taken offline this afternoon. I’ll let you know as soon as it’s back up and running, but it may be down all afternoon unfortunately. The portal front end and anonymous ftp download should be unaffected.

This is the stuff that we would like to pass over to our users in advance.

cpaulcox commented 5 years ago

I still don't understand how you expect the services section to be updated? If you want to know the status on the server side it suggests a manual update, which as I've mentioned before won't be workable for unscheduled outages, or some integration on-site with the opendap servers.

forman commented 5 years ago

I still don't understand how you expect the services section to be updated?

I don't know. I expect some experts will find a solution.

E.g. using https://www.nagios.org/

JanisGailis commented 5 years ago

Just to chime in on this discussion a bit. Here are a few examples of how widely used, well-known services convey status information to their users:

http://status.gandi.net/timeline
https://status.twitterstat.us/#
https://status.status.io/

How exactly the status of a particular system of a particular service is determined and updated is of course specific to each system. From the users' perspective, however, a trusted, machine-readable channel is provided.

forman commented 5 years ago

Thanks @JanisGailis !

cpaulcox commented 5 years ago

Interesting. The gandi.net one illustrates a couple of points that I'm trying to make above.

forman commented 5 years ago

I'm going to address this now by separating network errors from others, so the GUI can show a different error dialog.
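
A rough sketch of the idea, not the actual Cate code: classify an exception as a network error or something else, so the caller can pick the dialog. The names `NETWORK_ERRORS`, `is_network_error` and `dialog_title`, and the dialog titles themselves, are made up for illustration.

```python
import socket
import urllib.error

# Exception types treated as "the network or the remote service is the problem".
NETWORK_ERRORS = (
    urllib.error.URLError,  # includes HTTPError
    ConnectionError,        # refused, reset, aborted connections
    TimeoutError,
    socket.timeout,         # alias of TimeoutError on newer Python versions
    socket.gaierror,        # host name resolution failures
)


def is_network_error(error: BaseException) -> bool:
    """True if the error, or any error it was explicitly raised from, is network-related."""
    while error is not None:
        if isinstance(error, NETWORK_ERRORS):
            return True
        error = error.__cause__
    return False


def dialog_title(error: BaseException) -> str:
    """Pick a dialog title; the wording here is purely illustrative."""
    if is_network_error(error):
        return "Network or data service problem"
    return "Unexpected error"
```

Following `__cause__` matters because low-level socket errors are often wrapped in a higher-level exception before they reach the GUI layer.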

forman commented 5 years ago

Now showing the following error dialogs:

image