As an SA, I want a monitoring capability to check that PDS search and access services are up and running 99.9% of the time

jordanpadams commented 3 years ago

For more information on how to populate this new feature request, see the PDS Wiki on User Story Development:

https://github.com/NASA-PDS/nasa-pds.github.io/wiki/Issue-Tracking#user-story-development

Motivation

...so that we can ensure the 99.99% uptime "requirement" we are striving for across all our systems

Additional Details

This goes beyond the systems monitoring the SAs have set up for our machines. That monitoring only ensures the services are up and running ⚙️ (e.g. Tomcat hasn't crashed), but we need some tool or service that makes sure the API actually returns data.

Acceptance Criteria

Given a registry ingestion fails and wipes out the database (or name your favorite service)
When I perform nothing
Then I expect to get a notification that something is wrong with the registry
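To make the "wiped database" case concrete, here is a minimal sketch of a check that separates "service answered" from "service answered with data". The function name `classify` is hypothetical, and the response shape (`{"response": {"numFound": N}}`, as a Solr-style JSON reply would give) is an assumption about what the search service returns, not something confirmed in this issue.

```python
import json


def classify(status_code, body_text):
    """Return 'down', 'empty', or 'ok' for a single HTTP response."""
    if status_code != 200:
        return "down"
    try:
        payload = json.loads(body_text)
    except ValueError:
        # The server answered, but not with parseable JSON.
        return "down"
    # Assumption: a Solr-style body, {"response": {"numFound": N, ...}}.
    response = payload.get("response") if isinstance(payload, dict) else None
    if not isinstance(response, dict):
        return "down"
    num_found = response.get("numFound", 0)
    if isinstance(num_found, int) and num_found > 0:
        return "ok"
    # HTTP 200 with zero records: the process is alive but the index looks wiped.
    return "empty"
```

An "empty" result is the case the existing machine-level monitoring would miss, so it is the one that most needs to raise a notification.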

Engineering Details

We should think of this as some sort of generic "ping" software (or GitHub Action) that kicks all our services every so often to see if they are awake and functioning properly. At minimum, this should support the following services to start:

[ ] Legacy Keyword Search (e.g. https://pds.nasa.gov/services/search/search?wt=json&q=identifier:%22urn:nasa:pds:context:instrument_host:spacecraft.mars2020)
[ ] PDS API (which will also test the registry)
[ ] DOI Service
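A rough sketch of what the "ping" for one of these services might look like, using only the Python standard library. The name `probe`, the 10-second timeout, and the injectable `opener` parameter are illustrative choices, not an agreed design; the URL is copied as written in the checklist above. Endpoints for the PDS API and DOI Service are not given in this issue, so they are left out.

```python
import urllib.error
import urllib.request

# Copied verbatim from the Legacy Keyword Search checklist item.
LEGACY_SEARCH_URL = (
    "https://pds.nasa.gov/services/search/search"
    "?wt=json&q=identifier:%22urn:nasa:pds:context:instrument_host:spacecraft.mars2020"
)


def probe(url, timeout=10.0, opener=urllib.request.urlopen):
    """Fetch url once.

    Returns (status_code, body_text) when the server answers, or
    (None, reason) when it cannot be reached at all.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        return exc.code, ""
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, timeout, and similar.
        return None, str(exc)
```

Calling `probe(LEGACY_SEARCH_URL)` would give the status and body to inspect; the `opener` argument exists only so the function can be exercised without touching the network.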

nutjob4life commented 3 years ago

Our handy-dandy system administrators run Nagios for various services for the Early Detection Research Network. Perhaps we can leverage that?

nutjob4life commented 3 years ago

Regardless of how it's launched, some of these we can handle with just "curl", and some will need some scripts (end-to-end testing—thinking of the repeated issues I've had with the API client).

Researching further: GitHub Actions does support periodic event triggers: https://docs.github.com/en/actions/reference/events-that-trigger-workflows#scheduled-events

jordanpadams commented 3 years ago

@nutjob4life agreed. will Nagios do things like curl checks? I thought that was more used for "service is running" versus "service is running and returning data", but there is a good chance I have no idea what I am talking about.

per the curl vs. end-to-end testing, agreed here. I was thinking this could be a two-phased approach.

phase 1: simple spot checks to make sure the services are running (I think GitHub Actions could be perfect here. I've used the periodic event triggers for validate to update our context product config)
phase 2: end-to-end testing integrated into our CD pipeline (e.g. https://github.com/NASA-PDS/pds-api/issues/51) for the tools/services
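For phase 1, one possible shape for the spot-check runner is sketched below. It assumes each check is a zero-argument callable returning `(ok, detail)`; the names `run_checks` and `main` are hypothetical. The idea is that a non-zero exit status makes a scheduled job show up as a failed run, which is what would trigger the notification.

```python
import sys


def run_checks(checks):
    """Run every named check and return a list of (name, detail) failures."""
    failures = []
    for name, check in checks.items():
        try:
            ok, detail = check()
        except Exception as exc:
            # A check that crashes is treated the same as a check that fails.
            ok, detail = False, f"check raised {exc!r}"
        if not ok:
            failures.append((name, detail))
    return failures


def main(checks):
    """Report failures on stderr; return 1 if anything failed, else 0."""
    failures = run_checks(checks)
    for name, detail in failures:
        print(f"FAIL {name}: {detail}", file=sys.stderr)
    return 1 if failures else 0
```

A script would end with something like `sys.exit(main({"legacy-search": some_check}))`, where `some_check` is whatever spot check we settle on for that service.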

jordanpadams commented 2 years ago

deferring this to B12.1 since this is a much more comprehensive effort
