cvmfs-contrib / cvmfs-servermon

CernVM File System Server Monitoring API
BSD 2-Clause "Simplified" License

CVMFS Stratum One monitoring assist

To assist with monitoring cvmfs servers, there is a separate rpm called "cvmfs-servermon". It interprets conditions on cvmfs servers and makes them available through a friendly API. Currently it monitors two aspects on Stratum 1 servers (serving "replicas") and one on Stratum 0 servers (a.k.a. "release managers"), and it is designed to be extended as more test cases are added. It is intended to be very easy to tie into any local monitoring system that can probe over HTTP. There is also a monitoring probe at CERN that uses the interface to monitor many stratum 1s.

cvmfs-servermon can be configured to read from more than one remote machine, but by default it is configured to read from localhost, and that's the easiest way to use it.

If you have an idea for an extension or run into any problems, please create a GitHub issue.

Installation and configuration

To install on a RHEL7-compatible or RHEL8-compatible machine, do the following. If you have not yet set up the cvmfs-contrib repository, first do that as instructed on the cvmfs-contrib home page.

Then install cvmfs-servermon:

# yum install -y cvmfs-servermon

Configuration is optional and is done in a simple file, /etc/cvmfsmon/api.conf. In there you can define aliases for remote machines, list repositories you want to exclude from monitoring, list tests you want to disable, and change the default test limits. See the comments in the file.

If you are using a shared cvmfs httpd configuration file and not letting the cvmfs_server command manage the httpd configuration itself, then it needs a small modification. In particular, with the configuration recommended on the StratumOnes twiki, add :/usr/share/cvmfs-servermon/webapi to the end of the WSGIDaemonProcess python-path. Reload httpd after making that change.

API

The web API is very simple. URLs are of the following format:

/cvmfsmon/api/v1.0/montests&param1=value1&param2=value2
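As a minimal sketch of how a client might assemble such a URL, the helper below follows the format shown above, where every parameter (including the first) is appended with "&". The function name build_url and the base address http://localhost are illustrative assumptions, not part of cvmfs-servermon.

```python
from urllib.parse import quote


def build_url(base, montest, **params):
    """Return a cvmfs-servermon API URL for one montest (hypothetical helper)."""
    url = f"{base.rstrip('/')}/cvmfsmon/api/v1.0/{quote(montest, safe='')}"
    # The documented format joins every parameter, including the first, with '&'.
    for name, value in params.items():
        url += f"&{quote(name, safe='')}={quote(str(value), safe='')}"
    return url


# Base address is an assumption; substitute the server you actually probe.
print(build_url("http://localhost", "all", format="details"))
# prints http://localhost/cvmfsmon/api/v1.0/all&format=details
```

The result can be fetched with curl, wget, or any HTTP client.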

"montests" is currently one of the following:

  1. "ok" - always returns OK (useful for just getting a list of repositories)
  2. "all" - runs all applicable tests except "ok"
  3. "updated" - verifies that updates are happening to the repositories of a stratum 1. If the last update happened within the previous 8 hours, a repository is considered to be OK. If the last update occurred between 8 and 24 hours ago, a repository will be in WARNING condition. If the last update happened more than 24 hours ago, a repository will be in CRITICAL condition. The limits of 8 and 24 hours can be changed in /etc/cvmfsmon/api.conf. Individual repositories that are slower to update than others can be listed in updated-slowrepo keywords in /etc/cvmfsmon/api.conf, and their limits for WARNING and CRITICAL are then multiplied by the number specified in limit updated-multiplier.
  4. "gc" - verifies that repositories that have ever had garbage collection run on them, on a stratum 0 or a stratum 1, have successfully completed garbage collection recently. If the last successful garbage collection was more than 10 days but less than 20 days ago, the repository will be in WARNING condition, and it will be in CRITICAL condition if the last successful garbage collection was more than 20 days ago. The limits can be changed in /etc/cvmfsmon/api.conf.
  5. "geo" - verifies that the geo API on a stratum 1 successfully responds with a server order for a test case on one repository. If there is no order returned, the test will be in CRITICAL condition, and it will be in WARNING condition if the wrong order is returned. The test also monitors geodb age: it will be in WARNING condition if the geodb was last updated more than 30 days ago.
  6. "whitelist" - verifies that the .cvmfswhitelist file on a stratum 0 or stratum 1 is not expired. If the expiration time is less than 48 hours away (by default), a repository will be in WARNING condition, and it will be in CRITICAL condition if the whitelist file is expired. The warning limit can be changed in /etc/cvmfsmon/api.conf.
  7. "check" - verifies that cvmfs_server check on a stratum 0 or stratum 1 did not have any failures. A repository will be in WARNING condition if there was a failure the last time cvmfs_server check ran on the repository.
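The age-based tests ("updated" and "gc") share the same shape: an age is compared against a warning limit and a critical limit. The sketch below illustrates that idea only. It is not the cvmfs-servermon implementation; the function name classify_age is hypothetical, and the behaviour exactly at a limit (treated here as the milder condition) is an assumption.

```python
def classify_age(age_hours, warning=8.0, critical=24.0, multiplier=1.0):
    """Map the age of the last successful event to OK, WARNING, or CRITICAL.

    Defaults mirror the documented "updated" limits of 8 and 24 hours.
    multiplier models a slow repository whose limits are scaled up.
    """
    warning_limit = warning * multiplier
    critical_limit = critical * multiplier
    if age_hours > critical_limit:
        return "CRITICAL"
    if age_hours > warning_limit:
        return "WARNING"
    return "OK"


# "updated" with default limits
print(classify_age(2))                  # prints OK
print(classify_age(12))                 # prints WARNING
print(classify_age(30))                 # prints CRITICAL
# a slow repository with a multiplier of 2 (limits become 16 and 48 hours)
print(classify_age(30, multiplier=2))   # prints WARNING
# "gc" limits of 10 and 20 days, expressed in hours
print(classify_age(15 * 24, warning=10 * 24, critical=20 * 24))  # prints WARNING
```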

The params are all optional. The currently supported params are:

  1. "format" - value is one of the following:
    1. "status" - only returns one of the following on one line: OK, WARNING, or CRITICAL. The condition returned is the worst one of any of the tests.
    2. "list" - (default if format not specified) - reports one line for each current status (in the order of CRITICAL, WARNING, OK) followed by a colon and a comma-separated list of repositories in that condition.
    3. "details" - returns a detailed JSON-formatted list of all conditions of every montest, the repositories in those conditions, and any messages explaining the conditions.
  2. "server" - value is an alias defined in /etc/cvmfsmon/api.conf. Default is "local" which maps to the hostname "localhost".
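A local monitoring script could consume the "list" format roughly as follows. This is a sketch based only on the description above (a status, a colon, then comma-separated repositories); the exact whitespace in real responses is not specified here, so the parser strips it defensively. The function names and the example.org repository names are made up for illustration.

```python
SEVERITY = ["OK", "WARNING", "CRITICAL"]  # mildest to worst


def parse_list_output(text):
    """Parse 'STATUS: repo1,repo2' lines into a dict of status -> repositories."""
    by_status = {}
    for line in text.splitlines():
        status, sep, repos = line.partition(":")
        if not sep:
            continue  # skip blank or unexpected lines
        by_status[status.strip()] = [r.strip() for r in repos.split(",") if r.strip()]
    return by_status


def worst_status(by_status):
    """Return the most severe status that has at least one repository, else None."""
    present = [s for s in SEVERITY if by_status.get(s)]
    return present[-1] if present else None


# Illustrative response text, not captured from a real server.
sample = "CRITICAL: a.example.org\nOK: b.example.org, c.example.org\n"
parsed = parse_list_output(sample)
print(parsed["OK"])          # prints ['b.example.org', 'c.example.org']
print(worst_status(parsed))  # prints CRITICAL
```

When only the overall condition matters, requesting format=status avoids the parsing step entirely.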

Examples

Try clicking on the following or reading them with curl or wget:

CERN XSLS availability monitor

cvmfs-servermon is intended to be used easily by any site's own monitoring system, but there is also a monitoring system at CERN that tracks the status of all the major stratum 1s that support cvmfs-servermon. The CERN monitoring system runs every 15 minutes, and whenever the status has changed for two probes in a row it sends an email to the cvmfs-stratum-alarm@cern.ch mailing list. For a graphical history it also uploads the status to CERN's Grafana-based Service Availability website (via the mechanism documented here). If you'd like a change to the stratum 1s that are monitored, contact cvmfs-servermon-support@cern.ch. In order to be monitored, a stratum 1 needs to either be running cvmfs-server-2.2.X or later, or have cvmfs-servermon installed (or both).
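One possible reading of the "two probes in a row" rule is that a new status is only reported once it has been seen in two consecutive probes, which filters out one-off glitches. The sketch below illustrates that reading. It is not the CERN probe's actual code, and the class name ChangeNotifier is hypothetical.

```python
class ChangeNotifier:
    """Signal a notification once a changed status has been seen twice in a row."""

    def __init__(self, initial="OK"):
        self.reported = initial   # last status that was announced
        self.previous = initial   # status seen by the previous probe

    def observe(self, status):
        """Record one probe result; return True if an email would be sent."""
        confirmed = status == self.previous and status != self.reported
        self.previous = status
        if confirmed:
            self.reported = status
        return confirmed


notifier = ChangeNotifier()
probes = ["OK", "WARNING", "WARNING", "WARNING", "OK", "WARNING", "OK", "OK"]
print([notifier.observe(s) for s in probes])
# prints [False, False, True, False, False, False, False, True]
```

With a 15-minute probe interval, this reading means a change is announced roughly 15 to 30 minutes after it first occurs.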

The machine at CERN that is doing the probes is wlcg-squid-monitor.cern.ch. cvmfs-servermon is installed there, so it can read the status remotely from stratum 1s. The primary advantage of running cvmfs-servermon on the stratum 1s themselves is that it allows the stratum 1 administrator to choose when to exclude a repository from monitoring (by configuring it in /etc/cvmfsmon/api.conf). It also reduces the number of remote TCP connections needed, because a remote cvmfs-servermon has to read the status of each repository separately.