SovereignCloudStack / openstack-health-monitor

Script to monitor whether an OpenStack cloud is working correctly
https://scs.community/

OpenStack Health Monitor

This is a test script for checking the reliability and performance of the OpenStack API. It works by performing a real scenario test: setting up a complete environment with routers, networks, jumphosts, disks, VMs, ...

We collect statistics on API call performance as well as on resource creation times. Failures are noted and alarms are generated.

Status

TODO

Copyright

(c) Kurt Garloff kurt.garloff@t-systems.com, 2/2017-7/2017
(c) Kurt Garloff scs@garloff.de, 2/2020-4/2021
(c) Kurt Garloff garloff@osb-alliance.com, 5/2021-9/2022

License: CC-BY-SA (2.0)

Description of the flow

Coverage

So we end up testing:

- Router, incl. default route (for SNAT instance)
- Networks, subnets, and virtual IP
- Security groups and floating IPs
- Volume creation from image, deletion after VM destruction
- VM creation from bootable volume (and from image if -d is given)
- Metadata service (without it, ssh key injection fails of course)
- Images (openSUSE OTC, upstream, CentOS and Ubuntu work)
- Loadbalancer (-L/-LL)
- Waiting for volumes and VMs
- Destroying all of these resources again

Alarming and reporting

We compute statistics on the duration of the steps (min, avg, median, 95% quantile, max). We also, of course, note any errors and timeouts and report these, optionally sending alarms via email or SMN (OTC notifications via SMS and other media).
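As an illustration of the kind of statistics collected (a hedged sketch, not the script's actual code), the min/avg/median/95%-quantile/max of a list of step durations can be derived with sort and awk:

```shell
# Sketch: compute summary statistics over a set of step durations (seconds).
# The sample values below are made up for illustration.
durations="1.2 0.8 2.1 1.5 0.9 1.1 3.0 1.4 1.0 1.3"
stats=$(echo $durations | tr ' ' '\n' | sort -n | awk '
  { v[NR] = $1; sum += $1 }
  END {
    # median: middle element (or mean of the two middle elements)
    med = (NR % 2) ? v[(NR+1)/2] : (v[NR/2] + v[NR/2+1]) / 2
    # 95% quantile: nearest-rank from the sorted list
    q95 = v[int(NR * 0.95 + 0.5)]
    printf "min=%s avg=%.2f med=%.2f q95=%s max=%s", v[1], sum/NR, med, q95, v[NR]
  }')
echo "$stats"
```

For ten samples this prints one line with the five aggregates, e.g. min=0.8 ... max=3.0 for the sample data above.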

Runtime

This takes rather long, as typical API calls take between 1 and 2s on OpenStack (including the round trip to keystone for the token).

One optimization possibility is to cache the token and reuse it when creating a large number of resources in a loop. This has been implemented: use option -O (it is not used for volume creation).
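The idea behind token caching can be sketched as follows (a hedged sketch, not the script's actual code; it assumes working OS_ credentials and an openstackclient version supporting token_endpoint auth, and NOVA_EP is a placeholder for the compute endpoint URL):

```shell
# Obtain a token once ...
TOKEN=$(openstack token issue -f value -c id)

# ... then reuse it for subsequent calls, skipping one keystone
# round trip per API call. Exact auth options depend on the client version.
openstack --os-auth-type token_endpoint --os-token "$TOKEN" \
          --os-url "$NOVA_EP" server list
```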

Prerequisites

I typically set this up on openSUSE-15.x images, which come with all these tools (except sendmail) preinstalled -- get them at https://kfg.images.obs-website.eu-de.otc.t-systems.com/. A tiny flavor is enough to run this (1GiB RAM, 5GB disk) -- watch the logfiles though, to avoid them filling up your disk. If you set up the dashboard with telegraf, influxdb and grafana, I would recommend a larger flavor (4GiB RAM, 20GB disk).

Usage

Use api_monitor.sh -h to get a list of the command line options. For reference find the output (from v1.109) here:

Running api_monitor.sh v1.109 on host oshm-driver5 with arguments -h
Using APIMonitor_1714160976_ prefix for resrcs on ciab (nova)
Usage: api_monitor.sh [options]
 --debug Use set -x to print every line executed
 -n N   number of VMs to create (beyond #AZ JumpHosts, def: 12)
 -N N   number of networks/subnets/jumphosts to create (def: # AZs)
 -l LOGFILE record all command in LOGFILE
 -a N   send at most N alarms per iteration (first plus N-1 summarized)
 -R     send recovery email after a completely successful iteration and alarms before
 -e ADR sets eMail address for notes/alarms (assumes working MTA)
         second -e splits eMails; notes go to first, alarms to second eMail
 -E     exit on error (for CONNTEST)
 -m URN sets notes/alarms by SMN (pass URN of queue)
         second -m splits notifications; notes to first, alarms to second URN
 -s [SH] sends stats as well once per day (or every SH hours), not just alarms
 -S [NM] sends stats to grafana via local telegraf http_listener (def for NM=api-monitoring)
 -q     do not send any alarms
 -d     boot Directly from image (not via volume)
 -vt TP use volumetype TP (overrides env VOLUMETYPE)
 -z SZ  boots VMs from volume of size SZ
 -Z     do not create volume for JHs separately
 -P     do not create Port before VM creation
 -D     create all VMs with one API call (implies -d -P)
 -i N   sets max number of iterations (def = -1 = inf)
 -r N   only recreate router after each Nth iteration
 -g N   increase VM volume size by N GB (ignored for -d/-D)
 -G N   increase JH volume size by N GB
 -w N   sets error wait (API, VM): 0-inf seconds or neg value for interactive wait
 -W N   sets error wait (VM only): 0-inf seconds or neg value for interactive wait
 -V N   set success wait: Stop for N seconds (neg val: interactive) before tearing down
 -p N   use a new project every N iterations
 -c     noColors: don't use bold/red/... ASCII sequences
 -C     full Connectivity check: Every VM pings every other
 -o     translate nova/cinder/neutron/glance into openstack client commands
 -O     like -o, but use token_endpoint auth (after getting token)
 -x     assume eXclusive project, clean all floating IPs found
 -I     dIsassociate floating IPs before deleting them
 -L     create HTTP Loadbalancer (LBaaSv2/octavia) and test it
 -LL    create TCP  Loadbalancer (LBaaSv2/octavia) and test it
 -LP PROV  create TCP LB with provider PROV test it (-LO is short for -LP ovn)
 -LR    reverse order of LB healthmon and member creation and deletion
 -b     run a simple compute benchmark (4k pi with bc)
 -B     measure TCP BW b/w VMs (iperf3)
 -M     measure disk I/O bandwidth & latency (fio)
 -t     long Timeouts (2x, multiple times for 3x, 4x, ...)
 -T     assign tags to resources; use to clean up floating IPs
 -2     Create 2ndary subnets and attach 2ndary NICs to VMs and test
 -3     Create 2ndary subnets, attach, test, reshuffle and retest
 -4     Create 2ndary subnets, reshuffle, attach, test, reshuffle and retest
 -R2    Recreate 2ndary ports after detaching (OpenStack <= Mitaka bug)
Or: api_monitor.sh [-f] [-o/-O] CLEANUP XXX to clean up all resources with prefix XXX
        Option -f forces the deletion
Or: api_monitor.sh [Options] CONNTEST XXX for full conn test for existing env XXX
        Options: [-2/3/4] [-o/O] [-i N] [-e ADR] [-E] [-w/W/V N] [-l LOGFILE]
You need to have the OS_ variables set to allow OpenStack CLI tools to work.
You can override defaults by exporting the environment variables AZS, VAZS, NAZS, RPRE,
 PINGTARGET, PINGTARGET2, GRAFANANM, [JH]IMG, [JH]IMGFILT, [JH]FLAVOR, [JH]DEFLTUSER,
 ADDJHVOLSIZE, ADDVMVOLSIZE, SUCCWAIT, ALARMPRE, FROM, ALARM_/NOTE_EMAIL_ADDRESSES, VOLUMETYPE,
 NAMESERVER/DEFAULTNAMESERVER, SWIFTCONTAINER, FIPWAITPORTDEVOWNER, EXTSEARCH, OS_EXTRA_PARAMS.
Typically, you should configure OS_CLOUD, [JH]IMG, [JH]FLAVOR, [JH]DEFLTUSER.
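For example, a minimal configuration might look like the following (all values are placeholders; image names, flavors and default users vary per cloud):

```shell
# Hypothetical example settings -- adjust for your cloud.
export OS_CLOUD=mycloud          # clouds.yaml entry to use
export JHIMG="openSUSE 15.5"     # image for the jumphosts
export IMG="openSUSE 15.5"       # image for the test VMs
export JHFLAVOR=SCS-1V-2         # flavor for the jumphosts
export FLAVOR=SCS-1V-2           # flavor for the test VMs
export JHDEFLTUSER=opensuse      # ssh user on the jumphosts
export DEFLTUSER=opensuse        # ssh user on the VMs
./api_monitor.sh -c -i 1         # one iteration, no color codes
```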

Examples

Run 100 loops deploying (and deleting) 2+8 VMs (including nets, volumes etc.), with daily statistics sent to SMN...API-Notes and Alarms to SMN...APIMonitor:

./api_monitor.sh -n 8 -s -m urn:smn:eu-de:0ee085d22f6a413293a2c37aaa1f96fe:APIMon-Notes -m urn:smn:eu-de:0ee085d22f6a413293a2c37aaa1f96fe:APIMonitor -i 100

The included file run.sh also demonstrates how to use api_monitor.sh.

The script has been used successfully on several OpenStack clouds with keystone v3 (OTC, ECP, CityCloud), started manually or from Jenkins, in some cases recording stats to a local Telegraf to report timings and failures into a Grafana dashboard. Configuration files for telegraf, influxdb and a nice grafana dashboard can be found in the dashboard/ subdirectory.

HOWTO Guide

The directory docs contains a complete setup guide using Debian 12 VMs on an SCS reference deployment.

Benchmarks

There are three simple benchmarks included: -b for a simple compute benchmark (calculating 4000 digits of pi with bc), -B for an iperf3 TCP benchmark between VMs, and -M to measure the bandwidth, IOPS and latency (the percentage of samples with >10ms latency is output) of the root disk. These are meant to detect trends and can help with capacity monitoring.