adoptium / infrastructure

This repo contains all information about machine maintenance.
Apache License 2.0

Bring Nagios Monitoring to the fore #1229

Closed karianna closed 3 years ago

karianna commented 4 years ago

This was a question as to whether to keep it. We think we should, but we need to bring it into our monitoring regime.

Willsparker commented 4 years ago

(ref https://github.com/AdoptOpenJDK/openjdk-infrastructure/pull/1228 )

gdams commented 4 years ago

This has been open for a while now and nobody has spoken up. Do we all agree to shut down the nagios server?

karianna commented 4 years ago

@sxa your call as the main infra rep - do you want something of this nature? Seems odd that we don't have anything, unless Jenkins' reporting is deemed good enough.

sxa commented 4 years ago

I don't personally feel there's any reason to shut it down - ideally we'd be making use of it, but I'm not at present because too many other things keep coming up. That doesn't mean it's not a useful thing to have in place.

I'm not aware that anyone has a reason in favour of it being shut down completely.

karianna commented 4 years ago

OK, so I'll relabel this as 'give it a spruce up'

tellison commented 4 years ago

Nagios should at least be updated to ensure we remain secure there. The latest version of Core is now 4.4.6.

karianna commented 4 years ago

Paging @Willsparker as he wanted to look at this from the other ticket on monitoring SSL certs

aahlenst commented 4 years ago

If bringing Nagios back to life requires a lot of work, it might make sense to check beforehand where that thing is located and what else needs to be done so that it's ready for the future, security- and performance-wise.

Willsparker commented 4 years ago

I can look at updating the playbooks to have the latest version of Nagios, that (hopefully) shouldn't be a problem :-)

aahlenst commented 4 years ago

@sxa @karianna If I understand correctly, there's no Nagios server at the moment. If that's true, can we re-evaluate and write down how we ended up with Nagios and which edition we're going to use? As I already said on Slack, if we have Nagios Core only, Icinga might be the better choice.

karianna commented 4 years ago

@Willsparker / all - we actually have a Nagios Master in place already at 78.47.239.96

Willsparker commented 4 years ago

Oh cool :-) I'm currently looking at installing it on a VM and playing around with it to figure it out

edit: It has the superuser on it so I can login too :+1:

Willsparker commented 4 years ago

I'm going to start looking at this, but I thought I'd ask what everyone would actually want to be monitored via Nagios. Currently there are a lot of default checks for each host that I don't think are entirely necessary (e.g. the 'PING' service: surely a connection issue would be picked up by the other services running against the host, making a separate PING service unnecessary). The vast majority of these checks also notify #infrastructure-bot, which results in an awful lot of output that ends up becoming white noise. Certain services could instead notify the relevant Slack channels, or, if the service isn't really important, notifications could be disabled entirely.

So - what services should run for each type of machine (i.e. build, test, infra, perf), where should the services notify if something goes wrong (if at all), and are there any special exceptions (e.g. the ci.adoptopenjdk.net machine will have a service to monitor its SSL certificate: #1568 )?

aahlenst commented 4 years ago

Nagios and its forks indeed report if no result comes back some way or another (state "unknown").

As we're dealing with build/test servers, we should check for the problems that concern us. Maxing out RAM/CPU probably does not, filling up the disk does.

Shoot from the hip:

For the Jenkins server, TRSS and other servers that provide services: monitoring CPU, RAM, SSL certificate might be a good idea, too. The SSL certificate check would also be good for the website and the API.

aahlenst commented 4 years ago

Looking at https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1602, it might make sense to monitor RAM.

Willsparker commented 4 years ago

Maybe all of those as a once-a-day-check? Possibly in the morning so it doesn't run at the same time as nightlies. Also what Network time sync plugin would that be? Currently we're not monitoring Windows, so that's something I should probably look into as well :-)

karianna commented 4 years ago

Once a day sounds good. Mornings (EMEA time) probably sane. Network time sync probably tries to keep all of the machines sync'd timewise (you can see the drift when you look at all of the nodes in Jenkins actually).

Windows would be great 👍

Willsparker commented 4 years ago

Okay, cool - I'll get started on that then :-) I'll keep a backup of all the old config files in a directory somewhere, just in case.

Willsparker commented 4 years ago

(note to self) Useful documentation / resources I found (I'll update this as I go):
"Setting up a Nagios Server on Ubuntu1604": https://www.howtoforge.com/tutorial/how-to-install-nagios-on-ubuntu-16-04/
"Template-Based Object Configuration": http://nagios.manubulon.com/traduction/docs25en/xodtemplate.html
"Event Handlers": https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/eventhandlers.html

Willsparker commented 4 years ago

I've been looking through and making a script to generate a config file for each host automatically (I'll put it here once it's done), but I was looking at adding the service to query if a given node was connected to Jenkins, and there doesn't appear to be an immediately obvious way. I'd be able to use the check_by_ssh to run a script that looks for a java process running Jenkins, but that doesn't necessarily mean it's running as expected

aahlenst commented 4 years ago

Only Jenkins knows what is connected and what not. Therefore, I'd query https://ci.adoptopenjdk.net/computer/api/json?pretty=true. This would also allow us to define sets of nodes and be alerted if there are, for example, less than X machines with a specific label.

Willsparker commented 4 years ago

Is there any way to query the API to return the info for a single node? I can't find any documentation showing how to use the API.

aahlenst commented 4 years ago

Append /api to any URL you open in Jenkins and you get the API. https://ci.adoptopenjdk.net/computer/build-azure-win2012r2-x64-1/api/json?pretty=true gives you info about build-azure-win2012r2-x64-1.

Willsparker commented 4 years ago

Ah! Excellent, thanks very much :-)

Willsparker commented 4 years ago

Okay, I wrote a script which I've been able to get working in Nagios

image

** For the purposes of testing, I called the node build-scaleway-ubuntu1604-x64-1. It's actually just a VM running on my machine, but the check_jenkins command I made uses what's defined as the hostname to query the Jenkins API.

#!/bin/bash

if [ -z "$1" ]; then
  echo "UNKNOWN - Invalid arguments"
  echo "Usage: $0 < agent_name >"
  exit 3
fi

# Quote the URL so the shell doesn't treat '?' as a glob character
wget -q "https://ci.adoptopenjdk.net/computer/$1/api/json?pretty=true" -O "jenkins_query_$1"
if [[ $? != 0 ]]; then
  echo "UNKNOWN - Failed to get agent information"
  rm -f "jenkins_query_$1"
  exit 3
fi

is_agent_offline=$(awk '/"offline"/{gsub("[,]","",$3); print$3}' < "jenkins_query_$1")
is_agent_temp_offline=$(awk '/"temporarilyOffline"/{gsub("[,]","",$3); print$3}' < "jenkins_query_$1")
rm -f "jenkins_query_$1"

if [[ $is_agent_offline == "false" ]]; then
  echo "OK - Jenkins Agent is connected"
  exit 0
elif [[ $is_agent_offline == "true" ]] && [[ $is_agent_temp_offline == "true" ]]; then
  echo "WARNING - Jenkins Agent temporarily disconnected"
  exit 1
elif [[ $is_agent_offline == "true" ]] && [[ $is_agent_temp_offline == "false" ]]; then
  echo "CRITICAL - Jenkins agent is fully disconnected"
  exit 2
else
  echo "UNKNOWN - Couldn't find 'offline' entry in JSON"
  exit 3
fi

Pretty simple, Syntax is ./check_agent <Name of Node>, and it runs on the Nagios Server itself, as it just queries the Jenkins API- also means we only have to put it on one machine instead of ~100ish.

aahlenst commented 4 years ago

Looks great, apart from one thing: awk isn't the best choice for querying JSON. https://stedolan.github.io/jq/ is much more reliable and digestible. curl is also more friendly for saving the response in a variable. Saves you the temporary file and the problems associated with it.

Pulled from a script on my disk:

CURL_RESPONSE=$(curl -s -H "Accept: application/json" -H "Authorization: Bearer $TOKEN" "https://example.com")

Willsparker commented 4 years ago

Updated to use JQ and curl :+1:

#!/bin/bash

if [ -z "$1" ]; then
  echo "UNKNOWN - Invalid arguments"
  echo "Usage: $0 < agent_name >"
  exit 3
fi

if ! command -v jq &> /dev/null; then
  echo "UNKNOWN - JQ isn't installed"
  exit 3
fi

# Quote the URL so the shell doesn't treat '?' as a glob character
CURL_RESPONSE=$(curl -s "https://ci.adoptopenjdk.net/computer/$1/api/json?pretty=true")
if [[ $? != 0 ]]; then
  echo "UNKNOWN - Failed to get agent information"
  exit 3
fi

is_agent_offline=$(echo "$CURL_RESPONSE" | jq .offline)
is_agent_temp_offline=$(echo "$CURL_RESPONSE" | jq .temporarilyOffline)

if [[ $is_agent_offline == "false" ]]; then
  echo "OK - Jenkins Agent is connected"
  exit 0
elif [[ $is_agent_offline == "true" ]] && [[ $is_agent_temp_offline == "true" ]]; then
  echo "WARNING - Jenkins Agent temporarily disconnected"
  exit 1
elif [[ $is_agent_offline == "true" ]] && [[ $is_agent_temp_offline == "false" ]]; then
  echo "CRITICAL - Jenkins agent is fully disconnected"
  exit 2
else
  echo "UNKNOWN - Couldn't find 'offline' entry in JSON"
  exit 3
fi

aahlenst commented 4 years ago

Looks great.

Would be great to have a script that checks for specific labels or label combinations and alerts if we lose a certain percentage of machines.

Willsparker commented 4 years ago

On the Nagios server, I've made a backup of the objects and servers directories (just in case) at /usr/local/nagios/cfg_backup_281020. I'm going to start generating the .cfg files for all the servers. I've tested the script below and managed to get it working, but if there's anything I'm missing, let me know :-)

#!/bin/bash

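# Stop early if no variable file was given, rather than generating an empty config
if [[ ! -f $1 ]]; then
  echo "Input a variable file"
  exit 1
fi

source "$1"
export FILENAME="$HOSTNAME.cfg"

case $(echo "$DISTRO" | tr -d '[:digit:]' | tr '[:upper:]' '[:lower:]') in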
  "ubuntu" | "debian") 
    PKGMNGR="apt";;
  "rhel" | "centos")
    PKGMNGR="yum";;
esac

echo "DEBUG:
  FILENAME: $FILENAME
  HOSTNAME: $HOSTNAME
  ALIAS   : $ALIAS
  ADDRESS : $IP_ADDRESS
  DISTRO  : $DISTRO
  PKGMNGR : $PKGMNGR
  SPECIAL : $EXTRA
"  

echo " # Checks SSH to determine if the host is available
define host {
        use                             linux-server
        host_name                       $HOSTNAME
        alias                           $ALIAS
        address                         $IP_ADDRESS
        check_command                   check_ssh!-4 -t 60
        max_check_attempts              5
        check_period                    24x7
        notification_interval           30
        notification_period             24x7
}" >> $FILENAME

echo "define service {
        use                             generic-service        
        host_name                       $HOSTNAME
        service_description             Disk Usage
        check_command                   check_remote_disk!10%!5%!/
        check_period                    once-a-day-at-8
}" >> $FILENAME

echo "define service {
        use                             generic-service        
        host_name                       $HOSTNAME
        service_description             Updates-Required - $PKGMNGR
        check_command                   check_remote_${PKGMNGR}
        check_period                    once-a-day-at-8
}" >> $FILENAME

echo "define service {
        use                             generic-service
        host_name                       $HOSTNAME
        service_description             Check Free Memory
        check_command                   check_remote_mem!10!5
        check_interval                  30
}" >> $FILENAME

# This only runs with centos/rhel 7+, as centos6 doesn't use systemd
if [[ $(echo "$DISTRO" | tr -d '[:alpha:]') != 6 ]]; then
  echo "define service {
        use                             generic-service
        host_name                       $HOSTNAME
        service_description             Network Time Sync
        check_command                   check_remote_timesync
        check_period                    once-a-day-at-8
}" >> $FILENAME
fi

echo "define service {
        use                             generic-service
        host_name                       $HOSTNAME
        service_description             Check if Jenkins Agent Connected
        check_command                   check_agent!$HOSTNAME
        check_period                    once-a-day-at-8
}" >> $FILENAME

# Only for the servers that provide services: check CPU load and the SSL certificate
if [[ $EXTRA == 1 ]]; then
  echo "define service {
        use                             generic-service
        host_name                       $HOSTNAME
        service_description             Check CPU Load
        check_command                   check_remote_load
        check_interval                  10
}" >> $FILENAME

  echo "define service {
        use                             generic-service
        host_name                       $HOSTNAME
        service_description             Check_SSL_Cert
        check_command                   check_ssl_cert!$HOSTNAME         
        check_period                    once-a-day-at-8
}" >> $FILENAME
fi 

An example of the variable file is as follows:

export HOSTNAME="build-test-test-x64-1"
export ALIAS="Build Host"
export IP_ADDRESS="127.0.0.1"
export DISTRO=CentOS7
export EXTRA=1

The once-a-day-at-8 time period is defined as :

define timeperiod{
        timeperiod_name once-a-day-at-8
        alias           Between 8am and 9am GMT every day
        sunday          9:00-10:00
        monday          9:00-10:00
        tuesday         9:00-10:00
        wednesday       9:00-10:00
        thursday        9:00-10:00
        friday          9:00-10:00
        saturday        9:00-10:00
}

According to a note left by Brad Blondin, the Nagios server is on CEST time, so to get 8-9am in GMT, the time period on the server needs to be 9-10am (I think).
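The offset can be checked rather than guessed. This sketch assumes GNU date and that the server follows Central European time rules, written as a POSIX TZ string so no timezone database is needed. SERVER_TZ and gmt_to_server_time are hypothetical names.

```bash
#!/bin/bash
# Sketch: convert a GMT wall-clock time on a given date into server-local time.
# Assumptions: GNU date is available, and the Nagios server follows Central
# European rules (CET in winter, CEST in summer).
SERVER_TZ='CET-1CEST,M3.5.0,M10.5.0/3'

gmt_to_server_time() {
  # $1 = date (YYYY-MM-DD), $2 = time in GMT (HH:MM)
  TZ="$SERVER_TZ" date -d "$1 $2 UTC" +%H:%M
}

gmt_to_server_time 2020-10-28 08:00   # prints 09:00 (winter, CET)
gmt_to_server_time 2020-07-01 08:00   # prints 10:00 (summer, CEST)
```

If the server does observe daylight saving, a fixed 9:00-10:00 local window would map to 8-9am GMT in winter but 7-8am GMT in summer, so the period may need revisiting when the clocks change.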

The extra commands that need to be defined in /usr/local/nagios/etc/objects/commands.cfg are as follows:

##############
#
# COMMANDS ADDED (By Willsparker)
#
##############

define command{
        command_name    check_remote_disk
        command_line    $USER1$/check_by_ssh -t 360 -H $HOSTADDRESS$ -C '/usr/lib/nagios/plugins/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$'
}

define command{
        command_name    check_remote_yum
        command_line    $USER1$/check_by_ssh -t 360 -H $HOSTADDRESS$ -C '/usr/lib/nagios/plugins/check_yum -t 60'
}

define command{
        command_name    check_remote_apt
        command_line    $USER1$/check_by_ssh -t 360 -H $HOSTADDRESS$ -C '/usr/lib/nagios/plugins/check_apt -t 60'
}

define command{
        command_name    check_remote_load
        command_line    $USER1$/check_by_ssh -t 360 -H $HOSTADDRESS$ -C '/usr/lib/nagios/plugins/check_load -w $ARG1$ -c $ARG2$'
}

# Note: This plugin needs to be manually installed on remote nodes: https://github.com/justintime/nagios-plugins/tree/master/check_mem
define command{
        command_name    check_remote_mem
        command_line    $USER1$/check_by_ssh -t 360 -H $HOSTADDRESS$ -C '/usr/lib/nagios/plugins/check_mem -f -C -w $ARG1$ -c $ARG2$'
}

# Note: This plugin needs to be manually installed on remote nodes
define command{
        command_name    check_remote_timesync
        command_line    $USER1$/check_by_ssh -t 360 -H $HOSTADDRESS$ -C '/usr/lib/nagios/plugins/check_timesync'
}

# Note: This plugin needs to be manually installed on the Nagios server
define command{
        command_name    check_agent
        command_line    $USER1$/check_agent $ARG1$
}

# Note: This plugin needs to be manually installed on the Nagios server
define command{
        command_name    check_ssl_cert
        command_line    $USER1$/check_ssl_cert -H $ARG1$
}

I think that's all the prep work I need to do before re-doing the Nagios setup, except the notifications:
1) Should we keep it as it is, with Nagios pinging the #infrastructure-bot channel?
2) Should I alter the notification period / interval?
3) Do all tasks need notifications enabled?

Would be great to have a script that checks for specific labels or label combinations and alerts if we lose a certain percentage of machines.

@aahlenst I'll have a look at adding that today :-)

aahlenst commented 4 years ago

@Willsparker Thanks for the great work. Question: Why no Ansible playbook for the config?

Willsparker commented 4 years ago

Honestly, I wasn't aware that the playbook could be used for the config :sweat_smile: I'll look to see if I can use the roles and merge the script I wrote above into the Nagios_Ansible_Config_tool.sh script mentioned in https://github.com/AdoptOpenJDK/openjdk-infrastructure/blob/master/ansible/playbooks/AdoptOpenJDK_Unix_Playbook/roles/Nagios_Master_Config/tasks/main.yml. It'll save me a lot of manual work :-)

Willsparker commented 4 years ago

@aahlenst Script to check the percentage of machines online in the label

#!/bin/bash

if [ -z "$1" ] || [ -z "$2" ] || [ -z "$3" ]; then
  echo "UNKNOWN - Invalid arguments"
  echo "Usage: $0 <Label> <Warning_Level> <Critical_Level>"
  exit 3
fi

if ! command -v jq &> /dev/null; then
  echo "UNKNOWN - JQ isn't installed"
  exit 3
fi

# Get list of machines in label
mapfile -t machine_array < <(curl -s "https://ci.adoptopenjdk.net/label/$1/api/json" | jq -r '.nodes[] | .nodeName')

# For each machine, query if they're connected
# (a failed query yields an empty entry, which is counted as offline below)
response_array=()
for node in "${machine_array[@]}"
do
  response_array+=("$(curl -s "https://ci.adoptopenjdk.net/computer/${node}/api/json" | jq .offline)")
done

online=0
offline=0
for response in ${response_array[@]}
do
  if [[ ${response} == "false" ]]; then 
    online=$((online+1))
  else
    offline=$((offline+1))
  fi
done

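# Guard against an empty label (or a failed query), which would otherwise divide by zero
if (( online + offline == 0 )); then
  echo "UNKNOWN - No machines found in '$1' label"
  exit 3
fi

export percentage_online=$(echo "scale=2; ($online/($offline+$online)) * 100" | bc -l)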
if (( $(echo "$percentage_online < $3" | bc -l) )); then
  echo "CRITICAL - $percentage_online% machines online in '$1' label"
  echo "$online online machines; $offline offline machines"
  exit 2 
elif (( $(echo "$percentage_online < $2" | bc -l) )); then 
  echo "WARNING - $percentage_online% machines online in '$1' label"
  echo "$online online machines; $offline offline machines"
  exit 1
else
  echo "OK - $percentage_online% machines online in '$1' label"
  echo "$online online machines; $offline offline machines"
  exit 0
fi

Any requested changes?

aahlenst commented 4 years ago

More questions than change requests:

Willsparker commented 4 years ago

I think ... I've got it working using the tools that Brad Blondin made when he initially set up the Nagios stuff!

Here's a list of things I had to do to make it work (which may or may not be related to me running this on a machine that isn't in the inventory):

For putting in the label checker script, I'll manually put that in, and have it run on the Nagios server :-)

Willsparker commented 4 years ago

To double-check, I'm going to try this on a C7 machine that @sxa provided me, but this time I'll connect it to Jenkins (without putting any labels on it) to see if that fixes a lot of my previous issues. Once I've got that working, I can do this for all the machines :-)

Willsparker commented 4 years ago

@aahlenst Removed the export - Not sure why I put that in. And apparently it does! (completely intentional!) You just need to make sure you put the labels in speech marks, otherwise bash doesn't like the & signs

will@will-XPS-13-9360:~/Documents/nagios_testing$ ./check_label.sh "build&&linux&&s390x" 50 30
OK - 100.00% machines online in 'build&&linux&&s390x' label
2 online machines; 0 offline machines

Willsparker commented 4 years ago

With the help of @gdams, we were able to get the test-aws-ubuntu1804-armv8-1 machine added to Nagios. Final list of stuff to do:

Additional:

aahlenst commented 4 years ago

Setup a cron job on the Nagios Server that queries Jenkins, and removes the machine entry from /usr/local/nagios/etc/servers/, if it's no longer in Jenkins.

I'm not particularly keen on using Jenkins as the source of truth for our inventory. I'd rather let Ansible handle that (we have to remove the servers from the inventory anyway). Instead, I'd add a check that alerts us if a machine pops up in Jenkins that isn't known to Nagios (with the possibility to ignore dynamic agents).

Willsparker commented 4 years ago

The issue is, if we're removing a machine from the inventory (i.e. through a PR), Ansible isn't being used, and the machines will just stay there. And my only concern with that check is that sometimes ad-hoc machines are added to Jenkins (e.g. test-will-debian-riscv-1) that don't necessarily need monitoring.

Willsparker commented 4 years ago

Nagios Core has been updated to 4.4.6 from 4.3.4, following this guide

Willsparker commented 4 years ago

@aahlenst What labels should we be checking for?

aahlenst commented 4 years ago

The issue is, if we're removing a machine from the inventory (i.e. through a PR), Ansible isn't being used, and the machines will just stay there.

Yeah, but Nagios will start nagging us and we need more discipline in this area. Let's assume the worst and someone manages to temporarily remove machines from Jenkins. If Nagios automatically drops the machine, it might take us weeks to realize that something has happened. That gives me shivers.

And my only concern with that check is that sometimes ad-hoc machines are added to Jenkins ( i.e. test-will-debian-riscv-1/ ) that don't necessarily need monitoring.

This is just bad practice (I'm aware that I'm guilty here, too). We need to separate "regular" from experimental machines, for example by prefixing them with experimental-. And we need to diligently monitor the presence of machines, not just the accumulated number. Any unknown new machine being added should trigger a red alert.

Willsparker commented 4 years ago

Yeah, but Nagios will start nagging us and we need more discipline in this area. Let's assume the worst and someone manages to temporarily remove machines from Jenkins. If Nagios automatically drops the machine, it might take us weeks to realize that something has happened. That gives me shivers. This is just bad practice (I'm aware that I'm guilty here, too). We need to separate "regular" from experimental machines, for example by prefixing them with experimental-. And we need to diligently monitor the presence of machines, not just the accumulated number. Any unknown new machine being added should trigger a red alert.

Both good points. When you say "Any unknown new machine being added should trigger a red alert", are you referring to any new machines being added to the inventory that aren't being monitored (except, for example, machines prefixed with experimental-)?

aahlenst commented 4 years ago

are you referring to any new machines being added to the inventory that aren't being monitored (except, for example, machines prefixed with experimental-)?

If a new machine appears in Jenkins that is not known to Nagios, it should trigger an alert. For that to work, we need a list of known machines. One possibility to achieve that could be that we combine our inventory with a list of experimental machines that is maintained alongside the Ansible inventory.
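One way this check could look, as a rough sketch. It assumes two plain-text lists (one node name per line) are already available, and it uses the experimental- prefix suggested earlier as the ignore rule. The function names and file layout are hypothetical.

```bash
#!/bin/bash
# Sketch: report Jenkins agents that Nagios does not know about.
# $1 = file with one Jenkins node name per line
# $2 = file with one known (monitored) host name per line
# Assumption: experimental machines carry an "experimental-" prefix and are ignored.
find_unknown_agents() {
  comm -23 <(grep -v '^experimental-' "$1" | sort -u) <(sort -u "$2")
}

check_unknown_agents() {
  local unknown joined
  unknown=$(find_unknown_agents "$1" "$2")
  if [[ -n $unknown ]]; then
    joined=${unknown//$'\n'/ }   # put the names on one line for the Nagios status text
    echo "CRITICAL - Agents in Jenkins unknown to Nagios: $joined"
    return 2
  fi
  echo "OK - All Jenkins agents are known to Nagios"
  return 0
}
```

Swapping the two file arguments to comm would give the reverse check: machines Nagios monitors that have disappeared from Jenkins.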

Willsparker commented 4 years ago

We could use the config files in /usr/local/nagios/etc/servers/ to check for machines that are known to Nagios. Otherwise, I would have concerns with maintaining 2 separate lists- before I looked at https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/619 , even the inventory wasn't very well maintained.
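A small sketch of how that list could be pulled out of the config files. It assumes every host_name directive sits on its own line, as in the generated files earlier in this thread. list_nagios_hosts is a hypothetical helper name.

```bash
#!/bin/bash
# Sketch: list the hosts Nagios knows about from the per-server config files.
# $1 = directory holding the .cfg files (defaults to the path mentioned above)
# Assumption: each "host_name <name>" directive is on its own line.
list_nagios_hosts() {
  local cfg_dir="${1:-/usr/local/nagios/etc/servers}"
  # host_name appears in the host block and in every service block, so de-duplicate
  awk '$1 == "host_name" { print $2 }' "$cfg_dir"/*.cfg | sort -u
}
```

The output is one host name per line, which makes it easy to diff against a list of nodes taken from Jenkins or from the Ansible inventory.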

Willsparker commented 4 years ago

Changed the timeperiod once-a-day-at-8 to 9:00-10:00, as (due to the nagios server being in CET) that refers to 8-9 in GMT, which should hopefully be after all the nightlies have finished.

Willsparker commented 4 years ago

I've been able to add all the machines I've got access to, as well as manually changed some of the machines I don't have access to. The following are the machines that are yet to be added to/updated in Nagios:

build-linaro-centos76-armv8-2 (timeout)
build-packet-ubuntu1804-armv8-1 (Connection closed)

docker-aws-ubuntu1604-x64-1 x
docker-aws-ubuntu1604-x64-2 x
docker-godaddy-ubuntu1604-x64-1 x
docker-scaleway-ubuntu1604-armv7-1 (Connection Closed)

test-ibm-aix71-ppc64-1 x
test-ibm-aix71-ppc64-2 x
test-ibmcloud-ubuntu1604-x64-1 x
test-macstadium-macos11-arm64-1 x
test-macstadium-macos11-arm64-2 x

aahlenst commented 4 years ago

We could use the config files in /usr/local/nagios/etc/servers/ to check for machines that are known to Nagios.

Ideally, the Ansible inventory serves as single source of truth for those config files. If that isn't possible at the moment, we have to live with it, but should have a ticket that states what's left to do.

andrew-m-leonard commented 3 years ago

@Willsparker adding to your requirements, I would like to see alerts added that warn (via Slack) when a given node's free disk space is within a certain margin (say ~3GB) of our 10GB Jenkins offline limit, i.e. warn if <13GB free. This is so we get a heads-up that a node is close to being taken offline by Jenkins and can act before it fails a nightly build.
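A rough sketch of such a threshold check, using the 10GB limit and ~3GB margin from the comment above. Treating "already below the limit" as CRITICAL is an assumption, and both function names are hypothetical.

```bash
#!/bin/bash
# Sketch: warn before a node reaches the Jenkins free-space offline limit.
# $1 = free space in KiB, $2 = offline limit in GiB, $3 = warning margin in GiB
# Assumption: being below the limit itself is reported as CRITICAL.
classify_free_space() {
  local free_kb=$1 limit_kb=$(( $2 * 1024 * 1024 )) margin_kb=$(( $3 * 1024 * 1024 ))
  local free_gb=$(( free_kb / 1024 / 1024 ))
  if (( free_kb < limit_kb )); then
    echo "CRITICAL - ${free_gb}GB free, below the ${2}GB Jenkins offline limit"
    return 2
  elif (( free_kb < limit_kb + margin_kb )); then
    echo "WARNING - ${free_gb}GB free, within ${3}GB of the ${2}GB Jenkins offline limit"
    return 1
  fi
  echo "OK - ${free_gb}GB free"
  return 0
}

# Free space (KiB) on the filesystem holding a path, using POSIX-format df output.
free_kb_for() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Example: classify_free_space "$(free_kb_for /)" 10 3
```

Run through check_by_ssh, something like this could sit alongside the existing percentage-based Disk Usage service, since the Jenkins limit is an absolute size rather than a percentage.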

sxa commented 3 years ago

We could use the config files in /usr/local/nagios/etc/servers/ to check for machines that are known to Nagios.

Ideally, the Ansible inventory serves as single source of truth for those config files. If that isn't possible at the moment, we have to live with it, but should have a ticket that states what's left to do.

That should absolutely be accurate as far as "live" machines are concerned, and I've been acting to resolve any discrepancies as soon as they show up, so hopefully we don't need a ticket with a long list of todos on that one ;-) Just PRs put in to fix them. So yes, that list should be definitive.