karianna closed this issue 3 years ago.
This has been open for a while now and nobody has spoken up. Do we all agree to shut down the nagios server?
@sxa your call as the main infra rep - do you want something of this nature? It seems odd that we don't have anything, unless Jenkins' reporting is deemed good enough.
I don't personally feel there's any reason to shut it down - ideally we'd be making use of it, but I'm not at present because too many other things keep coming up. That doesn't mean it's not a useful thing to have in place.
I'm not aware of anyone who has a reason to shut it down completely.
OK, so I'll relabel this as 'give it a spruce up'
Nagios should at least be updated to ensure we remain secure there. The latest version of Core is now 4.4.6.
Paging @Willsparker as he wanted to look at this from the other ticket on monitoring SSL certs
If bringing Nagios back to life requires a lot of work, it might make sense to check beforehand where that thing is located and what else needs to be done so that's ready for the future security- and performance-wise.
I can look at updating the playbooks to have the latest version of Nagios, that (hopefully) shouldn't be a problem :-)
@sxa @karianna If I understand correctly, there's no Nagios server at the moment. If that's true, can we re-evaluate and write down how we ended up with Nagios and which edition we're going to use? As I already said on Slack, if we have Nagios Core only, Icinga might be the better choice.
@Willsparker / all - we actually have a Nagios Master in place already at 78.47.239.96
Oh cool :-) I'm currently looking at installing it on a VM and playing around with it to figure it out
edit: It has the superuser on it so I can log in too :+1:
I'm going to start looking at this, but I thought I'd ask what everyone would want to actually be monitored via Nagios?
Currently there are a lot of default checks for each host that I don't think are entirely necessary (e.g. the 'PING' service - by virtue of the other services running on the host, a connection issue would be caught by those anyway, so a separate PING service becomes unnecessary). The vast majority of these checks also notify #infrastructure-bot, which results in an awful lot of output that ends up becoming white noise. Certain services could notify the relevant Slack channels instead, or, if a service isn't really important, notifications could be disabled entirely.
So: what services should run for each type of machine (i.e. build, test, infra, perf), where do the services notify if something goes wrong (if at all), and are there any special exceptions (e.g. the ci.adoptopenjdk.net machine will have a service to monitor its SSL certificate: #1568)?
Nagios and its forks do indeed report, one way or another, if no result comes back (state "unknown").
As we're dealing with build/test servers, we should check for the problems that concern us. Maxing out RAM/CPU probably does not, filling up the disk does.
Shooting from the hip:
For the Jenkins server, TRSS and other servers that provide services: monitoring CPU, RAM, SSL certificate might be a good idea, too. The SSL certificate check would also be good for the website and the API.
Looking at https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1602, it might make sense to monitor RAM.
Maybe all of those as a once-a-day check? Possibly in the morning so it doesn't run at the same time as the nightlies.
Also, what network time sync plugin would that be? Currently we're not monitoring Windows, so that's something I should probably look into as well :-)
Once a day sounds good. Mornings (EMEA time) probably sane. Network time sync probably tries to keep all of the machines sync'd timewise (you can see the drift when you look at all of the nodes in Jenkins actually).
Windows would be great 👍
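On the Linux side, a minimal first pass at such a time-sync check could look something like the sketch below (the exact timedatectl wording differs between systemd versions, so the grep covers both forms; treat it as a starting point rather than the finished plugin):

#!/bin/bash
# Rough sketch of a possible check_timesync plugin: asks systemd whether the
# clock is NTP-synchronised. Output wording varies across systemd versions.
if ! command -v timedatectl &> /dev/null; then
    echo "UNKNOWN - timedatectl not available on this host"
    exit 3
fi
if timedatectl status | grep -Eq '(NTP synchronized|System clock synchronized): yes'; then
    echo "OK - system clock is NTP-synchronised"
    exit 0
fi
echo "WARNING - system clock is not NTP-synchronised"
exit 1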
Okay, cool - I'll get started on that then :-) I'll keep a backup of all the old config files in a directory somewhere, just in case.
(note to self) Useful Documentation / resources I found (I'll update this as I go) : "Setting up a Nagios Server on Ubuntu1604" : https://www.howtoforge.com/tutorial/how-to-install-nagios-on-ubuntu-16-04/ "Template-Based Object Configuration" : http://nagios.manubulon.com/traduction/docs25en/xodtemplate.html "Event Handlers" : https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/eventhandlers.html
I've been looking through and making a script to generate a config file for each host automatically (I'll put it here once it's done), but I was also looking at adding a service to query whether a given node is connected to Jenkins, and there doesn't appear to be an immediately obvious way. I'd be able to use check_by_ssh to run a script that looks for a Java process running Jenkins, but that doesn't necessarily mean it's running as expected.
Only Jenkins knows what is connected and what not. Therefore, I'd query https://ci.adoptopenjdk.net/computer/api/json?pretty=true. This would also allow us to define sets of nodes and be alerted if there are, for example, less than X machines with a specific label.
Is there any way to query the API to return the info for a single node? I can't find any documentation showing how to use the API.
Append /api to any URL you open in Jenkins and you get the API. https://ci.adoptopenjdk.net/computer/build-azure-win2012r2-x64-1/api/json?pretty=true gives you info about build-azure-win2012r2-x64-1.
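For example, the offline flag for a single agent can be pulled out with a one-liner like this (jq needs to be installed; the agent name is just an example):

curl -s "https://ci.adoptopenjdk.net/computer/build-azure-win2012r2-x64-1/api/json" | jq .offline
# prints "true" or "false"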
Ah! Excellent, thanks very much :-)
Okay, I wrote a script which I've been able to get working in Nagios. (For the purposes of testing, I called the node build-scaleway-ubuntu1604-x64-1; it's actually just a VM running on my machine, but the check_jenkins command I made uses what's defined as the hostname to query the Jenkins API.)
#!/bin/bash
# Queries the Jenkins API for the named agent and maps its state to a Nagios exit code.
if [ -z "$1" ]; then
    echo "UNKNOWN - Invalid arguments"
    echo "Usage: $0 <agent_name>"
    exit 3
fi
wget -q "https://ci.adoptopenjdk.net/computer/$1/api/json?pretty=true" -O "jenkins_query_$1"
if [[ $? -ne 0 ]]; then
    echo "UNKNOWN - Failed to get agent information"
    rm -f "jenkins_query_$1"
    exit 3
fi
# Pull the "offline" and "temporarilyOffline" booleans out of the pretty-printed JSON
is_agent_offline=$(awk '/"offline"/{gsub("[,]","",$3); print $3}' < "jenkins_query_$1")
is_agent_temp_offline=$(awk '/"temporarilyOffline"/{gsub("[,]","",$3); print $3}' < "jenkins_query_$1")
rm -f "jenkins_query_$1"
if [[ $is_agent_offline == "false" ]]; then
    echo "OK - Jenkins Agent is connected"
    exit 0
elif [[ $is_agent_offline == "true" ]] && [[ $is_agent_temp_offline == "true" ]]; then
    echo "WARNING - Jenkins Agent temporarily disconnected"
    exit 1
elif [[ $is_agent_offline == "true" ]] && [[ $is_agent_temp_offline == "false" ]]; then
    echo "CRITICAL - Jenkins agent is fully disconnected"
    exit 2
else
    echo "UNKNOWN - Couldn't find 'offline' entry in JSON"
    exit 3
fi
Pretty simple. Syntax is ./check_agent <Name of Node>, and it runs on the Nagios server itself, as it just queries the Jenkins API - which also means we only have to put it on one machine instead of ~100.
Looks great, apart from one thing: awk isn't the best choice for querying JSON. https://stedolan.github.io/jq/ is much more reliable and digestible. curl is also more friendly for saving the response in a variable. Saves you the temporary file and the problems associated with it.
Pulled from a script on my disk:
CURL_RESPONSE=$(curl -s -H "Accept: application/json" -H "Authorization: Bearer $TOKEN" "https://example.com")
Updated to use jq and curl :+1:
#!/bin/bash
# Queries the Jenkins API (via curl + jq) for the named agent and maps its state to a Nagios exit code.
if [ -z "$1" ]; then
    echo "UNKNOWN - Invalid arguments"
    echo "Usage: $0 <agent_name>"
    exit 3
fi
if ! command -v jq &> /dev/null; then
    echo "UNKNOWN - JQ isn't installed"
    exit 3
fi
CURL_RESPONSE=$(curl -s "https://ci.adoptopenjdk.net/computer/$1/api/json?pretty=true")
if [[ $? -ne 0 ]]; then
    echo "UNKNOWN - Failed to get agent information"
    exit 3
fi
is_agent_offline=$(echo "$CURL_RESPONSE" | jq .offline)
is_agent_temp_offline=$(echo "$CURL_RESPONSE" | jq .temporarilyOffline)
if [[ $is_agent_offline == "false" ]]; then
    echo "OK - Jenkins Agent is connected"
    exit 0
elif [[ $is_agent_offline == "true" ]] && [[ $is_agent_temp_offline == "true" ]]; then
    echo "WARNING - Jenkins Agent temporarily disconnected"
    exit 1
elif [[ $is_agent_offline == "true" ]] && [[ $is_agent_temp_offline == "false" ]]; then
    echo "CRITICAL - Jenkins agent is fully disconnected"
    exit 2
else
    echo "UNKNOWN - Couldn't find 'offline' entry in JSON"
    exit 3
fi
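For reference, running it locally on the Nagios server against a connected agent should produce something like (the agent name is just an example):

./check_agent build-azure-win2012r2-x64-1
OK - Jenkins Agent is connected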
Looks great.
Would be great to have a script that checks for specific labels or label combinations and alerts if we lose a certain percentage of machines.
On the Nagios server, I've made a backup of the objects and servers directories (just in case) at /usr/local/nagios/cfg_backup_281020. I'm going to start generating the .cfg files for all the servers. I've tested the script below and managed to get it working, but if there's anything I'm missing, let me know :-)
#!/bin/bash
# Exit early if no variable file was supplied
[[ ! -f "$1" ]] && { echo "Input a variable file"; exit 1; }
source "$1"
export FILENAME="$HOSTNAME.cfg"
# Work out the package manager from the distro name (e.g. CentOS7 -> centos -> yum)
case $(echo "$DISTRO" | tr -d '[:digit:]' | tr '[:upper:]' '[:lower:]') in
    "ubuntu" | "debian")
        PKGMNGR="apt";;
    "rhel" | "centos")
        PKGMNGR="yum";;
    *)
        echo "Unknown distro: $DISTRO"; exit 1;;
esac
echo "DEBUG:
FILENAME: $FILENAME
HOSTNAME: $HOSTNAME
ALIAS : $ALIAS
ADDRESS : $IP_ADDRESS
DISTRO : $DISTRO
PKGMNGR : $PKGMNGR
SPECIAL : $EXTRA
"
echo " # Checks SSH to determine if the host is available
define host {
use linux-server
host_name $HOSTNAME
alias $ALIAS
address $IP_ADDRESS
check_command check_ssh!-4 -t 60
max_check_attempts 5
check_period 24x7
notification_interval 30
notification_period 24x7
}" >> $FILENAME
echo "define service {
use generic-service
host_name $HOSTNAME
service_description Disk Usage
check_command check_remote_disk!10%!5%!/
check_period once-a-day-at-8
}" >> $FILENAME
echo "define service {
use generic-service
host_name $HOSTNAME
service_description Updates-Required - $PKGMNGR
check_command check_remote_${PKGMNGR}
check_period once-a-day-at-8
}" >> $FILENAME
echo "define service {
use generic-service
host_name $HOSTNAME
service_description Check Free Memory
check_command check_remote_mem!10!5
check_interval 30
}" >> $FILENAME
# This only runs with CentOS/RHEL 7+, as CentOS 6 doesn't use systemd
if [[ $(echo "$DISTRO" | tr -d '[:alpha:]') != 6 ]]; then
echo "define service {
use generic-service
host_name $HOSTNAME
service_description Network Time Sync
check_command check_remote_timesync
check_period once-a-day-at-8
}" >> $FILENAME
fi
echo "define service {
use generic-service
host_name $HOSTNAME
host_name $HOSTNAME
service_description Check if Jenkins Agent Connected
check_command check_agent!$HOSTNAME
check_period once-a-day-at-8
}" >> $FILENAME
# Only for the servers that need an SSL certificate check (these also get a CPU load check)
if [[ $EXTRA == 1 ]]; then
echo "define service {
use generic-service
host_name $HOSTNAME
service_description Check CPU Load
check_command check_remote_load
check_interval 10
}" >> $FILENAME
echo "define service {
use generic-service
host_name $HOSTNAME
service_description Check_SSL_Cert
check_command check_ssl_cert!$HOSTNAME
check_period once-a-day-at-8
}" >> $FILENAME
fi
An example of the variable file is as follows:
export HOSTNAME="build-test-test-x64-1"
export ALIAS="Build Host"
export IP_ADDRESS="127.0.0.1"
export DISTRO=CentOS7
export EXTRA=1
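Assuming the variable file above is saved as, say, build-test-test-x64-1.vars and the generator script as generate_host_cfg.sh (both filenames are just placeholders), generating a host config would look like:

./generate_host_cfg.sh build-test-test-x64-1.vars
# writes build-test-test-x64-1.cfg (the host definition plus the services above) into the current directory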
The once-a-day-at-8 time period is defined as:
define timeperiod{
timeperiod_name once-a-day-at-8
alias Between 8am and 9am GMT every day
sunday 9:00-10:00
monday 9:00-10:00
tuesday 9:00-10:00
wednesday 9:00-10:00
thursday 9:00-10:00
friday 9:00-10:00
saturday 9:00-10:00
}
According to a note left by Brad Blondin, the Nagios server is on CEST time, so to cover 8-9am GMT, the server needs 9-10am (I think).
The extra commands that need to be defined in /usr/local/nagios/etc/objects/commands.cfg are as follows:
##############
#
# COMMANDS ADDED (By Willsparker)
#
##############
define command{
command_name check_remote_disk
command_line $USER1$/check_by_ssh -t 360 -H $HOSTADDRESS$ -C '/usr/lib/nagios/plugins/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$'
}
define command{
command_name check_remote_yum
command_line $USER1$/check_by_ssh -t 360 -H $HOSTADDRESS$ -C '/usr/lib/nagios/plugins/check_yum -t 60'
}
define command{
command_name check_remote_apt
command_line $USER1$/check_by_ssh -t 360 -H $HOSTADDRESS$ -C '/usr/lib/nagios/plugins/check_apt -t 60'
}
define command{
command_name check_remote_load
command_line $USER1$/check_by_ssh -t 360 -H $HOSTADDRESS$ -C '/usr/lib/nagios/plugins/check_load -w $ARG1$ -c $ARG2$'
}
# Note: This plugin needs to be manually installed on remote nodes: https://github.com/justintime/nagios-plugins/tree/master/check_mem
define command{
command_name check_remote_mem
command_line $USER1$/check_by_ssh -t 360 -H $HOSTADDRESS$ -C '/usr/lib/nagios/plugins/check_mem -f -C -w $ARG1$ -c $ARG2$'
}
# Note: This plugin needs to be manually installed on remote nodes
define command{
command_name check_remote_timesync
command_line $USER1$/check_by_ssh -t 360 -H $HOSTADDRESS$ -C '/usr/lib/nagios/plugins/check_timesync'
}
# Note: This plugin needs to be manually installed on the Nagios server
define command{
command_name check_agent
command_line $USER1$/check_agent $ARG1$
}
# Note: This plugin needs to be manually installed on the Nagios server
define command{
command_name check_ssl_cert
command_line $USER1$/check_ssl_cert -H $ARG1$
}
I think that's all the prep work I need to do before re-doing the Nagios setup, except for the notifications:
1) Should we keep it as it is, with Nagios pinging the #infrastructure-bot channel?
2) Should I alter the notification period / interval?
3) Do all tasks need notifications enabled?
Would be great to have a script that checks for specific labels or label combinations and alerts if we lose a certain percentage of machines.
@aahlenst I'll have a look at adding that today :-)
@Willsparker Thanks for the great work. Question: Why no Ansible playbook for the config?
Honestly, I wasn't aware that the playbook could be used for the config :sweat_smile: I'll look to see if I can use the roles and merge the script I wrote above into the Nagios_Ansible_Config_tool.sh script mentioned in https://github.com/AdoptOpenJDK/openjdk-infrastructure/blob/master/ansible/playbooks/AdoptOpenJDK_Unix_Playbook/roles/Nagios_Master_Config/tasks/main.yml - it'll save me a lot of manual work :-)
@aahlenst Here's the script to check the percentage of machines online for a given label:
#!/bin/bash
# Checks what percentage of the Jenkins agents carrying a given label are online.
if [ -z "$1" ] || [ -z "$2" ] || [ -z "$3" ]; then
    echo "UNKNOWN - Invalid arguments"
    echo "Usage: $0 <Label> <Warning_Level> <Critical_Level>"
    exit 3
fi
if ! command -v jq &> /dev/null; then
    echo "UNKNOWN - JQ isn't installed"
    exit 3
fi
# Get the list of machines in the label
mapfile -t machine_array < <(curl -s "https://ci.adoptopenjdk.net/label/$1/api/json" | jq -r '.nodes[] | .nodeName')
if [[ ${#machine_array[@]} -eq 0 ]]; then
    echo "UNKNOWN - No machines found for label '$1'"
    exit 3
fi
# For each machine, query whether it's connected
response_array=()
for node in "${machine_array[@]}"
do
    response_array+=($(curl -s "https://ci.adoptopenjdk.net/computer/${node}/api/json" | jq .offline))
done
online=0
offline=0
for response in "${response_array[@]}"
do
    if [[ ${response} == "false" ]]; then
        online=$((online+1))
    else
        offline=$((offline+1))
    fi
done
export percentage_online=$(echo "scale=2; ($online/($offline+$online)) * 100" | bc -l)
if (( $(echo "$percentage_online < $3" | bc -l) )); then
    echo "CRITICAL - $percentage_online% machines online in '$1' label"
    echo "$online online machines; $offline offline machines"
    exit 2
elif (( $(echo "$percentage_online < $2" | bc -l) )); then
    echo "WARNING - $percentage_online% machines online in '$1' label"
    echo "$online online machines; $offline offline machines"
    exit 1
else
    echo "OK - $percentage_online% machines online in '$1' label"
    echo "$online online machines; $offline offline machines"
    exit 0
fi
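For context, wiring it into Nagios could look roughly like the sketch below - the host it's attached to and the thresholds are only placeholders:

define command{
    command_name check_label
    command_line $USER1$/check_label.sh '$ARG1$' $ARG2$ $ARG3$
}

define service {
    use generic-service
    host_name nagios-master          ; placeholder host to hang the check on
    service_description Build s390x label online percentage
    check_command check_label!build&&linux&&s390x!50!30
    check_period once-a-day-at-8
}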
Any requested changes?
More questions than change requests:
- Why export percentage_online?
- Does it work with label combinations like build&&linux&&s390x?
I think... I've got it working using the tools that Brad Blondin made when he initially set up the Nagios stuff!
Here's a list of things I had to do to make it work (which may or may not be related to me running this on a machine that isn't in the inventory):
- Edited Nagios_Ansible_Config_Tool.sh to remove all the references to sys_pingtest (this is due to the new template not having a ping service)
- Edited template.cfg to have the services we want to monitor
- Worked around an 'invalid key specified' error on this task (the provider variable)
- Changed inventory_hostname, as it was picking it up as 'localhost' when it should have been the IP address for the machine...
- If you run the ansible-playbook command with the -b option, when it gets to a task that is delegated to localhost, it sudos the localhost user too. Shouldn't be an issue if you run the playbook as the root user of a given machine (as the -b option won't be used)
- Put the network_time_sync script on the machine. I used the copy module for this, but when I PR this stuff, it'll be in the repo (I think - @aahlenst am I able to commit that script to this repo, legally?)
- Added the check_agent command to the Nagios Server (I'll commit this in Supporting Scripts, but it won't be needed in the playbook)
- Installed JQ on the Nagios Server
For the label checker script, I'll manually put that in and have it run on the Nagios server :-)
For the sake of testing it twice, I'm going to try this on a C7 machine that @sxa provided me; I'll connect it to Jenkins (and not put any labels on it) to see if this fixes a lot of my previous issues. Once I've got that working, I can do this for all the machines :-)
@aahlenst Removed the export - not sure why I put that in. And apparently it does (completely intentional!) - you just need to make sure you put the labels in speech marks, otherwise bash doesn't like the & signs:
will@will-XPS-13-9360:~/Documents/nagios_testing$ ./check_label.sh "build&&linux&&s390x" 50 30
OK - 100.00% machines online in 'build&&linux&&s390x' label
2 online machines; 0 offline machines
With the help of @gdams, we were able to get the test-aws-ubuntu1804-armv8-1 machine added to Nagios.
Final list of stuff to do:
- Sort out notifications so the #infrastructure-bot Slack channel isn't constantly spammed.
- Add the extra plugins to additional_plugins.
- Run the Nagios_* roles on the machines.
- Add the check_label script to the Nagios server.
Additional:
- Set up a cron job on the Nagios Server that queries Jenkins, and removes the machine entry from /usr/local/nagios/etc/servers/ if it's no longer in Jenkins (this should keep Nagios actually fairly useful, and not full of old machines we don't have).
- Figure out which machines still need the Nagios_* playbook roles run on them.

Setup a cron job on the Nagios Server that queries Jenkins, and removes the machine entry from /usr/local/nagios/etc/servers/, if it's no longer in Jenkins.
I'm not particularly keen on using Jenkins as the source of truth for our inventory. I'd rather let Ansible handle that (we have to remove the servers from the inventory anyway). Instead, I'd add a check that alerts us if a machine pops up in Jenkins that isn't known to Nagios (with the possibility to ignore dynamic agents).
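A rough sketch of what such a check could look like (the config path, the experimental- prefix and the master node name are assumptions):

#!/bin/bash
# Alert if Jenkins knows about an agent that has no config file under Nagios.
# Assumes host configs are named <agent>.cfg in /usr/local/nagios/etc/servers/
# and that experimental agents carry an "experimental-" prefix.
nagios_cfg_dir="/usr/local/nagios/etc/servers"

mapfile -t jenkins_nodes < <(curl -s https://ci.adoptopenjdk.net/computer/api/json \
    | jq -r '.computer[].displayName' | grep -v '^master$')

unknown=()
for node in "${jenkins_nodes[@]}"; do
    [[ $node == experimental-* ]] && continue                 # ignore experimental/dynamic agents
    [[ -f "$nagios_cfg_dir/$node.cfg" ]] || unknown+=("$node")
done

if (( ${#unknown[@]} > 0 )); then
    echo "CRITICAL - agents in Jenkins but unknown to Nagios: ${unknown[*]}"
    exit 2
fi
echo "OK - every Jenkins agent is known to Nagios"
exit 0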
The issue is, if we're removing a machine from the inventory (i.e. through a PR), Ansible isn't being used, and the machines will just stay there. And my only concern with that check is that sometimes ad-hoc machines are added to Jenkins (e.g. test-will-debian-riscv-1) that don't necessarily need monitoring.
Nagios Core has been updated from 4.3.4 to 4.4.6, following this guide.
@aahlenst What labels should we be checking for?
The issue is, if we're removing a machine from the inventory (i.e. through a PR), Ansible isn't being used, and the machines will just stay there.
Yeah, but Nagios will start nagging us and we need more discipline in this area. Let's assume the worst and someone manages to temporarily remove machines from Jenkins. If Nagios automatically drops the machine, it might take us weeks to realize that something has happened. That gives me shivers.
And my only concern with that check is that sometimes ad-hoc machines are added to Jenkins (e.g. test-will-debian-riscv-1) that don't necessarily need monitoring.
This is just bad practice (I'm aware that I'm guilty here, too). We need to separate "regular" from experimental machines, for example by prefixing them with experimental-. And we need to diligently monitor the presence of machines, not just the accumulated number. Any unknown new machine being added should trigger a red alert.
Yeah, but Nagios will start nagging us and we need more discipline in this area. Let's assume the worst and someone manages to temporarily remove machines from Jenkins. If Nagios automatically drops the machine, it might take us weeks to realize that something has happened. That gives me shivers. This is just bad practice (I'm aware that I'm guilty here, too). We need to separate "regular" from experimental machines, for example by prefixing them with experimental-. And we need to diligently monitor the presence of machines, not just the accumulated number. Any unknown new machine being added should trigger a red alert.
Both good points. When you say "Any unknown new machine being added should trigger a red alert", are you referring to any new machines being added to the inventory that aren't being monitored (except, for example, machines prefixed with experimental-)?
are you referring to any new machines being added to the inventory that aren't being monitored (except, for example, machines prefixed with experimental-)
If a new machine appears in Jenkins that is not known to Nagios, it should trigger an alert. For that to work, we need a list of known machines. One possibility to achieve that could be that we combine our inventory with a list of experimental machines that is maintained alongside the Ansible inventory.
We could use the config files in /usr/local/nagios/etc/servers/ to check for machines that are known to Nagios. Otherwise, I would have concerns about maintaining two separate lists - before I looked at https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/619, even the inventory wasn't very well maintained.
Changed the timeperiod once-a-day-at-8 to 9:00-10:00, as (due to the Nagios server being in CET) that refers to 8-9 in GMT, which should hopefully be after all the nightlies have finished.
I've been able to add all the machines I've got access to, as well as manually changed some of the machines I don't have access to. The following are the machines that are yet to be added to/updated in Nagios:
build-linaro-centos76-armv8-2 (timeout)
build-packet-ubuntu1804-armv8-1 (Connection closed)
docker-aws-ubuntu1604-x64-1 x
docker-aws-ubuntu1604-x64-2 x
docker-godaddy-ubuntu1604-x64-1 x
docker-scaleway-ubuntu1604-armv7-1 (Connection Closed)
test-ibm-aix71-ppc64-1 x
test-ibm-aix71-ppc64-2 x
test-ibmcloud-ubuntu1604-x64-1 x
test-macstadium-macos11-arm64-1 x
test-macstadium-macos11-arm64-2 x
We could use the config files in /usr/local/nagios/etc/servers/ to check for machines that are known to Nagios.
Ideally, the Ansible inventory serves as single source of truth for those config files. If that isn't possible at the moment, we have to live with it, but should have a ticket that states what's left to do.
@Willsparker adding to your requirements, I would like to see "Alerts" added to warn (via Slack) when a given node's "Free disk space" is within a certain margin (say ~3GB) of our 10GB Jenkins offline limit, i.e. warn if <13GB free! This is so we get a heads-up that a node is close to being taken offline by Jenkins and we can act before it fails a nightly build.
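One way to express that (assuming the standard check_disk plugin, which supports absolute free-space thresholds via --units) would be a variant of the existing check_remote_disk command - the command name here is just a suggestion:

# Hypothetical absolute-threshold variant of check_remote_disk
define command{
    command_name check_remote_disk_gb
    command_line $USER1$/check_by_ssh -t 360 -H $HOSTADDRESS$ -C '/usr/lib/nagios/plugins/check_disk -w $ARG1$ -c $ARG2$ --units GB -p $ARG3$'
}
# e.g. check_remote_disk_gb!13!10!/ warns below 13GB free and goes critical below 10GB free.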
We could use the config files in /usr/local/nagios/etc/servers/ to check for machines that are known to Nagios.
Ideally, the Ansible inventory serves as single source of truth for those config files. If that isn't possible at the moment, we have to live with it, but should have a ticket that states what's left to do.
That should absolutely be accurate as far as "live" machines are concerned, and I've been acting to resolve any discrepancies as soon as they show up, so hopefully we don't need a ticket with a long list of todos on that one ;-) Just PRs put in to fix them. So yes, that list should be definitive.
This was a question as to whether to keep it. We think we should, but we need to bring it into our monitoring regime.