NagiosEnterprises / nagioscore

Nagios 4.2.4 segfault with thruk + livestatus and "submit passive result" #391

Closed: wleese closed this 7 years ago

wleese commented 7 years ago

Hi,

We recently upgraded to Nagios 4.2.4 (avoiding 4.3 due to known mod_gearman issues), and since then Nagios runs nicely for hours and then, without any apparent reason, stops working.

nagios.debug:

Host is not flapping (0.00% state change).
[1499343883.843234] [064.1] [pid=19873] Making callbacks (type 2)...
[1499343883.843238] [001.0] [pid=19873] macros_to_kvv()
[1499343883.843241] [016.1] [pid=19873] Rescheduling next check of host at Thu Jul  6 14:29:43 2017
[1499343883.843245] [001.0] [pid=19873] get_next_valid_time()
[1499343883.843248] [001.0] [pid=19873] schedule_host_check()
[1499343883.843243] [064.1] [pid=19873] Making callbacks (type 21)...
[1499343883.843253] [016.0] [pid=19873] Scheduling a non-forced, active check of host 'xxxx' @ Thu Jul  6 14:29:43 2017
[1499343883.843258] [001.0] [pid=19873] [1499343883.843253] [064.1] [pid=19873] Making callbacks (type 20)...
add_event()
[1499343883.843265] [064.1] [pid=19873] Making callbacks (type 12)...
[1499343883.843263] [001.0] [pid=19873] get_next_service_notification_time()
[1499343883.843268] [064.1] [pid=19873] Making callbacks (type 12)...
[1499343883.843273] [016.1] [pid=19873] ** Async check result for host 'xxx handled: new state=0
[1499343883.843275] [032.0] [pid=19873] 2 contacts were notified.  Next possible notification time: Thu Jul  6 14:24:43 2017
[1499343883.843276] [064.1] [pid=19873] Making callbacks (type 7)...
[1499343883.843283] [032.0] [pid=19873] 2 contacts were notified.
[1499343883.843285] [064.1] [pid=19873] Making callbacks (type 5)...
[1499343883.843289] [001.0] [pid=19873] handle_timed_event() start
[1499343883.843292] [064.1] [pid=19873] Making callbacks (type 13)...
[1499343883.843292] [064.1] [pid=19873] Making callbacks (type 1)...
[1499343883.843295] [001.0] [pid=19873] handle_service_event()
[1499343883.843301] [064.1] [pid=19873] Making callbacks (type 23)...
[1499343883.843304] [001.0] [pid=19873] run_global_service_event_handler()
[1499343883.843308] [001.0] [pid=19873] clear_volatile_macros_r()
[1499343883.843309] [008.0] [pid=19873] ** Timed Event ** Type: EVENT_SERVICE_CHECK, Run Time: Thu Jul  6 14:24:43 2017
[1499343883.843312] [008.0] [pid=19873] ** Service Check Event ==> Host: 'xxxx', Service: 'Snapshot space usedxxx', Options: 0, Latency: 0.000000 sec
[1499343883.843315] [001.0] [pid=19873] run_scheduled_service_check() start
[1499343883.843316] [016.0] [pid=19873] Attempting to run scheduled check of service 'xxx': check options=0, latency=0.000000
[1499343883.843319] [001.0] [pid=19873] [1499343883.843317] [016.1] [pid=19873] Rescheduling next check of service at Thu Jul  6 14:24:43 2017
run_async_service_check()
[1499343883.843325] [001.0] [pid=19873] check_service_check_viability()
[1499343883.843323] [001.0] [pid=19873] get_next_valid_time()
[1499343883.843330] [001.0] [pid=19873] check_time_against_period()
[1499343883.843335] [001.0] [pid=19873] _get_matching_timerange()
[1499343883.843337] [001.0] [pid=19873] _get_matching_timerange()
[1499343883.843339] [001.0] [pid=19873] check_service_dependencies()
[1499343883.843341] [001.0] [pid=19873] check_service_parents()
[1499343883.843343] [064.1] [pid=19873] Making callbacks (type 6)...
[1499343883.843341] [064.1] [pid=19873] Making callbacks (type 6)...

mod_gearman neb (3.0.4):

[2017-07-06 14:24:43][19873][DEBUG] passing by local servicecheck: xxxx - xxx
[2017-07-06 14:24:43][19873][DEBUG] passing by local servicecheck: xxxx - xxx
[2017-07-06 14:24:43][19873][DEBUG] passing by local servicecheck: xxx - xxx
[2017-07-06 14:24:43][19873][DEBUG] passing by local servicecheck: xxxx - xxx
[2017-07-06 14:24:43][19873][DEBUG] passing by local servicecheck: xxxx -xxx
[2017-07-06 14:24:43][19873][DEBUG] passing by local hostcheck: xxx
[2017-07-06 14:24:43][19873][DEBUG] passing by local hostcheck: xxx
[2017-07-06 14:24:43][19873][DEBUG] passing by local hostcheck: xxx
[2017-07-06 14:24:43][19873][DEBUG] passing by local servicecheck: xxxx- xxx
[2017-07-06 14:24:43][19873][DEBUG] passing by local servicecheck: xxxx - xxx
[2017-07-06 14:24:43][19873][DEBUG] passing by local hostcheck: xxx
[2017-07-06 14:24:43][19873][DEBUG] passing by local hostcheck: xxx
[2017-07-06 14:24:43][19873][DEBUG] passing by local hostcheck: xxx

nagios.log:

[1499343864] SERVICE ALERT: applications_acc;pci;CRITICAL;HARD;1;xxxxx started
[1499343878] SERVICE ALERT: xxxx;System Mounts;WARNING;SOFT;2;WARNING - 4 plugins checked, 1 warning (local_diskspace), 3 ok
[1499343878] SERVICE ALERT:xxxxx;xxxx;OK;SOFT;2;xxxxxx\n
[1499343881] SERVICE ALERT: applications_acc;rpg;CRITICAL;HARD;1;xxxx started
[1499343883] SERVICE ALERT: xxxx;xxxx;OK;HARD;1;0
[1499343883] SERVICE NOTIFICATION: team2s_email_pro;xxxx;xxxx;OK;notify-service-by-email;0
[1499343883] SERVICE NOTIFICATION: team2s_phase2_sms_app;xxxx;xxx;OK;notify-team-service-by-sms;0

In the journal I get:

Jul 06 14:24:43 xxxxx nagios[19893]: job 19593 (pid=31154): read() returned error 11

...but this message didn't occur the first time Nagios crashed, only the second, so it's likely not related.

Unable to get a backtrace due to the process disappearing.

Other details: here's what gearman_top showed at the moment the issue happened:

Anything else I should be looking at?

wleese commented 7 years ago

Another crash:

[1499683914.648111] [032.0] [pid=22747] Notification viability test failed.  No notification will be sent out.
[1499683914.648113] [001.0] [pid=22747] handle_service_event()
[1499683914.648114] [064.1] [pid=22747] Making callbacks (type 23)...
[1499683914.648116] [001.0] [pid=22747] run_global_service_event_handler()
[1499683914.648118] [001.0] [pid=22747] clear_volatile_macros_r()
[1499683914.648121] [016.1] [pid=22747] Rescheduling next check of service at Mon Jul 10 12:56:54 2017
[1499683914.648122] [001.0] [pid=22747] get_next_valid_time()
[1499683914.648125] [001.0] [pid=22747] _get_matching_timerange()
[1499683914.648129] [064.1] [pid=22747] Making callbacks (type 6)...

No clear pattern in nagios.log or mod_gearman_neb.log. nagios.debug, however, does look similar. What are these type 6 callbacks?

wleese commented 7 years ago

Another crash:

Jul 12 16:43:21 shd-mxxx nagios[564]: job 584 (pid=31910): read() returned error 11
Jul 12 16:43:33 shd-mxxxnagios[579]: job 657 (pid=2672): read() returned error 11
Jul 12 16:44:22 shd-mxxx nagios[580]: job 844 (pid=14900): read() returned error 11
Jul 12 16:45:15 shd-mxxx nagios[572]: job 1086 (pid=26422): read() returned error 11
Jul 12 16:48:24 shd-mxxx nagios[586]: job 1838 (pid=2081): read() returned error 11
Jul 12 16:48:24 shd-mxxx nagios[586]: job 1838 (pid=2081): read() returned error 11
Jul 12 16:49:33 shd-mxxx nagios[580]: job 2118 (pid=18098): read() returned error 11
Jul 12 16:49:45 shd-mxxx systemd[1]: nagios.service: main process exited, code=killed, status=11/SEGV
Hint: Some lines were ellipsized, use -l to show in full.

Something we noticed: this never happens on the weekend or at night, when no one is touching Nagios. Furthermore, we do hourly Nagios reloads. The most recent crash seemed to happen when someone wanted to schedule downtime. Maybe related, maybe not. Still haven't been able to reproduce.

hedenface commented 7 years ago

@wleese Type 6 is NEBCALLBACK_SERVICE_CHECK_DATA.
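
For readers who haven't written an event broker module: a NEB module asks Core for these events by registering a callback for a given type at load time, and Core then invokes it synchronously from its event loop. A minimal sketch against the Nagios 4 NEB headers (illustrative only, not mod_gearman's actual handler):

#include "nebmodules.h"
#include "nebcallbacks.h"
#include "nebstructs.h"

NEB_API_VERSION(CURRENT_NEB_API_VERSION)

static void *mod_handle = NULL;

/* invoked by Core for every type-6 event (service check data) */
static int my_svc_check_cb(int callback_type, void *data)
{
    nebstruct_service_check_data *sc = (nebstruct_service_check_data *)data;

    if (sc == NULL)
        return 0;
    /* sc->host_name, sc->service_description, sc->type etc. are available here.
       Returning NEBERROR_CALLBACKOVERRIDE instead of 0 tells Core the module
       took over the check - that's the "overridden by a module" debug message. */
    return 0;
}

int nebmodule_init(int flags, char *args, nebmodule *handle)
{
    mod_handle = handle;
    /* type 6 == NEBCALLBACK_SERVICE_CHECK_DATA */
    return neb_register_callback(NEBCALLBACK_SERVICE_CHECK_DATA, mod_handle, 0, my_svc_check_cb);
}

int nebmodule_deinit(int flags, int reason)
{
    return neb_deregister_callback(NEBCALLBACK_SERVICE_CHECK_DATA, my_svc_check_cb);
}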

Sorry for not responding as much lately. I have been looking into this. The "read() returned error 11" message comes from the function gather_output in lib/worker.c.
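
For context: on Linux, errno 11 is EAGAIN/EWOULDBLOCK, meaning a non-blocking read found no data ready at that instant rather than hitting a hard failure. A minimal sketch (not the actual gather_output code) of how a caller typically separates the two cases:

#include <errno.h>
#include <unistd.h>

/* returns bytes read (0 also covers EOF or "nothing ready yet"), -1 on a real error */
static ssize_t read_nonfatal(int fd, char *buf, size_t len)
{
    for (;;) {
        ssize_t n = read(fd, buf, len);
        if (n >= 0)
            return n;
        if (errno == EINTR)
            continue;   /* interrupted by a signal: retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return 0;   /* no data ready on a non-blocking fd: not fatal */
        return -1;      /* genuine I/O error */
    }
}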

I have an idea of what is causing it, but it'll take some time to replicate.

Just a curiosity: how many hosts and services do you have?

wleese commented 7 years ago

Perhaps a bit verbose, but here's a checkconfig:

Nagios Core 4.2.4
Copyright (c) 2009-present Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 12-07-2016
License: GPL

Website: https://www.nagios.org
Reading configuration data...
Warning: use_embedded_perl_implicitly is deprecated and will be removed.
Warning: enable_embedded_perl is deprecated and will be removed.
Warning: p1_file is deprecated and will be removed.
Warning: sleep_time is deprecated and will be removed.
Warning: external_command_buffer_slots is deprecated and will be removed. All commands are always processed upon arrival
Warning: command_check_interval is deprecated and will be removed. Commands are always handled on arrival
   Read main config file okay...
Warning: failure_prediction_enabled is obsoleted and no longer has any effect in host type objects (config file '/etc/nagios/auto.d/settings/templates.cfg', starting at line 5)
Warning: failure_prediction_enabled is obsoleted and no longer has any effect in service type objects (config file '/etc/nagios/auto.d/settings/templates.cfg', starting at line 24)
Warning: failure_prediction_enabled is obsoleted and no longer has any effect in host type objects (config file '/etc/nagios/conf.d/init/templates.cfg', starting at line 5)
Warning: failure_prediction_enabled is obsoleted and no longer has any effect in host type objects (config file '/etc/nagios/conf.d/init/templates.cfg', starting at line 23)
Warning: failure_prediction_enabled is obsoleted and no longer has any effect in service type objects (config file '/etc/nagios/conf.d/init/templates.cfg', starting at line 40)
Warning: failure_prediction_enabled is obsoleted and no longer has any effect in service type objects (config file '/etc/nagios/conf.d/init/templates.cfg', starting at line 66)
Warning: failure_prediction_enabled is obsoleted and no longer has any effect in host type objects (config file '/etc/nagios/conf.d/check_mk_templates.cfg', starting at line 88)
Warning: failure_prediction_enabled is obsoleted and no longer has any effect in service type objects (config file '/etc/nagios/conf.d/check_mk_templates.cfg', starting at line 157)
   Read object config files okay...

Running pre-flight check on configuration data...

Checking objects...
    Checked 110703 services.
    Checked 5467 hosts.
    Checked 1375 host groups.
    Checked 0 service groups.
    Checked 2250 contacts.
    Checked 1362 contact groups.
    Checked 6237 commands.
    Checked 10 time periods.
    Checked 0 host escalations.
    Checked 0 service escalations.
Checking for circular paths...
    Checked 5467 hosts
    Checked 150478 service dependencies
    Checked 0 host dependencies
    Checked 10 timeperiods
Checking global event handlers...
Checking obsessive compulsive processor commands...
Checking misc settings...

Total Warnings: 0
Total Errors:   0

Things look okay - No serious problems were detected during the pre-flight check

hedenface commented 7 years ago

Holy #%!?

That explains it. Okay. Here's the full deal (and this is something we'll be changing in Core5, but until we can swap some of the inner guts out, you're going to be stuck).

Nagios uses a Unix domain socket for worker communication. The main process starts up, schedules events, and then sends them to the workers over this socket. It is a blocking socket, which means the workers are paused while waiting for data, or while one side is sending data back.

There is a lot of communication going over this socket on your system. I'll come up with a quick fix for you; I think in your case we can adjust the socket's timeout, and that should alleviate the iobroker being unavailable to the workers.
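
Concretely, a receive timeout makes a stalled peer surface as a failed read instead of an indefinite hang. A self-contained sketch (not Nagios code) of the behavior SO_RCVTIMEO buys you:

#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

int main(void)
{
    int sv[2];
    char buf[16];
    struct timeval tmo = { 2, 0 };  /* give up on a read after 2 seconds */

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return 1;
    setsockopt(sv[0], SOL_SOCKET, SO_RCVTIMEO, &tmo, sizeof(tmo));

    /* nothing is ever written to sv[1]; without the timeout this read would block forever */
    if (read(sv[0], buf, sizeof(buf)) < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        printf("read timed out instead of blocking forever\n");

    close(sv[0]);
    close(sv[1]);
    return 0;
}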

wleese commented 7 years ago

Note that we had no issues running Nagios 3, although check latency became a problem after reloads (20-minute latencies). This amount of config seems to be nicely supported by Nagios 4, ignoring the segfaults ;)

Thanks for the explanation - eagerly awaiting a fix. Btw, does this also apply when most of these checks are being run through the event broker -> mod_gearman?

hedenface commented 7 years ago

Okay, let's try this. I hope you're up to changing some code:

In lib/worker.c find the function called worker_set_sockopts, and make it look like this:

int worker_set_sockopts(int sd, int bufsize)
{
    int ret;

    ret = fcntl(sd, F_SETFD, FD_CLOEXEC);
    ret |= fcntl(sd, F_SETFL, O_NONBLOCK);

    if (!bufsize)
        return ret;

    /* new: give receives a 60s timeout so a stalled peer
       can't wedge this end of the socket indefinitely */
    struct timeval timeout;
    timeout.tv_sec = 60;
    timeout.tv_usec = 0;
    ret |= setsockopt(sd, SOL_SOCKET, SO_RCVTIMEO, (const char*)&timeout, sizeof(struct timeval));

    ret |= setsockopt(sd, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof(int));
    ret |= setsockopt(sd, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof(int));

    return ret;
}

And then recompile and reinstall.

Please keep me posted as to what happens.

ericloyd commented 7 years ago

You have more than 110,000 service checks? And that was running fine on Nagios 3? I'm literally amazed. Nagios 3 forked an entire copy of itself to perform each service check. That's a lot of overhead for 110,000 services and 5000 hosts! Nagios 4 introduced some major scheduling and forking improvements that I'm surprised didn't make things better.

Be that as it may, professionally speaking, I think you should consider a federated approach at this point - multiple workers doing the checks and a single Nagios doing the scheduling/reporting. It will make it much easier to scale.

hedenface commented 7 years ago

@ericloyd I believe he has mod_gearman enabled, if I'm not mistaken. There's still a bottleneck in the scheduling at that point. This is (one of) the exact problems I'm trying to solve with Core5.

ericloyd commented 7 years ago

I must have missed that somewhere. Cuz I call bullsh*t on a Nagios 3 box running 5000+ hosts with 100000+ services on its own. :-)

wleese commented 7 years ago

No prob. Can we apply this to 4.2.4 as well? We're avoiding 4.3.x due to the mod_gearman issues and are already running a patched 4.2.4 with the fix that prevents hanging on a simultaneous reload and external command.

hedenface commented 7 years ago

@wleese Yes, it should be fine.

wleese commented 7 years ago

Alright. I'll deploy this on Tuesday on one of our machines. The segfaults sometimes don't happen for days, so it might take a bit before I report back.

wleese commented 7 years ago

Heads up: we're running the patch on a single node and waiting patiently for trouble. Makes me curious: would the lack of a timeout on the worker socket result in a segfault?

wleese commented 7 years ago

Jul 20 14:01:20 xxx systemd[1]: nagios.service failed.
Jul 20 14:01:20 xxx systemd[1]: Unit nagios.service entered failed state.
Jul 20 14:01:19 xxx systemd[1]: nagios.service: main process exited, code=killed, status=11/SEGV

Sadly, the patch didn't help.

hedenface commented 7 years ago

Do you have any additional info on the one that was patched?

wleese commented 7 years ago

Not yet. I failed to copy the debug log away before restarting. I was trying to get a core dump last week, but failed to configure the system correctly. Would that be of any use? If so, I'll get back to working on that.

hedenface commented 7 years ago

I can't say definitively if it is of any use until I see it, unfortunately. Potentially anything could be. Regardless, I think we're going to overhaul the way the socket communicates with the main fork and workers in Core5 so hopefully this won't even be an issue in the future.

Until then, I'd need to reproduce the issue to really help debug it. It looks like a combo of LOTS of checks and less than optimal system resources (for the amount of checks).

Speaking of that, what's your swappiness? What does free say? How much RAM and CPU does the server have?

wleese commented 7 years ago

We never really have performance issues with Nagios, but here you go:

root@xxxx-002 ~# free -g
              total        used        free      shared  buff/cache   available
Mem:            125           3          78           4          43         117
Swap:            15           0          15

root@xxxx-002 ~# sysctl -a|grep swap
vm.swappiness = 10

16 cores:

processor   : 15
vendor_id   : GenuineIntel
cpu family  : 6
model       : 79
model name  : Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
stepping    : 1
microcode   : 0xb000012
cpu MHz     : 3200.000
cache size  : 25600 KB
physical id : 0
siblings    : 16
core id     : 12
cpu cores   : 8
apicid      : 25
initial apicid  : 25
fpu     : yes
fpu_exception   : yes
cpuid level : 20
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc
bogomips    : 6392.60
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

nagiosstats:

Nagios Stats 4.2.4
Copyright (c) 2003-2008 Ethan Galstad (www.nagios.org)
Last Modified: 12-07-2016
License: GPL

CURRENT STATUS DATA
------------------------------------------------------
Status File:                            /var/log/nagios/status.dat
Status File Age:                        0d 0h 0m 14s
Status File Version:                    4.2.4

Program Running Time:                   0d 0h 28m 20s
Nagios PID:                             20386

Total Services:                         111202
Services Checked:                       85317
Services Scheduled:                     79090
Services Actively Checked:              108176
Services Passively Checked:             3026
Total Service State Change:             0.000 / 100.000 / 0.646 %
Active Service Latency:                 0.000 / 174.778 / 0.370 sec
Active Service Execution Time:          0.000 / 30.060 / 0.356 sec
Active Service State Change:            0.000 / 37.890 / 0.056 %
Active Services Last 1/5/15/60 min:     6074 / 56478 / 70427 / 78614
Passive Service Latency:                0.001 / 5097.830 / 47.332 sec
Passive Service State Change:           0.000 / 100.000 / 21.728 %
Passive Services Last 1/5/15/60 min:    6 / 53 / 128 / 265
Services Ok/Warn/Unk/Crit:              110093 / 577 / 97 / 435
Services Flapping:                      34
Services In Downtime:                   294

Total Hosts:                            5492
Hosts Checked:                          4615
Hosts Scheduled:                        4615
Hosts Actively Checked:                 5492
Host Passively Checked:                 0
Total Host State Change:                0.000 / 12.370 / 0.193 %
Active Host Latency:                    0.000 / 10.645 / 0.285 sec
Active Host Execution Time:             0.000 / 10.136 / 0.125 sec
Active Host State Change:               0.000 / 12.370 / 0.193 %
Active Hosts Last 1/5/15/60 min:        86 / 4608 / 4615 / 4615
Passive Host Latency:                   0.000 / 0.000 / 0.000 sec
Passive Host State Change:              0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min:       0 / 0 / 0 / 0
Hosts Up/Down/Unreach:                  5415 / 77 / 0
Hosts Flapping:                         0
Hosts In Downtime:                      27

Active Host Checks Last 1/5/15 min:     48 / 1758 / 4868
   Scheduled:                           40 / 1607 / 4583
   On-demand:                           8 / 151 / 285
   Parallel:                            40 / 1607 / 4583
   Serial:                              0 / 0 / 0
   Cached:                              8 / 151 / 285
Passive Host Checks Last 1/5/15 min:    0 / 0 / 0
Active Service Checks Last 1/5/15 min:  3348 / 27802 / 84383
   Scheduled:                           3348 / 27802 / 84383
   On-demand:                           0 / 0 / 0
   Cached:                              0 / 0 / 0
Passive Service Checks Last 1/5/15 min: 11 / 407 / 617

External Commands Last 1/5/15 min:      22 / 438 / 923

> Until then, I'd need to reproduce the issue to really help debug it. It looks like a combo of LOTS of checks and less than optimal system resources (for the amount of checks).

We suspect it has something to do with user-initiated tasks, considering we've never had a segfault outside of office hours.

wleese commented 7 years ago

Alright, so again this problem seems related to our users downtiming lots of machines. Here's some logging from Thruk:

[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_HOST_DOWNTIME;xhostx-011.xxxx;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_HOST_DOWNTIME;xhostx-012.xxxx;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_HOST_DOWNTIME;xhostx-014.xxxx;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_HOST_DOWNTIME;xhostx-013.xxxx;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-011.xxxx;Avg CPU usage;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-014.xxxx;Power Supply Units;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-011.xxxx;Avg memory usage;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-011.xxxx;Fans;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-011.xxxx;Int e-1/1;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-011.xxxx;Int e-1/10;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-011.xxxx;Int e-1/11;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-011.xxxx;Int e-1/2;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-011.xxxx;Int e-1/4;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-011.xxxx;Int e-1/41;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-011.xxxx;Int e-1/42;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-011.xxxx;Int e-1/43;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-011.xxxx;Int e-1/44;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-011.xxxx;Int e-1/45;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-011.xxxx;Int e-1/46;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-011.xxxx;Int e-1/47;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-011.xxxx;Int e-1/48;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-011.xxxx;Int e-1/5;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-011.xxxx;Int e-1/7;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-011.xxxx;Int e-1/8;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-011.xxxx;Module;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-011.xxxx;Power Supply Units;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-012.xxxx;Avg CPU usage;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-012.xxxx;Avg memory usage;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-012.xxxx;Fans;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-012.xxxx;Int e-1/1;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-012.xxxx;Int e-1/10;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-012.xxxx;Int e-1/11;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-012.xxxx;Int e-1/2;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-012.xxxx;Int e-1/4;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-012.xxxx;Int e-1/41;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-012.xxxx;Int e-1/42;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-012.xxxx;Int e-1/43;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-012.xxxx;Int e-1/44;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-012.xxxx;Int e-1/45;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-012.xxxx;Int e-1/46;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-012.xxxx;Int e-1/47;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-012.xxxx;Int e-1/48;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-012.xxxx;Int e-1/5;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-012.xxxx;Int e-1/7;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-012.xxxx;Int e-1/8;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-012.xxxx;Module;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-012.xxxx;Power Supply Units;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-013.xxxx;Avg CPU usage;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-013.xxxx;Avg memory usage;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-013.xxxx;Fans;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-013.xxxx;Int e-1/1;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-013.xxxx;Int e-1/10;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-013.xxxx;Int e-1/11;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-013.xxxx;Int e-1/2;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-013.xxxx;Int e-1/4;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-013.xxxx;Int e-1/41;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-013.xxxx;Int e-1/42;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-013.xxxx;Int e-1/43;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-013.xxxx;Int e-1/44;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-013.xxxx;Int e-1/45;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-013.xxxx;Int e-1/46;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-013.xxxx;Int e-1/47;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-013.xxxx;Int e-1/48;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-013.xxxx;Int e-1/5;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-013.xxxx;Int e-1/7;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-013.xxxx;Int e-1/8;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-013.xxxx;Module;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-013.xxxx;Power Supply Units;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-014.xxxx;Avg CPU usage;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-014.xxxx;Avg memory usage;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-014.xxxx;Fans;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-014.xxxx;Int e-1/1;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-014.xxxx;Int e-1/10;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-014.xxxx;Int e-1/11;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-014.xxxx;Int e-1/2;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-014.xxxx;Int e-1/4;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-014.xxxx;Int e-1/41;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-014.xxxx;Int e-1/42;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-014.xxxx;Int e-1/43;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-014.xxxx;Int e-1/44;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-014.xxxx;Int e-1/45;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-014.xxxx;Int e-1/46;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-014.xxxx;Int e-1/47;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-014.xxxx;Int e-1/48;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-014.xxxx;Int e-1/5;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-014.xxxx;Int e-1/7;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-014.xxxx;Int e-1/8;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:08][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552068] SCHEDULE_SVC_DOWNTIME;xhostx-014.xxxx;Module;1500551940;1500559140;1;0;7200;someuser;Upgrade to version 12.3R12.4
[2017/07/20 14:01:18][xxxx-002.xxxx][INFO][Thruk] [someuser][Core] cmd: COMMAND [1500552078] PROCESS_SERVICE_CHECK_RESULT;xhostx-008.xxxx;trap-chassis;0;OK|

After that last line:

[2017/07/20 14:01:20][xxxx][ERROR][Thruk] No Backend available

hedenface commented 7 years ago

Ooooh - that makes a bit of sense.

Can you create about 200 checks for a test? Create them - use check_dummy or something - and then schedule them all into downtime and see if it reproduces the issue consistently.
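
One way to drive that test without clicking through a UI is to flood the external command file with downtime commands in the same format Thruk emits. A rough sketch; the path and host/service names are examples, not from this setup:

#include <stdio.h>
#include <time.h>

int main(void)
{
    const char *cmd_file = "/var/spool/nagios/cmd/nagios.cmd";  /* adjust to your command_file */
    time_t now = time(NULL);
    FILE *fp = fopen(cmd_file, "w");
    int i;

    if (fp == NULL)
        return 1;
    /* fixed 2-hour downtime for services test1..test200 on host dummyhost */
    for (i = 1; i <= 200; i++)
        fprintf(fp, "[%lu] SCHEDULE_SVC_DOWNTIME;dummyhost;test%d;%lu;%lu;1;0;7200;tester;stress test\n",
                (unsigned long)now, i, (unsigned long)now, (unsigned long)(now + 7200));
    fclose(fp);
    return 0;
}

Keep in mind a UI may submit commands through a different path than the command file, so this won't necessarily exercise the same code as your users do.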

wleese commented 7 years ago

I just tested this with a local instance - I was able to downtime ~3000 checks (those were in PENDING status, though) without crashing Nagios. Then I proceeded to randomly downtime hundreds and revert that.

Downtiming tens of machines is quite normal in our infrastructure, so I'm afraid testing in a synthetic environment won't prove helpful.

wleese commented 7 years ago

Just curious: would it matter that we use nsca-ng, which relies on nagios.cmd? I'm assuming Thruk uses the command_file as well. (grasping at straws :))

wleese commented 7 years ago

Another crash.

[1500967475.184987] [016.0] [pid=20016] Checking service 'xxxx' on host 'xxxx'...
[1500967475.184989] [001.0] [pid=20016] get_raw_command_line_r()
[1500967475.184992] [001.0] [pid=20016] process_macros_r()
[1500967475.184993] [2048.1] [pid=20016] **** BEGIN MACRO PROCESSING ***********
[1500967475.184995] [2048.1] [pid=20016] Processing: 'xxxx'
[1500967475.184997] [2048.1] [pid=20016]   Done.  Final output: 'xxxx'
[1500967475.184999] [2048.1] [pid=20016] **** END MACRO PROCESSING *************
[1500967475.185001] [001.0] [pid=20016] process_macros_r()
[1500967475.185002] [2048.1] [pid=20016] **** BEGIN MACRO PROCESSING ***********
[1500967475.185004] [2048.1] [pid=20016] Processing: 'noargs'
[1500967475.185006] [2048.1] [pid=20016]   Done.  Final output: 'noargs'
[1500967475.185008] [2048.1] [pid=20016] **** END MACRO PROCESSING *************
[1500967475.185010] [001.0] [pid=20016] process_macros_r()
[1500967475.185011] [2048.1] [pid=20016] **** BEGIN MACRO PROCESSING ***********
[1500967475.185013] [2048.1] [pid=20016] Processing: '$USER1$/check_nrpe -t 30 -H $HOSTADDRESS$ -c $ARG1$ -a $ARG2$ $ARG3$ $ARG4$ $ARG5$ $ARG6$ $ARG7$ $ARG8$ $ARG9$'
[1500967475.185019] [2048.1] [pid=20016]   Done.  Final output: '/usr/lib64/nagios/plugins/check_nrpe -t 30 -H xxxx -c xxxx -a noargs       '
[1500967475.185021] [2048.1] [pid=20016] **** END MACRO PROCESSING *************
[1500967475.185023] [064.1] [pid=20016] Making callbacks (type 6)...
[1500967475.185028] [001.0] [pid=20016] macros_to_kvv()
[1500967475.185035] [001.0] [pid=20016] clear_volatile_macros_r()
[1500967475.185041] [001.0] [pid=20016] handle_timed_event() end
[1500967475.185043] [064.1] [pid=20016] Making callbacks (type 1)...
[1500967475.185050] [008.1] [pid=20016] ** Event Check Loop
[1500967475.185056] [008.1] [pid=20016] Next Event Time: Tue Jul 25 09:24:35 2017
[1500967475.185058] [008.1] [pid=20016] Current/Max Service Checks: 293/0 (inf% saturation)
[1500967475.185060] [12288.1] [pid=20016] ## Polling 0ms; sockets=26; events=83813; iobs=0x7f1c56581110
[1500967475.185063] [001.0] [pid=20016] handle_timed_event() start
[1500967475.185064] [064.1] [pid=20016] Making callbacks (type 1)...
[1500967475.185070] [008.0] [pid=20016] ** Timed Event ** Type: EVENT_SERVICE_CHECK, Run Time: Tue Jul 25 09:24:35 2017
[1500967475.185072] [008.0] [pid=20016] ** Service Check Event ==> Host: 'xxxx', Service: 'xxxx', Options: 0, Latency: 0.160017 sec
[1500967475.185075] [001.0] [pid=20016] run_scheduled_service_check() start
[1500967475.185077] [016.0] [pid=20016] Attempting to run scheduled check of service 'xxxx' on host 'xxxx': check options=0, latency=0.160017
[1500967475.185079] [001.0] [pid=20016] run_async_service_check()
[1500967475.185081] [001.0] [pid=20016] check_service_check_viability()
[1500967475.185082] [001.0] [pid=20016] check_time_against_period()
[1500967475.185087] [001.0] [pid=20016] _get_matching_timerange()
[1500967475.185091] [001.0] [pid=20016] check_service_dependencies()
[1500967475.185093] [001.0] [pid=20016] check_service_dependencies()
[1500967475.185095] [001.0] [pid=20016] check_service_dependencies()
[1500967475.185097] [001.0] [pid=20016] check_service_parents()
[1500967475.185099] [064.1] [pid=20016] Making callbacks (type 6)...
[1500967475.185124] [001.0] [pid=20016] clear_volatile_macros_r()
[1500967475.185127] [001.0] [pid=20016] get_raw_command_line_r()
[1500967475.185129] [001.0] [pid=20016] process_macros_r()
[1500967475.185131] [2048.1] [pid=20016] **** BEGIN MACRO PROCESSING ***********
[1500967475.185133] [2048.1] [pid=20016] Processing: 'xxxx'
[1500967475.185135] [2048.1] [pid=20016]   Done.  Final output: 'xxxx'
[1500967475.185137] [2048.1] [pid=20016] **** END MACRO PROCESSING *************
[1500967475.185138] [001.0] [pid=20016] process_macros_r()
[1500967475.185140] [2048.1] [pid=20016] **** BEGIN MACRO PROCESSING ***********
[1500967475.185141] [2048.1] [pid=20016] Processing: 'noargs'
[1500967475.185143] [2048.1] [pid=20016]   Done.  Final output: 'noargs'
[1500967475.185145] [2048.1] [pid=20016] **** END MACRO PROCESSING *************
[1500967475.185147] [001.0] [pid=20016] process_macros_r()
[1500967475.185148] [2048.1] [pid=20016] **** BEGIN MACRO PROCESSING ***********
[1500967475.185150] [2048.1] [pid=20016] Processing: '$USER1$/check_nrpe -t 30 -H $HOSTADDRESS$ -c $ARG1$ -a $ARG2$ $ARG3$ $ARG4$ $ARG5$ $ARG6$ $ARG7$ $ARG8$ $ARG9$'
[1500967475.185156] [2048.1] [pid=20016]   Done.  Final output: '/usr/lib64/nagios/plugins/check_nrpe -t 30 -H xxxxx -c xxxx -a noargs       '
[1500967475.185158] [2048.1] [pid=20016] **** END MACRO PROCESSING *************
[1500967475.185277] [016.0] [pid=20016] Check of service 'xxxx' on host 'xxxx' (id=102464) was overridden by a module
[1500967475.185282] [001.0] [pid=20016] handle_timed_event() end
[1500967475.185284] [064.1] [pid=20016] Making callbacks (type 1)...
[1500967475.185290] [008.1] [pid=20016] ** Event Check Loop
[1500967475.185298] [008.1] [pid=20016] Next Event Time: Tue Jul 25 09:24:35 2017
[1500967475.185300] [008.1] [pid=20016] Current/Max Service Checks: 294/0 (inf% saturation)
[1500967475.185303] [12288.1] [pid=20016] ## Polling 0ms; sockets=26; events=83812; iobs=0x7f1c56581110
[1500967475.185305] [001.0] [pid=20016] handle_timed_event() start
[1500967475.185307] [064.1] [pid=20016] Making callbacks (type 1)...
[1500967475.185313] [008.0] [pid=20016] ** Timed Event ** Type: EVENT_SERVICE_CHECK, Run Time: Tue Jul 25 09:24:35 2017
[1500967475.185319] [008.0] [pid=20016] ** Service Check Event ==> Host: 'xxxx', Service: 'xxxx', Options: 0, Latency: 0.159967 sec
[1500967475.185323] [001.0] [pid=20016] run_scheduled_service_check() start
[1500967475.185324] [016.0] [pid=20016] Attempting to run scheduled check of service 'xxxx' on host 'xxxx': check options=0, latency=0.159967

Again with someone using Thruk:

[2017/07/25 09:24:30][xxx][INFO][Thruk] [xxx][Core] cmd: COMMAND [1500967470] PROCESS_SERVICE_CHECK_RESULT;xxxx;load_enrichment_status;0;Old|                                                                                                                        
[2017/07/25 09:24:39][xxx][INFO][Thruk] [xxxx][Core] cmd: COMMAND [1500967479] PROCESS_SERVICE_CHECK_RESULT;xxx;load_parties;0;Old|
[2017/07/25 09:24:42][xxx][ERROR][Thruk] No Backend available

hedenface commented 7 years ago

So what are all the broker modules you're using with Nagios?

cat /usr/local/nagios/etc/nagios.cfg | grep broker

wleese commented 7 years ago

# Event broker
event_broker_options=-1
broker_module=/usr/lib64/check_mk/livestatus.o /var/spool/nagios/cmd/live max_cached_messages=10000000
broker_module=/usr/lib64/mod_gearman/mod_gearman_nagios4.o config=/etc/mod_gearman/mod_gearman_neb.conf

Sadly I cannot test with brokers disabled.

mod_gearman-3.0.0
check-mk-livestatus-1.2.8p24

hedenface commented 7 years ago

Well, I have an idea, if you're up for it...

Hopefully, this is on a VM.

I suggest cloning it. If that isn't an option - then simply recreating it to the best of your ability would be fine as well.

Then, make sure every single service is set to passive.

Connect one of your front ends to it and go crazy scheduling downtime.

Also, this may interest you: https://github.com/sni/mod_gearman/issues/110 - it appears to be resolved when compiled with the appropriate headers.

wleese commented 7 years ago

I just experienced the crash first-hand. My colleague hit it in the morning, and when I joined him we were able to reproduce it a second time, but not again after that.

It happens when we submit a passive result. Here's the log from Thruk:

[2017/07/26 09:52:28][xxx][INFO][Thruk] [username][Core] cmd: COMMAND [1501055548] PROCESS_SERVICE_CHECK_RESULT;hostname;servicename;0;JL: Due to flexclone|

I have a vagrant setup which I loaded with our complete config. Then I disabled all active checks and tried to reproduce this. I could not reproduce the crash locally.

wleese commented 7 years ago

Needless to say, it's unclear which component is to blame here. I've created an issue for Thruk as well: https://github.com/sni/Thruk/issues/741

wleese commented 7 years ago

Ok, I was able to reproduce this locally.

nagios.debug:

[1501057396.134888] [024.1] [pid=20635] We're not executing service checks right now, so we'll skip check event for service 'xxxx;System Processes'.
[1501057396.134915] [064.1] [pid=20635] Making callbacks (type 1)...
[1501057396.134928] [001.0] [pid=20635] reschedule_event()
[1501057396.134932] [001.0] [pid=20635] add_event()
[1501057396.134936] [064.1] [pid=20635] Making callbacks (type 13)...
[1501057396.134939] [008.1] [pid=20635] ** Event Check Loop
[1501057396.134957] [008.1] [pid=20635] Next Event Time: Wed Jul 26 10:23:16 2017
[1501057396.134960] [008.1] [pid=20635] Current/Max Service Checks: 1/0 (inf% saturation)
[1501057396.134964] [12288.1] [pid=20635] ## Polling 13ms; sockets=6; events=83289; iobs=0x7f13cb93f110
[1501057396.141547] [001.0] [pid=20635] process_external_command1()
[1501057396.141571] [064.1] [pid=20635] Making callbacks (type 17)...
[1501057396.141577] [001.0] [pid=20635] process_external_command2()
[1501057396.141581] [128.1] [pid=20635] External Command Type: 30
[1501057396.141584] [128.1] [pid=20635] Command Entry Time: 1501057396
[1501057396.141587] [128.1] [pid=20635] Command Arguments: xxxx;xxxxx;0;JL: Due to flexclone|
[1501057396.141596] [001.0] [pid=20635] handle_async_service_check_result()
[1501057396.141601] [016.0] [pid=20635] ** Handling check result for service 'xxxx' on host 'xxxx' from 'command file worker'...
[1501057396.141604] [016.1] [pid=20635] HOST: xxxx, SERVICE: XXXXXXX, CHECK TYPE: Passive, OPTIONS: 0, SCHEDULED: No, RESCHEDULE: No, EXITED OK: Yes, RETURN CODE: 0, OUTPUT: JL: Due to flexclone|
[1501057396.141611] [001.0] [pid=20635] get_service_check_return_code()
[1501057396.141624] [016.1] [pid=20635] Service is OK.
[1501057396.141628] [016.1] [pid=20635] Host is NOT UP, so we'll check it to see if it recovered...
[1501057396.141632] [001.0] [pid=20635] schedule_host_check()
[1501057396.141647] [016.0] [pid=20635] Scheduling a non-forced, active check of host 'xxxx' @ Wed Jul 26 10:23:16 2017
[1501057396.141652] [001.0] [pid=20635] add_event()
[1501057396.141659] [064.1] [pid=20635] Making callbacks (type 12)...
[1501057396.141661] [016.1] [pid=20635] Service did not change state.
[1501057396.141664] [064.1] [pid=20635] Making callbacks (type 6)...

The debug log looks good, nicely showing how Nagios picks up the command from Thruk. It definitely looks like livestatus is not relevant here, seemingly pointing to a problem in Nagios itself.

wleese commented 7 years ago

OK, now that we're able to reproduce this locally, I hooked up gdb:

(gdb) catch signal SIGSEGV
Catchpoint 1 (signal SIGSEGV)
(gdb) continue
Continuing.

Program received signal SIGPIPE, Broken pipe.
[Switching to Thread 0x7f463ede8700 (LWP 29888)]
0x00007f463e4f7cad in write () from /lib64/libc.so.6
(gdb) continue
Continuing.

Program received signal SIGPIPE, Broken pipe.
0x00007f463e4f7cad in write () from /lib64/libc.so.6
(gdb) continue
Continuing.

Catchpoint 1 (signal SIGSEGV), 0x00007f463debabf3 in handle_svc_check (event_type=6, data=0x7f463ede7a00)
    at neb_module_nagios4/../neb_module/mod_gearman.c:1141
1141    neb_module_nagios4/../neb_module/mod_gearman.c: No such file or directory.
(gdb) continue
Continuing.

Catchpoint 1 (signal SIGSEGV), 0x00007f463debabf3 in handle_svc_check (event_type=6, data=0x7f463ede7a00)
    at neb_module_nagios4/../neb_module/mod_gearman.c:1141
1141    in neb_module_nagios4/../neb_module/mod_gearman.c

(gdb) bt
#0  0x00007f463debabf3 in handle_svc_check (event_type=6, data=0x7f463ede7a00) at neb_module_nagios4/../neb_module/mod_gearman.c:1141
#1  0x00007f463ef10aff in neb_make_callbacks ()
#2  0x00007f463ef0eee0 in broker_service_check ()
#3  0x00007f463ef1abd3 in handle_async_service_check_result ()
#4  0x00007f463ef2306e in process_passive_service_check ()
#5  0x00007f463ef23193 in cmd_process_service_check_result ()
#6  0x00007f463ef28c15 in process_external_command1 ()
#7  0x00007f463cca9cd4 in Store::answerCommandRequest(char const*) () from /usr/lib64/check_mk/livestatus.o
#8  0x00007f463ccaa2bd in Store::answerRequest(InputBuffer*, OutputBuffer*) () from /usr/lib64/check_mk/livestatus.o
#9  0x00007f463cca99e6 in store_answer_request () from /usr/lib64/check_mk/livestatus.o
#10 0x00007f463cce8540 in client_thread () from /usr/lib64/check_mk/livestatus.o
#11 0x00007f463dc96dc5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f463e50676d in clone () from /lib64/libc.so.6

nagios.log:

[1501059260] Caught SIGTERM, shutting down...

https://github.com/sni/mod_gearman/blob/v3.0.4/neb_module/mod_gearman.c#L1124

hedenface commented 7 years ago

I actually had a similar problem a long time ago when I was doing some NEB stuff with gearmand. The broken pipe is an issue with the gearman daemon itself.

You should be able to solve it a few ways - I forget specifically - but I think a workaround was to increase open file limits for the gearman user; otherwise you can recompile against the newest version of libgearman.

Unfortunately, I think this is not a Core issue but a ModGearman issue. As you can see from the backtrace, there is quite a bit more than Nagios source being touched. But, as I posted in my last correspondence, ModGearman apparently compiles fine with the appropriate Nagios headers.

In other news, native core remote workers are being worked on for the Core 5 release :)

wleese commented 7 years ago

Looking at https://github.com/sni/mod_gearman/issues/122#issuecomment-318012628 it seems that there's some memory corruption going on. My only question: since it looks like a memory corruption issue, can this simply be considered a mod_gearman issue, or could the corruption have been caused elsewhere?

File limits were raised a while back:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 514510
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 32768

> native core remote workers are being worked on for the Core 5 release

Cool, but I literally can't wait for a new release ;)

hedenface commented 7 years ago

Increase the open files waaaayyy beyond that. I'm not sure why, or what caused it, but I had this issue in a development environment while I was doing some neat stuff.

I saw that comment, and I still think it's likely a ModGearman (BECAUSE OF libgearman) issue.

wleese commented 7 years ago

> I still think it's likely a ModGearman (BECAUSE OF libgearman) issue.

Alright. Any further advice on a known good version, or is this uncharted territory?

> Increase the open files waaaayyy beyond that.

Just looked at our metrics and I see that all users in total never use more than 10k file handles.

hedenface commented 7 years ago

Whatever the newest was is what I did. I do not remember the version number.

I thought the same thing in regards to what was actually being opened. And I may be wrong here - but I remember that increasing the open file limit way larger than any had ever been used actually helped somewhat :) Maybe it was something else entirely, but it worked.

wleese commented 7 years ago

https://github.com/sni/mod_gearman/issues/122#issuecomment-318308844

> The problem here is that livestatus calls process_external_command1 from a separate thread, which then corrupts the memory. The Naemon/Nagios core is not threadsafe. The workaround was to send the command to the queryhandler, so Naemon can read/process the command from inside the main thread.
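
In other words, the safe path is to hand commands to Core through a channel its main thread reads (the query handler, or the classic external command file) rather than calling into Core from a foreign thread. A minimal sketch of the command-file variant (the helper name and path are examples):

#include <stdio.h>
#include <time.h>

/* hypothetical helper: submit a passive result by writing to the command
   FIFO instead of calling process_external_command1() from another thread */
static int submit_passive_result(const char *cmd_file, const char *host,
                                 const char *svc, int state, const char *output)
{
    FILE *fp = fopen(cmd_file, "w");  /* e.g. /var/spool/nagios/cmd/nagios.cmd */

    if (fp == NULL)
        return -1;
    fprintf(fp, "[%lu] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n",
            (unsigned long)time(NULL), host, svc, state, output);
    fclose(fp);
    return 0;
}

This is essentially what the Store.cc workaround below does, minus the detour through a shell.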

wleese commented 7 years ago

Worked around this issue with this mess:

/mk-livestatus-1.2.8p25/src git:(master) ✗ diff -u Store.cc.org Store.cc
--- Store.cc.org    2017-08-01 10:55:37.555389499 +0200
+++ Store.cc    2017-08-01 12:35:21.541926533 +0200
@@ -21,6 +21,10 @@
 // License along with GNU Make; see the file  COPYING.  If  not,  write
 // to the Free Software Foundation, Inc., 51 Franklin St,  Fifth Floor,
 // Boston, MA 02110-1301 USA.
+extern "C" {
+   #include "nagios4/libnagios.h"
+}
+

 #include "Store.h"
 #include <string.h>
@@ -34,6 +38,7 @@
 #include "global_counters.h"
 #include "logger.h"
 #include "strutil.h"
+#include <fstream>

 // TODO(sp): Remove this hack.
 #ifdef EXTERN
@@ -176,10 +181,46 @@
     return output->doKeepalive();
 }

+/* define a fake iobroker_register function for the libnagios call */
+static void fake_iobreg(int fdout, int fderr, void *arg) { }
+
 void Store::answerCommandRequest(const char *command) {
     lock_guard<mutex> lg(_command_mutex);
 #ifdef NAGIOS4
-    process_external_command1((char *)command);
+    const char *nagioscmd_file = "/var/spool/nagios/cmd/nagios.cmd";
+
+    if (std::ifstream(nagioscmd_file)) {
+       int BUFFER = 128;
+       char *cmd;
+       int pfd[2] = {-1, -1};
+       int pfderr[2] = {-1, -1};
+       int fake_iobregarg = 0;
+       int fd;
+       char *out = (char*)calloc(1, BUFFER);
+
+       asprintf(&cmd, "echo \"%s\" > %s", command, nagioscmd_file);
+       fd = runcmd_open(cmd, pfd, pfderr, NULL, fake_iobreg, &fake_iobregarg);
+
+       /* get the output from the stdout file descriptor into the out var */
+       read(pfd[0], out, BUFFER);
+
+       runcmd_close(fd);
+
+       if (g_debug_level > 0) {
+          logger(LG_INFO,
+                "External Command redirected to %s: %s",
+                nagioscmd_file, cmd);
+       }
+
+       /* house-keeping */
+       free(cmd);
+       free(out);
+       close(pfd[0]);
+       close(pfderr[0]);
+       close(fd);
+    } else {
+       logger(LG_INFO, "External Command file missing");
+    }
 #else
     int buffer_items = -1;
     /* int ret = */

That said, I fear Thruk and livestatus have no future.

hedenface commented 7 years ago

Thanks for sharing your findings. I can't speak to the success or future of Thruk and livestatus - but I can say that we have some exciting things (that may or may not be similar features) coming when Core5 is released :)

wleese commented 7 years ago

...and thank you guys for working with me on this. Looking forward to Core5.