pattersdan opened this issue 7 years ago
Are you able to run an strace on it while it's acting up?
What I've been able to grab is attached.
It's mostly
brk(NULL) = 0x2e3c000
brk(0x2e5d000) = 0x2e5d000
brk(NULL) = 0x2e5d000
brk(0x2e7e000) = 0x2e7e000
brk(NULL) = 0x2e7e000
brk(0x2e9f000) = 0x2e9f000
brk(NULL) = 0x2e9f000
brk(0x2ec0000) = 0x2ec0000
brk(NULL) = 0x2ec0000
and
write(1, "der=0 cellspacing=0 cellpadding="..., 4096) = 4096
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3559, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3559, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3559, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3559, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3559, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3559, ...}) = 0
Try running the following command as whatever user the CGIs run as:
export TZ=:/etc/localtime
Please update the thread with the results.
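For background on why this can help (a glibc detail that matches the stat() storm above): when TZ is unset, glibc may stat(2) /etc/localtime on every localtime() call to detect timezone changes; TZ=:/etc/localtime names the same file explicitly so the result can be cached. A minimal repro sketch, assuming glibc:

#include <stdio.h>
#include <time.h>

int main(void) {
    time_t now = time(NULL);
    int i;

    /* Run this under `strace -e trace=stat` once with and once without
     * TZ=:/etc/localtime exported, and compare the stat() counts. */
    for (i = 0; i < 5; i++) {
        struct tm *tm = localtime(&now);  /* may stat() /etc/localtime per call */
        printf("%02d:%02d:%02d\n", tm->tm_hour, tm->tm_min, tm->tm_sec);
    }
    return 0;
}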
I tried the above command. It didn't make much of a difference time-wise, though the strace looks a bit different. I watched an strace after I captured this one, and the 7 seconds is when all the
12832 20:04:26 brk(NULL) = 0x2009000
12832 20:04:26 brk(0x202a000) = 0x202a000
12832 20:04:26 brk(NULL) = 0x202a000
12832 20:04:26 brk(0x204b000) = 0x204b000
12832 20:04:26 brk(NULL) = 0x204b000
12832 20:04:26 brk(0x206c000) = 0x206c000
stuff is going by. The page starts to visibly load right after that.
I've attached the latest strace, and I appreciate your help!! strace3.txt
It looks like it's putting an awful lot of stuff in memory before it actually starts doing anything (5 seconds of brk()). This is probably some kind of inefficiency in the code on our part - can I see the output of sysctl -a | grep ^vm
and the web config for Core?
Also - how many hosts/services are we talking about?
It's 98 hosts with about 1300 services. I migrated the config files from a Nagios 3.1.2 installation. I had to tweak them a little bit to get them to work under this new installation; is there a possibility that's the issue? webconf.txt
sudo sysctl -a | grep ^vm
sysctl: reading key "net.ipv6.conf.all.stable_secret"
sysctl: reading key "net.ipv6.conf.default.stable_secret"
sysctl: reading key "net.ipv6.conf.eth0.stable_secret"
sysctl: reading key "net.ipv6.conf.lo.stable_secret"
vm.admin_reserve_kbytes = 8192
vm.block_dump = 0
vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 30
vm.dirty_writeback_centisecs = 500
vm.drop_caches = 0
vm.extfrag_threshold = 500
vm.hugepages_treat_as_movable = 0
vm.hugetlb_shm_group = 0
vm.laptop_mode = 0
vm.legacy_va_layout = 0
vm.lowmem_reserve_ratio = 256 256 32
vm.max_map_count = 65530
vm.memory_failure_early_kill = 0
vm.memory_failure_recovery = 1
vm.min_free_kbytes = 67584
vm.min_slab_ratio = 5
vm.min_unmapped_ratio = 1
vm.mmap_min_addr = 4096
vm.mmap_rnd_bits = 28
vm.mmap_rnd_compat_bits = 8
vm.nr_hugepages = 0
vm.nr_hugepages_mempolicy = 0
vm.nr_overcommit_hugepages = 0
vm.nr_pdflush_threads = 0
vm.numa_zonelist_order = default
vm.oom_dump_tasks = 1
vm.oom_kill_allocating_task = 0
vm.overcommit_kbytes = 0
vm.overcommit_memory = 0
vm.overcommit_ratio = 50
vm.page-cluster = 3
vm.panic_on_oom = 0
vm.percpu_pagelist_fraction = 0
vm.stat_interval = 1
vm.swappiness = 30
vm.user_reserve_kbytes = 131072
vm.vfs_cache_pressure = 100
vm.zone_reclaim_mode = 0
We're observing the exact same behavior with both Nagios 4.1.1 and 4.3.2 on CentOS 6.
Is this repeatable?
Shouldn't a little instrumentation around the malloc() calls in cgi/status.c give us a clue? You could configure malloc() debugging with mtrace() (see man 3 mtrace for an example). This should tell us where the allocations are coming from.
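For reference, here's a minimal sketch of how mtrace() could be wired into a CGI's main() - this is illustrative only, not the actual status.c code:

#include <mcheck.h>

int main(int argc, char **argv) {
    mtrace();    /* log every malloc/free to the file named by $MALLOC_TRACE */
    /* ... the CGI's normal work (cgi_init(), status output, ...) goes here ... */
    muntrace();  /* stop tracing before exit */
    return 0;
}

Run it with e.g. MALLOC_TRACE=/tmp/status.mtrace exported, then post-process the log with glibc's mtrace(1) script: mtrace ./status.cgi /tmp/status.mtrace.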
FWIW: The difference between the brk() arguments is 0x21000 (135,168 bytes, i.e. 132 KiB). This is the default glibc heap allocation (including padding).
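(An aside, stated as an assumption worth verifying: glibc pads each brk() heap extension by the M_TOP_PAD mallopt parameter, whose default is 128 KiB - 128 KiB plus one 4 KiB page is exactly 0x21000. If the brk() churn itself ever mattered, the padding could be raised, roughly like this:)

#include <malloc.h>

int main(void) {
    /* Grow the heap in ~1 MiB steps instead of the default 128 KiB,
     * trading some memory slack for far fewer brk() system calls. */
    mallopt(M_TOP_PAD, 1024 * 1024);
    /* ... allocation-heavy work ... */
    return 0;
}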
I've tried an mtrace of status.cgi myself. Unfortunately, it does not trace strdup()'s memory allocations.
What I saw: There are many tiny allocations.
Then I tried valgrind's massif tool and came up with this (use a QUERY_STRING that suits your setup):
$ REMOTE_USER=nagiosadmin REQUEST_METHOD=GET QUERY_STRING="hostgroup=REDACTED&style=detail&hoststatustypes=3&serviceprops=42&servicestatustypes=28&hostprops=42" valgrind --xtree-memory=full --tool=massif ./status.cgi
$ ms_print massif.out.27107 | tee massif.report.txt
$ callgrind_annotate --auto=yes --inclusive=yes --sort=curB:100,curBk:100,totB:100,totBk:100,totFdB:100,totFdBk:100 xtmemory.kcg.27107 |tee xtmemory.report.txt
Here's the beginning of the generated massif.report.txt:
--------------------------------------------------------------------------------
Command: ./status.cgi
Massif arguments: --xtree-memory=full
ms_print arguments: massif.out.27107
--------------------------------------------------------------------------------
MB
150.3^ #
| ::::::#
| ::::: #
| ::::: # :
| ::::: # @:::::::::
| :::::: # :::::@:::: :
| :::::: # ::::::::: :@:::: :
| :::::: # ::::@: ::: ::: :@:::: :
| :::::: # ::::::::@: ::: ::: :@:::: :
| :::::: #:::::::: ::::@: ::: ::: :@:::: :
| :::::: #:::: ::: ::::@: ::: ::: :@:::: :
| ::@@:::::: #:::: ::: ::::@: ::: ::: :@:::: :
| :::::@ :::::: #:::: ::: ::::@: ::: ::: :@:::: :
| @@:::::@ :::::: #:::: ::: ::::@: ::: ::: :@:::: :
| @@:@@ :::::@ :::::: #:::: ::: ::::@: ::: ::: :@:::: :
| @@@ :@@ :::::@ :::::: #:::: ::: ::::@: ::: ::: :@:::: :
| ::@@@@ :@@ :::::@ :::::: #:::: ::: ::::@: ::: ::: :@:::: :
| :@@::@@@@ :@@ :::::@ :::::: #:::: ::: ::::@: ::: ::: :@:::: :
| @@@:@ ::@@@@ :@@ :::::@ :::::: #:::: ::: ::::@: ::: ::: :@:::: :
| @@:@ @:@ ::@@@@ :@@ :::::@ :::::: #:::: ::: ::::@: ::: ::: :@:::: :
0 +----------------------------------------------------------------------->Gi
0 32.76
Number of snapshots: 58
Detailed snapshots: [2, 4, 5, 7, 10, 11, 12, 13, 15, 16, 22, 29 (peak), 41, 51]
--------------------------------------------------------------------------------
n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
--------------------------------------------------------------------------------
0 0 0 0 0 0
1 546,311,263 4,470,808 3,276,914 1,193,894 0
2 1,403,757,742 9,289,408 6,588,744 2,700,664 0
70.93% (6,588,744B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->23.73% (2,204,425B) 0x4EB9808: strdup (in /usr/lib64/libc-2.17.so)
| ->06.55% (608,133B) in 47 places, all below massif's threshold (1.00%)
| |
| ->06.42% (596,022B) 0x4273C8: xodtemplate_add_object_property (xodtemplate.c:1505)
| | ->06.42% (596,022B) 0x424DF1: xodtemplate_process_config_file (xodtemplate.c:744)
| | ->06.42% (596,022B) 0x4244CE: xodtemplate_read_config_data (xodtemplate.c:354)
| | ->06.42% (596,022B) 0x41DB6C: read_object_config_data (objects.c:272)
| | ->06.42% (596,022B) 0x413F0D: read_all_object_configuration_data (cgiutils.c:584)
| | ->06.42% (596,022B) 0x4142F5: cgi_init (cgiutils.c:715)
| | ->06.42% (596,022B) 0x401BDB: main (status.c:213)
| |
| ->05.28% (490,640B) 0x42C554: xodtemplate_add_object_property (xodtemplate.c:2754)
| | ->05.28% (490,640B) 0x424DF1: xodtemplate_process_config_file (xodtemplate.c:744)
| | ->05.28% (490,640B) 0x4244CE: xodtemplate_read_config_data (xodtemplate.c:354)
| | ->05.28% (490,640B) 0x41DB6C: read_object_config_data (objects.c:272)
| | ->05.28% (490,640B) 0x413F0D: read_all_object_configuration_data (cgiutils.c:584)
| | ->05.28% (490,640B) 0x4142F5: cgi_init (cgiutils.c:715)
| | ->05.28% (490,640B) 0x401BDB: main (status.c:213)
| |
| ->02.93% (272,608B) 0x4279B7: xodtemplate_add_object_property (xodtemplate.c:1623)
| | ->02.93% (272,608B) 0x424DF1: xodtemplate_process_config_file (xodtemplate.c:744)
| | ->02.93% (272,608B) 0x4244CE: xodtemplate_read_config_data (xodtemplate.c:354)
| | ->02.93% (272,608B) 0x41DB6C: read_object_config_data (objects.c:272)
| | ->02.93% (272,608B) 0x413F0D: read_all_object_configuration_data (cgiutils.c:584)
| | ->02.93% (272,608B) 0x4142F5: cgi_init (cgiutils.c:715)
| | ->02.93% (272,608B) 0x401BDB: main (status.c:213)
| |
| ->01.32% (122,439B) 0x42A785: xodtemplate_add_object_property (xodtemplate.c:2340)
| | ->01.32% (122,439B) 0x424DF1: xodtemplate_process_config_file (xodtemplate.c:744)
| | ->01.32% (122,439B) 0x4244CE: xodtemplate_read_config_data (xodtemplate.c:354)
| | ->01.32% (122,439B) 0x41DB6C: read_object_config_data (objects.c:272)
| | ->01.32% (122,439B) 0x413F0D: read_all_object_configuration_data (cgiutils.c:584)
| | ->01.32% (122,439B) 0x4142F5: cgi_init (cgiutils.c:715)
| | ->01.32% (122,439B) 0x401BDB: main (status.c:213)
| |
| ->01.23% (114,583B) 0x42A333: xodtemplate_add_object_property (xodtemplate.c:2276)
| ->01.23% (114,583B) 0x424DF1: xodtemplate_process_config_file (xodtemplate.c:744)
| ->01.23% (114,583B) 0x4244CE: xodtemplate_read_config_data (xodtemplate.c:354)
| ->01.23% (114,583B) 0x41DB6C: read_object_config_data (objects.c:272)
| ->01.23% (114,583B) 0x413F0D: read_all_object_configuration_data (cgiutils.c:584)
| ->01.23% (114,583B) 0x4142F5: cgi_init (cgiutils.c:715)
| ->01.23% (114,583B) 0x401BDB: main (status.c:213)
|
->23.48% (2,181,528B) 0x425D4A: xodtemplate_begin_object_definition (xodtemplate.c:1025)
| ->23.48% (2,181,528B) 0x424D3A: xodtemplate_process_config_file (xodtemplate.c:715)
| ->23.48% (2,181,528B) 0x4244CE: xodtemplate_read_config_data (xodtemplate.c:354)
| ->23.48% (2,181,528B) 0x41DB6C: read_object_config_data (objects.c:272)
| ->23.48% (2,181,528B) 0x413F0D: read_all_object_configuration_data (cgiutils.c:584)
| ->23.48% (2,181,528B) 0x4142F5: cgi_init (cgiutils.c:715)
| ->23.48% (2,181,528B) 0x401BDB: main (status.c:213)
|
->18.72% (1,738,800B) 0x425B75: xodtemplate_begin_object_definition (xodtemplate.c:1003)
| ->18.72% (1,738,800B) 0x424D3A: xodtemplate_process_config_file (xodtemplate.c:715)
| ->18.72% (1,738,800B) 0x4244CE: xodtemplate_read_config_data (xodtemplate.c:354)
| ->18.72% (1,738,800B) 0x41DB6C: read_object_config_data (objects.c:272)
| ->18.72% (1,738,800B) 0x413F0D: read_all_object_configuration_data (cgiutils.c:584)
| ->18.72% (1,738,800B) 0x4142F5: cgi_init (cgiutils.c:715)
| ->18.72% (1,738,800B) 0x401BDB: main (status.c:213)
|
[...]
Here's the beginning of the generated xtmemory.report.txt:
--------------------------------------------------------------------------------
Profile data file 'xtmemory.kcg.27107' (creator: xtree-1)
--------------------------------------------------------------------------------
Profiled target: ./status.cgi (PID 27107)
Events recorded: curB curBk totB totBk totFdB totFdBk
Events shown: curB curBk totB totBk totFdB totFdBk
Event sort order: curB curBk totB totBk totFdB totFdBk
Thresholds: 100 100 100 100 100 100
Include dirs:
User annotated:
Auto-annotation: on
--------------------------------------------------------------------------------
curB curBk totB totBk totFdB totFdBk
--------------------------------------------------------------------------------
2,611,324 325,271 521,323,496 15,877,344 518,712,172 15,552,073 PROGRAM TOTALS
--------------------------------------------------------------------------------
curB curBk totB totBk totFdB totFdBk file:function
--------------------------------------------------------------------------------
2,611,324 325,271 521,323,496 15,877,344 518,706,950 15,552,040 status.c:main
2,445,481 324,858 521,117,175 15,875,832 411,212,553 14,293,051 cgiutils.c:cgi_init
2,313,991 324,973 36,326,097 1,454,524 0 0 UnknownFile???:strdup
2,309,982 324,754 325,190,313 9,291,260 255,406,655 8,173,727 cgiutils.c:read_all_object_configuration_data
2,309,982 324,754 325,090,857 9,287,116 255,406,655 8,173,727 ../common/objects.c:read_object_config_data
2,309,982 324,752 325,090,857 9,287,114 255,406,655 8,173,727 ../xdata/xodtemplate.c:xodtemplate_read_config_data
1,130,007 100,673 253,933,329 8,164,617 202,038,877 7,301,747 ../xdata/xodtemplate.c:xodtemplate_process_config_file
1,130,007 100,673 30,598,677 862,900 13,611,923 106,743 ../xdata/xodtemplate.c:xodtemplate_add_object_property
1,055,040 211,008 67,948,121 973,879 0 0 ../xdata/xodtemplate.c:xodtemplate_register_objects
1,006,730 201,346 59,455,774 805,384 0 0 ../xdata/xodtemplate.c:xodtemplate_register_service
1,006,730 201,346 57,039,622 704,711 0 0 ../common/objects.c:add_service
249,870 26,142 940,750 52,289 272,608 5 ../xdata/xodtemplate.c:xodtemplate_expand_services
163,007 251 166,772 502 3,765 251 status.c:show_service_detail
163,007 251 163,007 251 0 0 cgiutils.c:html_encode
135,365 97 195,797,952 6,580,815 155,677,157 6,115,575 cgiutils.c:read_all_status_data
131,088 2 131,088 2 0 0 /root/rpmbuild/BUILD/nagios-4.3.4/lib/fanout.c:fanout_create
131,088 2 131,088 2 0 0 ../common/downtime.c:initialize_downtime_data
124,935 13,071 878,835 39,228 754,352 13,081 ../xdata/xodtemplate.c:xodtemplate_recombobulate_servicegroups
48,300 9,660 2,567,620 32,438 0 0 ../xdata/xodtemplate.c:xodtemplate_register_host
48,300 9,660 2,328,060 19,320 0 0 ../common/objects.c:add_host
4,277 95 195,666,864 6,580,813 155,677,157 6,115,575 ../xdata/xsddefault.c:xsddefault_read_status_data
4,277 95 195,666,864 6,580,813 155,677,157 6,115,575 ../common/statusdata.c:read_status_data
4,277 95 4,277 95 0 0 ../common/downtime.c:add_downtime
4,277 95 4,277 95 0 0 ../common/downtime.c:add_host_downtime
2,709 159 6,525 160 3,816 1 ../common/macros.c:init_macros
2,709 159 2,709 159 0 0 ../common/macros.c:init_macrox_names
456 19 456 19 0 0 /root/rpmbuild/BUILD/nagios-4.3.4/lib/fanout.c:fanout_add
456 19 456 19 0 0 ../common/downtime.c:downtime_add
115 2 6,095 50 2,362 21 status.c:process_cgivars
104 1 6,084 49 2,259 9 getcgi.c:getcgivars
94 4 101,933 2,955 101,839 2,951 cgiutils.c:read_main_config_file
82 3 478 12 371 8 /root/rpmbuild/BUILD/nagios-4.3.4/lib/nspath.c:nspath_normalize
82 3 478 12 371 8 /root/rpmbuild/BUILD/nagios-4.3.4/lib/nspath.c:nspath_absolute
82 3 107 4 0 0 /root/rpmbuild/BUILD/nagios-4.3.4/lib/nspath.c:pcomp_construct
26 1 26 1 0 0 /root/rpmbuild/BUILD/nagios-4.3.4/base/../common/shared.c:init_shared_cfg_vars
14 2 26,951 801 26,902 798 cgiutils.c:read_cgi_config_file
12 1 26,914 799 26,902 798 cgiauth.c:get_authentication_information
12 1 124 4 112 3 /root/rpmbuild/BUILD/nagios-4.3.4/lib/nspath.c:nspath_absolute_dirname
10 2 335 8 0 0 ../xdata/xodtemplate.c:xodtemplate_register_contact
10 2 242 4 0 0 ../common/objects.c:add_contact
0 2 0 2 0 0 ../common/objects.c:post_process_object_config
0 0 249,404,696 9,692,541 0 0 /root/rpmbuild/BUILD/nagios-4.3.4/base/../common/shared.c:mmap_fgets
0 0 188,581,920 7,199,534 94,290,960 3,599,767 /root/rpmbuild/BUILD/nagios-4.3.4/base/../common/shared.c:mmap_fgets_multiline
0 0 34,907,672 106,712 0 0 ../xdata/xodtemplate.c:xodtemplate_begin_object_definition
0 0 16,173,264 213,424 13,611,232 106,712 /root/rpmbuild/BUILD/nagios-4.3.4/lib/skiplist.c:skiplist_insert
0 0 3,414,784 106,712 0 0 /root/rpmbuild/BUILD/nagios-4.3.4/lib/dkhash.c:dkhash_insert
0 0 3,221,536 100,673 0 0 ../common/objects.c:add_service_link_to_host
0 0 3,037,280 24 0 0 ../common/objects.c:create_object_tables
0 0 2,564,720 106,734 0 0 /root/rpmbuild/BUILD/nagios-4.3.4/lib/skiplist.c:skiplist_new_node
0 0 2,532,072 105,503 0 0 ../common/objects.c:add_contactgroup_to_object
0 0 2,416,152 100,673 0 0 ../common/objects.c:add_contactgroup_to_service
0 0 2,311,472 109,334 1,455,398 54,667 ../xdata/xodtemplate.c:xodtemplate_recombobulate_hostgroups
0 0 2,183,584 16 0 0 /root/rpmbuild/BUILD/nagios-4.3.4/lib/dkhash.c:dkhash_create
0 0 2,148,440 107,422 0 0 ../xdata/xodtemplate.c:xodtemplate_register_hostgroup_members
0 0 2,148,440 107,422 0 0 ../common/objects.c:add_host_to_hostgroup
0 0 2,137,056 133,566 0 0 ../common/objects.c:prepend_object_to_objectlist
0 0 1,068,528 66,783 0 0 ../common/objects.c:add_object_to_objectlist
0 0 1,068,528 66,783 0 0 ../xdata/xodtemplate.c:_xodtemplate_add_group_member
0 0 1,055,197 28,595 0 0 ../common/comments.c:add_comment
[...]
FWIW, I also tried a perf record / perf report of a status.cgi call from the shell. Here's the report to give us an idea where the CPU time is spent - however, please note that this does not show the memory allocations:
Children Self Command Shared Object Symbol
+ 90.91% 0.00% status.cgi libc-2.17.so [.] __libc_start_main
+ 90.91% 0.00% status.cgi status.cgi [.] main
+ 73.87% 0.00% status.cgi status.cgi [.] cgi_init
+ 39.41% 0.06% status.cgi status.cgi [.] read_all_object_configuration_data
+ 39.16% 0.00% status.cgi status.cgi [.] read_object_config_data
+ 39.16% 0.00% status.cgi status.cgi [.] xodtemplate_read_config_data
+ 34.46% 0.00% status.cgi status.cgi [.] read_all_status_data
+ 34.46% 0.00% status.cgi status.cgi [.] read_status_data
- 24.44% 24.44% status.cgi libc-2.17.so [.] __strcmp_sse42
__libc_start_main
- main
- 23.93% cgi_init
- 12.59% read_all_object_configuration_data
read_object_config_data
- xodtemplate_read_config_data
- 7.02% xodtemplate_process_config_file
6.58% __strcmp_sse42
- 5.26% xodtemplate_register_objects
- 3.45% xodtemplate_register_servicegroup_members
__strcmp_sse42
- 1.69% xodtemplate_register_hostgroup_members
__strcmp_sse42
- 11.34% read_all_status_data
- read_status_data
9.34% __strcmp_sse42
- 2.01% xsddefault_read_status_data
- 1.94% add_service_status
add_servicestatus_to_hashlist
__strcmp_sse42
+ 22.87% 2.01% status.cgi status.cgi [.] xodtemplate_process_config_file
+ 18.98% 1.88% status.cgi status.cgi [.] xsddefault_read_status_data
- 14.35% 14.35% status.cgi status.cgi [.] is_host_member_of_hostgroup
__libc_start_main
- main
- 13.22% show_service_status_totals
is_host_member_of_hostgroup
- 0.63% show_host_status_totals
is_host_member_of_hostgroup
+ 14.04% 0.13% status.cgi status.cgi [.] show_service_status_totals
+ 11.34% 10.28% status.cgi status.cgi [.] mmap_fgets
+ 8.02% 0.00% status.cgi status.cgi [.] xodtemplate_register_objects
+ 5.83% 5.83% status.cgi status.cgi [.] add_object_to_objectlist
+ 5.58% 5.58% status.cgi status.cgi [.] strip
+ 5.26% 3.70% status.cgi status.cgi [.] xodtemplate_add_object_property
+ 5.20% 5.20% status.cgi libc-2.17.so [.] strtok
+ 5.08% 0.63% status.cgi status.cgi [.] mmap_fgets_multiline
+ 5.08% 0.13% status.cgi status.cgi [.] xodtemplate_register_servicegroup_members
+ 4.76% 2.44% status.cgi status.cgi [.] add_servicestatus_to_hashlist
What I've noticed so far:
- cgiutils.c:cgi_init() takes a lot of time and allocates a lot of memory.
- is_host_member_of_hostgroup() takes a lot of CPU time.
- common/shared.c:mmap_fgets() and common/shared.c:mmap_fgets_multiline() look very suspicious. They execute a malloc() + free() for each line:

. . . . . . /* calculate length of line we just read */
. . . . . . len = (int)(x - temp_mmapfile->current_position);
. . . . . .
. . . . . . /* allocate memory for the new line */
0 0 249,404,696 9,692,541 0 0 if((buf = (char *)malloc(len + 1)) == NULL)
. . . . . . return NULL;
. . . . . .
. . . . . . /* copy string to newly allocated memory and terminate the string */
. . . . . . memcpy(buf,
. . . . . . ((char *)(temp_mmapfile->mmap_buf) +
. . . . . . temp_mmapfile->current_position), len);
. . . . . . buf[len] = '\x0';

- strip() is expensive.
Thoughts? Take a look at the entire reports for much more detailed information:
BTW: I have a Nagios instance with an 87 MB objects.cache file. Reading and processing this cache file alone (during each and every CGI startup!) takes 3.6s.
FWIW: My nagios-4.3.4 debug build of status.cgi (CFLAGS with -O0 -ggdb3) on CentOS 7.5 takes 8.3s:
# prepare env vars (mentioned above)
$ time perf record -g -- ./cgi/status.cgi
real 0m8.361s
user 0m8.069s
sys 0m0.285s
I've also created a FlameGraph to visualize where the cpu time is spent:
$ perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl --title "status.cgi" --width 2500 --height 24 > perf.svg
$ convert perf.svg perf.png
Here's the result; a better version is perf.svg.gz (load the SVG in your browser to zoom in interactively).
I wonder if a lexer that returns (integer) tokens would help to get rid of all these slow strcmp()s of the current "manual parser" (where a major part of the CPU time is spent). Unfortunately, this is a major change.
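To make the idea concrete, here's a minimal sketch (illustrative only - the directive names and token values are made up, and a real lexer would need to cover all object properties): keywords are resolved to integer tokens once via bsearch() over a sorted table, so the hot path does O(log n) comparisons per word instead of strcmp()ing against every candidate.

#include <stdlib.h>
#include <string.h>

enum token { TOK_UNKNOWN = 0, TOK_HOST_NAME, TOK_ALIAS, TOK_ADDRESS };

struct keyword { const char *name; enum token tok; };

/* Table must stay sorted by name for bsearch(). */
static const struct keyword keywords[] = {
    { "address",   TOK_ADDRESS },
    { "alias",     TOK_ALIAS },
    { "host_name", TOK_HOST_NAME },
};

static int kw_cmp(const void *key, const void *elem) {
    return strcmp((const char *)key, ((const struct keyword *)elem)->name);
}

enum token lex_keyword(const char *word) {
    const struct keyword *kw = bsearch(word, keywords,
            sizeof(keywords) / sizeof(keywords[0]),
            sizeof(keywords[0]), kw_cmp);
    return kw ? kw->tok : TOK_UNKNOWN;
}

The parser's hot loop could then switch on the integer token instead of running a strcmp() cascade per property.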
The major improvement for Core 5 is going to be having a database backend (e.g.: merging NDO into Core). Once that happens, the plan is to switch all the cgis to use that backend.
Even in XI these problems exist, because most of XI's reports are simply using the cgis and parsing that data - so it's even another layer.
The development cycle is going to look something like this (for me) in order:
That said - Core5 will likely be a year of work, I bet (and nothing ever goes as smoothly as planned). So, if there are some quick hacks we can do to get the cgis working better for everyone, I'm all for that.
I'll take a look at your reports. Also, I plan on looking through the duplicate tokens as well and breaking whatever I can out into functions.
I've got a prototype for the mmap_getline family of functions. Accepting critique:
int get_line(char ** data, unsigned long data_len, char * line)
{
    int i = 0;

    if (data == NULL || *data == NULL) {
        line[0] = 0;
        return -1;
    }

    if ((*data)[0] == 0) {
        line[0] = 0;
        return -2;
    }

    if (data_len <= 0L) {
        data_len = (unsigned long) strlen(*data);
        /* make sure it really is 0 */
        if (data_len <= 0L) {
            line[0] = 0;
            return -3;
        }
    }

    for (i = 0; i < data_len; i++) {
        line[i] = (*data)[i];
        /* looking for newlines */
        if ((*data)[i] == '\n') {
            (*data) += i + 1;
            line[i] = 0;
            return i;
        }
    }

    return 0;
}
It forces the caller to keep track of the original pointer to **data, but this is trivial. It also forces whatever array goes into *line to have a large enough buffer - but also this is trivial. Most importantly, it removes many many many malloc()s that happen along the way.
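A hypothetical caller might look like this (illustrative - it assumes a NUL-terminated buffer in which every line, including the last, ends in '\n', and mmap_buf/mmap_len are made-up names for the mapping):

char linebuf[16384];            /* must be at least as long as the longest line */
const char *base = mmap_buf;    /* keep the original pointer around */
char *cursor = mmap_buf;
unsigned long remaining = mmap_len;

while (get_line(&cursor, remaining, linebuf) >= 0) {
    remaining = mmap_len - (unsigned long)(cursor - base);
    /* process linebuf[] here - no per-line malloc()/free() */
}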
@hedenface Sorry for the delay. I'm very busy at the moment. Some comments:
- data_len is unsigned (i.e. always >= 0), but there are if statements like this: if (data_len <= 0L). (What about using types like size_t or ssize_t instead of unsigned long?)
- data could be const char **.
- strip() is missing here. Also \r handling. OTOH, the object_cache doesn't necessarily have to be nicely formatted/indented.
- Watch out for a line buffer overrun: you would have to make sure that it is at least data_len bytes long.
- FYI: I have a servicegroup definition with a length of 187161(!) chars. I think I have to rethink my service groups... (cat /var/nagios/objects.cache|awk '{print length, $0}'|sort -nr|head -1, or wc -L /var/nagios/objects.cache if you are only interested in the number)

This was just the prototype, hence missing the \r, \t, etc.
I think setting it to a 16/32K default and having that configurable would be a nice option. That way there is no need for recompilation.
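For illustration, here's a revised sketch that folds in the review points (size_t lengths, const input, an explicit output capacity that could carry the configurable 16/32K default, and \r stripping). The name get_line2 and the return-value convention are made up here, not committed code:

#include <string.h>
#include <sys/types.h>

ssize_t get_line2(const char **data, size_t data_len, char *line, size_t line_len)
{
    size_t i;

    if (data == NULL || *data == NULL || line == NULL || line_len == 0)
        return -1;
    if ((*data)[0] == '\0')
        return -2;                  /* end of buffer */
    if (data_len == 0)
        data_len = strlen(*data);

    for (i = 0; i < data_len && i + 1 < line_len; i++) {
        line[i] = (*data)[i];
        if (line[i] == '\n') {
            *data += i + 1;
            if (i > 0 && line[i - 1] == '\r')
                i--;                /* drop a trailing \r as well */
            line[i] = '\0';
            return (ssize_t)i;      /* line length without \n or \r\n */
        }
    }
    line[i] = '\0';                 /* no newline within bounds; partial line */
    *data += i;                     /* still advance, so the caller can't loop forever */
    return (ssize_t)i;
}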
We're working internally a bit on getting XI 5.5 out the door (with Core 4.4.1) so I'll be focusing mostly on that. This change here will be in Core 5.0.0 I bet - or at least something like it.
I was able to get a CGI loaded using the query handler pipe as opposed to the configuration files - but the QH API is not very robust, so there will need to be some endpoints added there.
Also thank you for the feedback!
I'm likely going to start fresh on the 5 branch - as this was started by someone who is no longer contributing, and I think it would be easier in the long run to just focus on some of the major improvements (from where we are currently) we want and then add the rest of them as we go.
More feedback: This just started happening to me today on 4.4.1 on Ubuntu 16.04.5. The Nagios process uses 25% CPU on its own, and every status.cgi execution uses 100% CPU and takes 4 seconds to load. This is on a system with 670 hosts and about 6500 services on a quad-core VM with 14 GB of RAM. Last week, load times were quick enough that you didn't even notice; now Nagios is alerting us constantly for critical CPU, PROC, and zombie processes. I don't know what changed, unless the latest round of updates from Ubuntu changed something, as we've been monitoring this number of hosts and services for months without issue.
Update: Spun up a new server on Ubuntu 18.04 this morning. I imported the same config and am at most hitting a load average of 0.5 on a dual-core VM, down from a load average of 3 to 5 on a quad-core.
Fresh installation of RHEL 7 with Nagios 4.3.2 installed from the repos. status.cgi was taking 7+ seconds to respond when using the web UI.
Tried installing from source as well, with the same results.
In top, status.cgi will reach 100% CPU for a few seconds.
The server is a VM with 4 cores and 8 GB of RAM.
The config files were adapted from a 3.1.2 installation, if that makes a difference.
nagiostats.txt