NagiosEnterprises / nagioscore

status.cgi high CPU load with Nagios Core 4.3.2 on Red Hat 7 #409

Open pattersdan opened 7 years ago

pattersdan commented 7 years ago

Fresh installation of RHEL 7 with Nagios 4.3.2 installed from the repos. status.cgi was taking 7+ seconds to respond in the web UI.

Tried installing from source as well with the same results.

In top, status.cgi will reach 100% CPU for a few seconds.

The server is a VM with 4 cores and 8 GB RAM.

The config files were adapted from a 3.1.2 installation, if that makes a difference.

nagiostats.txt

hedenface commented 7 years ago

Are you able to run an strace on it while it's acting up?

pattersdan commented 7 years ago

What I've been able to grab is attached.

It's mostly

brk(NULL)                               = 0x2e3c000
brk(0x2e5d000)                          = 0x2e5d000
brk(NULL)                               = 0x2e5d000
brk(0x2e7e000)                          = 0x2e7e000
brk(NULL)                               = 0x2e7e000
brk(0x2e9f000)                          = 0x2e9f000
brk(NULL)                               = 0x2e9f000
brk(0x2ec0000)                          = 0x2ec0000
brk(NULL)                               = 0x2ec0000

and

write(1, "der=0 cellspacing=0 cellpadding="..., 4096) = 4096
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3559, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3559, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3559, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3559, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3559, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3559, ...}) = 0

strace.out.txt

hedenface commented 7 years ago

Try running the following command as whichever user the CGIs run as:

export TZ=:/etc/localtime

Please update the thread with the results.
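
For context, here's a minimal C sketch of the glibc behavior behind this suggestion (an illustration, not Nagios code): with TZ unset, glibc may re-stat /etc/localtime on every localtime() call - which matches the stat() storm in the strace above - whereas with TZ set, the zone file is read once by tzset() and cached.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    time_t now = time(NULL);
    struct tm *tm = NULL;
    int i;

    setenv("TZ", ":/etc/localtime", 1);   /* what the export above does */
    tzset();                              /* read the zone file once, up front */

    for (i = 0; i < 100000; i++)
        tm = localtime(&now);             /* under strace: no repeated stat() */

    if (tm != NULL)
        printf("%02d:%02d\n", tm->tm_hour, tm->tm_min);
    return 0;
}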

pattersdan commented 7 years ago

I tried the above command. It didn't make much of a difference time-wise, though the strace looks a bit different. I watched an strace after I captured this one, and the 7 seconds is when all of this is going by:

12832 20:04:26 brk(NULL) = 0x2009000
12832 20:04:26 brk(0x202a000) = 0x202a000
12832 20:04:26 brk(NULL) = 0x202a000
12832 20:04:26 brk(0x204b000) = 0x204b000
12832 20:04:26 brk(NULL) = 0x204b000
12832 20:04:26 brk(0x206c000) = 0x206c000

The page starts to visibly load after that.

I've attached the latest strace, and I appreciate your help!! strace3.txt

hedenface commented 7 years ago

It looks like it's putting an awful lot of stuff in memory before it actually starts doing anything (5 seconds of brk()). This is probably some kind of inefficiency in the code on our part - can I see the output of sysctl -a | grep ^vm and the web config for Core?

hedenface commented 7 years ago

Also - how many hosts/services are we talking about?

pattersdan commented 7 years ago

It's 98 hosts with about 1300 services. I migrated the config files from a Nagios 3.1.2 installation. I had to tweak them a little bit to get them to work under this new installation - is there a possibility that's the issue? webconf.txt

sudo sysctl -a | grep ^vm
sysctl: reading key "net.ipv6.conf.all.stable_secret"
sysctl: reading key "net.ipv6.conf.default.stable_secret"
sysctl: reading key "net.ipv6.conf.eth0.stable_secret"
sysctl: reading key "net.ipv6.conf.lo.stable_secret"
vm.admin_reserve_kbytes = 8192
vm.block_dump = 0
vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 30
vm.dirty_writeback_centisecs = 500
vm.drop_caches = 0
vm.extfrag_threshold = 500
vm.hugepages_treat_as_movable = 0
vm.hugetlb_shm_group = 0
vm.laptop_mode = 0
vm.legacy_va_layout = 0
vm.lowmem_reserve_ratio = 256 256 32
vm.max_map_count = 65530
vm.memory_failure_early_kill = 0
vm.memory_failure_recovery = 1
vm.min_free_kbytes = 67584
vm.min_slab_ratio = 5
vm.min_unmapped_ratio = 1
vm.mmap_min_addr = 4096
vm.mmap_rnd_bits = 28
vm.mmap_rnd_compat_bits = 8
vm.nr_hugepages = 0
vm.nr_hugepages_mempolicy = 0
vm.nr_overcommit_hugepages = 0
vm.nr_pdflush_threads = 0
vm.numa_zonelist_order = default
vm.oom_dump_tasks = 1
vm.oom_kill_allocating_task = 0
vm.overcommit_kbytes = 0
vm.overcommit_memory = 0
vm.overcommit_ratio = 50
vm.page-cluster = 3
vm.panic_on_oom = 0
vm.percpu_pagelist_fraction = 0
vm.stat_interval = 1
vm.swappiness = 30
vm.user_reserve_kbytes = 131072
vm.vfs_cache_pressure = 100
vm.zone_reclaim_mode = 0

rickatnight11 commented 7 years ago

We're observing the exact same behavior with both Nagios 4.1.1 and 4.3.2 on CentOS 6.

knweiss commented 6 years ago

Is this repeatable?

Shouldn't a little instrumentation around the malloc() calls in cgi/status.c give us a clue? You could configure malloc() debugging with mtrace() (see man 3 mtrace for an example). This should tell us where the allocations are coming from.
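
As a concrete sketch of that instrumentation (a standalone demo - wiring mtrace() into cgi/status.c's main() is the assumption here):

#include <mcheck.h>
#include <stdlib.h>

int main(void)
{
    int i;

    mtrace();    /* log each malloc()/free() to the file named in $MALLOC_TRACE */

    for (i = 0; i < 100; i++)
        (void) malloc(32);   /* stand-in for status.cgi's many tiny allocations */

    muntrace();  /* stop tracing before exit */
    return 0;
}

Run it with MALLOC_TRACE=/tmp/status.mtrace set, then post-process the log with the mtrace(1) script to resolve the allocation sites.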

FWIW: The difference between the brk() arguments is 0x21000 (132 KiB). This is the default glibc heap allocation size (including padding).

knweiss commented 6 years ago

I've tried an mtrace of status.cgi myself. Unfortunately, it does not trace strdup()'s memory allocations. What I saw: there are many tiny allocations.

Then I tried valgrind's massif tool and came up with this (use whatever QUERY_STRING suits your setup):

$ REMOTE_USER=nagiosadmin REQUEST_METHOD=GET QUERY_STRING="hostgroup=REDACTED&style=detail&hoststatustypes=3&serviceprops=42&servicestatustypes=28&hostprops=42" valgrind --xtree-memory=full --tool=massif ./status.cgi

$ ms_print massif.out.27107 | tee massif.report.txt

$ callgrind_annotate --auto=yes --inclusive=yes --sort=curB:100,curBk:100,totB:100,totBk:100,totFdB:100,totFdBk:100 xtmemory.kcg.27107 |tee xtmemory.report.txt

Here's the beginning of the generated massif.report.txt:

--------------------------------------------------------------------------------
Command:            ./status.cgi
Massif arguments:   --xtree-memory=full
ms_print arguments: massif.out.27107
--------------------------------------------------------------------------------

    MB
150.3^                                    #
     |                              ::::::#
     |                              ::::: #
     |                              ::::: #                                  :
     |                              ::::: #                         @:::::::::
     |                             :::::: #                    :::::@::::    :
     |                             :::::: #              ::::::::: :@::::    :
     |                             :::::: #         ::::@: ::: ::: :@::::    :
     |                             :::::: #     ::::::::@: ::: ::: :@::::    :
     |                             :::::: #:::::::: ::::@: ::: ::: :@::::    :
     |                             :::::: #:::: ::: ::::@: ::: ::: :@::::    :
     |                         ::@@:::::: #:::: ::: ::::@: ::: ::: :@::::    :
     |                      :::::@ :::::: #:::: ::: ::::@: ::: ::: :@::::    :
     |                    @@:::::@ :::::: #:::: ::: ::::@: ::: ::: :@::::    :
     |                @@:@@ :::::@ :::::: #:::: ::: ::::@: ::: ::: :@::::    :
     |              @@@ :@@ :::::@ :::::: #:::: ::: ::::@: ::: ::: :@::::    :
     |           ::@@@@ :@@ :::::@ :::::: #:::: ::: ::::@: ::: ::: :@::::    :
     |        :@@::@@@@ :@@ :::::@ :::::: #:::: ::: ::::@: ::: ::: :@::::    :
     |     @@@:@ ::@@@@ :@@ :::::@ :::::: #:::: ::: ::::@: ::: ::: :@::::    :
     |  @@:@ @:@ ::@@@@ :@@ :::::@ :::::: #:::: ::: ::::@: ::: ::: :@::::    :
   0 +----------------------------------------------------------------------->Gi
     0                                                                   32.76

Number of snapshots: 58
 Detailed snapshots: [2, 4, 5, 7, 10, 11, 12, 13, 15, 16, 22, 29 (peak), 41, 51]

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
  0              0                0                0             0            0
  1    546,311,263        4,470,808        3,276,914     1,193,894            0
  2  1,403,757,742        9,289,408        6,588,744     2,700,664            0
70.93% (6,588,744B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->23.73% (2,204,425B) 0x4EB9808: strdup (in /usr/lib64/libc-2.17.so)
| ->06.55% (608,133B) in 47 places, all below massif's threshold (1.00%)
| |
| ->06.42% (596,022B) 0x4273C8: xodtemplate_add_object_property (xodtemplate.c:1505)
| | ->06.42% (596,022B) 0x424DF1: xodtemplate_process_config_file (xodtemplate.c:744)
| |   ->06.42% (596,022B) 0x4244CE: xodtemplate_read_config_data (xodtemplate.c:354)
| |     ->06.42% (596,022B) 0x41DB6C: read_object_config_data (objects.c:272)
| |       ->06.42% (596,022B) 0x413F0D: read_all_object_configuration_data (cgiutils.c:584)
| |         ->06.42% (596,022B) 0x4142F5: cgi_init (cgiutils.c:715)
| |           ->06.42% (596,022B) 0x401BDB: main (status.c:213)
| |
| ->05.28% (490,640B) 0x42C554: xodtemplate_add_object_property (xodtemplate.c:2754)
| | ->05.28% (490,640B) 0x424DF1: xodtemplate_process_config_file (xodtemplate.c:744)
| |   ->05.28% (490,640B) 0x4244CE: xodtemplate_read_config_data (xodtemplate.c:354)
| |     ->05.28% (490,640B) 0x41DB6C: read_object_config_data (objects.c:272)
| |       ->05.28% (490,640B) 0x413F0D: read_all_object_configuration_data (cgiutils.c:584)
| |         ->05.28% (490,640B) 0x4142F5: cgi_init (cgiutils.c:715)
| |           ->05.28% (490,640B) 0x401BDB: main (status.c:213)
| |
| ->02.93% (272,608B) 0x4279B7: xodtemplate_add_object_property (xodtemplate.c:1623)
| | ->02.93% (272,608B) 0x424DF1: xodtemplate_process_config_file (xodtemplate.c:744)
| |   ->02.93% (272,608B) 0x4244CE: xodtemplate_read_config_data (xodtemplate.c:354)
| |     ->02.93% (272,608B) 0x41DB6C: read_object_config_data (objects.c:272)
| |       ->02.93% (272,608B) 0x413F0D: read_all_object_configuration_data (cgiutils.c:584)
| |         ->02.93% (272,608B) 0x4142F5: cgi_init (cgiutils.c:715)
| |           ->02.93% (272,608B) 0x401BDB: main (status.c:213)
| |
| ->01.32% (122,439B) 0x42A785: xodtemplate_add_object_property (xodtemplate.c:2340)
| | ->01.32% (122,439B) 0x424DF1: xodtemplate_process_config_file (xodtemplate.c:744)
| |   ->01.32% (122,439B) 0x4244CE: xodtemplate_read_config_data (xodtemplate.c:354)
| |     ->01.32% (122,439B) 0x41DB6C: read_object_config_data (objects.c:272)
| |       ->01.32% (122,439B) 0x413F0D: read_all_object_configuration_data (cgiutils.c:584)
| |         ->01.32% (122,439B) 0x4142F5: cgi_init (cgiutils.c:715)
| |           ->01.32% (122,439B) 0x401BDB: main (status.c:213)
| |
| ->01.23% (114,583B) 0x42A333: xodtemplate_add_object_property (xodtemplate.c:2276)
|   ->01.23% (114,583B) 0x424DF1: xodtemplate_process_config_file (xodtemplate.c:744)
|     ->01.23% (114,583B) 0x4244CE: xodtemplate_read_config_data (xodtemplate.c:354)
|       ->01.23% (114,583B) 0x41DB6C: read_object_config_data (objects.c:272)
|         ->01.23% (114,583B) 0x413F0D: read_all_object_configuration_data (cgiutils.c:584)
|           ->01.23% (114,583B) 0x4142F5: cgi_init (cgiutils.c:715)
|             ->01.23% (114,583B) 0x401BDB: main (status.c:213)
|
->23.48% (2,181,528B) 0x425D4A: xodtemplate_begin_object_definition (xodtemplate.c:1025)
| ->23.48% (2,181,528B) 0x424D3A: xodtemplate_process_config_file (xodtemplate.c:715)
|   ->23.48% (2,181,528B) 0x4244CE: xodtemplate_read_config_data (xodtemplate.c:354)
|     ->23.48% (2,181,528B) 0x41DB6C: read_object_config_data (objects.c:272)
|       ->23.48% (2,181,528B) 0x413F0D: read_all_object_configuration_data (cgiutils.c:584)
|         ->23.48% (2,181,528B) 0x4142F5: cgi_init (cgiutils.c:715)
|           ->23.48% (2,181,528B) 0x401BDB: main (status.c:213)
|
->18.72% (1,738,800B) 0x425B75: xodtemplate_begin_object_definition (xodtemplate.c:1003)
| ->18.72% (1,738,800B) 0x424D3A: xodtemplate_process_config_file (xodtemplate.c:715)
|   ->18.72% (1,738,800B) 0x4244CE: xodtemplate_read_config_data (xodtemplate.c:354)
|     ->18.72% (1,738,800B) 0x41DB6C: read_object_config_data (objects.c:272)
|       ->18.72% (1,738,800B) 0x413F0D: read_all_object_configuration_data (cgiutils.c:584)
|         ->18.72% (1,738,800B) 0x4142F5: cgi_init (cgiutils.c:715)
|           ->18.72% (1,738,800B) 0x401BDB: main (status.c:213)
|
[...]

Here's the beginning of the generated xtmemory.report.txt:

--------------------------------------------------------------------------------
Profile data file 'xtmemory.kcg.27107' (creator: xtree-1)
--------------------------------------------------------------------------------
Profiled target:  ./status.cgi (PID 27107)
Events recorded:  curB curBk totB totBk totFdB totFdBk
Events shown:     curB curBk totB totBk totFdB totFdBk
Event sort order: curB curBk totB totBk totFdB totFdBk
Thresholds:       100 100 100 100 100 100
Include dirs:
User annotated:
Auto-annotation:  on

--------------------------------------------------------------------------------
     curB   curBk        totB      totBk      totFdB    totFdBk
--------------------------------------------------------------------------------
2,611,324 325,271 521,323,496 15,877,344 518,712,172 15,552,073  PROGRAM TOTALS

--------------------------------------------------------------------------------
     curB   curBk        totB      totBk      totFdB    totFdBk  file:function
--------------------------------------------------------------------------------
2,611,324 325,271 521,323,496 15,877,344 518,706,950 15,552,040  status.c:main
2,445,481 324,858 521,117,175 15,875,832 411,212,553 14,293,051  cgiutils.c:cgi_init
2,313,991 324,973  36,326,097  1,454,524           0          0  UnknownFile???:strdup
2,309,982 324,754 325,190,313  9,291,260 255,406,655  8,173,727  cgiutils.c:read_all_object_configuration_data
2,309,982 324,754 325,090,857  9,287,116 255,406,655  8,173,727  ../common/objects.c:read_object_config_data
2,309,982 324,752 325,090,857  9,287,114 255,406,655  8,173,727  ../xdata/xodtemplate.c:xodtemplate_read_config_data
1,130,007 100,673 253,933,329  8,164,617 202,038,877  7,301,747  ../xdata/xodtemplate.c:xodtemplate_process_config_file
1,130,007 100,673  30,598,677    862,900  13,611,923    106,743  ../xdata/xodtemplate.c:xodtemplate_add_object_property
1,055,040 211,008  67,948,121    973,879           0          0  ../xdata/xodtemplate.c:xodtemplate_register_objects
1,006,730 201,346  59,455,774    805,384           0          0  ../xdata/xodtemplate.c:xodtemplate_register_service
1,006,730 201,346  57,039,622    704,711           0          0  ../common/objects.c:add_service
  249,870  26,142     940,750     52,289     272,608          5  ../xdata/xodtemplate.c:xodtemplate_expand_services
  163,007     251     166,772        502       3,765        251  status.c:show_service_detail
  163,007     251     163,007        251           0          0  cgiutils.c:html_encode
  135,365      97 195,797,952  6,580,815 155,677,157  6,115,575  cgiutils.c:read_all_status_data
  131,088       2     131,088          2           0          0  /root/rpmbuild/BUILD/nagios-4.3.4/lib/fanout.c:fanout_create
  131,088       2     131,088          2           0          0  ../common/downtime.c:initialize_downtime_data
  124,935  13,071     878,835     39,228     754,352     13,081  ../xdata/xodtemplate.c:xodtemplate_recombobulate_servicegroups
   48,300   9,660   2,567,620     32,438           0          0  ../xdata/xodtemplate.c:xodtemplate_register_host
   48,300   9,660   2,328,060     19,320           0          0  ../common/objects.c:add_host
    4,277      95 195,666,864  6,580,813 155,677,157  6,115,575  ../xdata/xsddefault.c:xsddefault_read_status_data
    4,277      95 195,666,864  6,580,813 155,677,157  6,115,575  ../common/statusdata.c:read_status_data
    4,277      95       4,277         95           0          0  ../common/downtime.c:add_downtime
    4,277      95       4,277         95           0          0  ../common/downtime.c:add_host_downtime
    2,709     159       6,525        160       3,816          1  ../common/macros.c:init_macros
    2,709     159       2,709        159           0          0  ../common/macros.c:init_macrox_names
      456      19         456         19           0          0  /root/rpmbuild/BUILD/nagios-4.3.4/lib/fanout.c:fanout_add
      456      19         456         19           0          0  ../common/downtime.c:downtime_add
      115       2       6,095         50       2,362         21  status.c:process_cgivars
      104       1       6,084         49       2,259          9  getcgi.c:getcgivars
       94       4     101,933      2,955     101,839      2,951  cgiutils.c:read_main_config_file
       82       3         478         12         371          8  /root/rpmbuild/BUILD/nagios-4.3.4/lib/nspath.c:nspath_normalize
       82       3         478         12         371          8  /root/rpmbuild/BUILD/nagios-4.3.4/lib/nspath.c:nspath_absolute
       82       3         107          4           0          0  /root/rpmbuild/BUILD/nagios-4.3.4/lib/nspath.c:pcomp_construct
       26       1          26          1           0          0  /root/rpmbuild/BUILD/nagios-4.3.4/base/../common/shared.c:init_shared_cfg_vars
       14       2      26,951        801      26,902        798  cgiutils.c:read_cgi_config_file
       12       1      26,914        799      26,902        798  cgiauth.c:get_authentication_information
       12       1         124          4         112          3  /root/rpmbuild/BUILD/nagios-4.3.4/lib/nspath.c:nspath_absolute_dirname
       10       2         335          8           0          0  ../xdata/xodtemplate.c:xodtemplate_register_contact
       10       2         242          4           0          0  ../common/objects.c:add_contact
        0       2           0          2           0          0  ../common/objects.c:post_process_object_config
        0       0 249,404,696  9,692,541           0          0  /root/rpmbuild/BUILD/nagios-4.3.4/base/../common/shared.c:mmap_fgets
        0       0 188,581,920  7,199,534  94,290,960  3,599,767  /root/rpmbuild/BUILD/nagios-4.3.4/base/../common/shared.c:mmap_fgets_multiline
        0       0  34,907,672    106,712           0          0  ../xdata/xodtemplate.c:xodtemplate_begin_object_definition
        0       0  16,173,264    213,424  13,611,232    106,712  /root/rpmbuild/BUILD/nagios-4.3.4/lib/skiplist.c:skiplist_insert
        0       0   3,414,784    106,712           0          0  /root/rpmbuild/BUILD/nagios-4.3.4/lib/dkhash.c:dkhash_insert
        0       0   3,221,536    100,673           0          0  ../common/objects.c:add_service_link_to_host
        0       0   3,037,280         24           0          0  ../common/objects.c:create_object_tables
        0       0   2,564,720    106,734           0          0  /root/rpmbuild/BUILD/nagios-4.3.4/lib/skiplist.c:skiplist_new_node
        0       0   2,532,072    105,503           0          0  ../common/objects.c:add_contactgroup_to_object
        0       0   2,416,152    100,673           0          0  ../common/objects.c:add_contactgroup_to_service
        0       0   2,311,472    109,334   1,455,398     54,667  ../xdata/xodtemplate.c:xodtemplate_recombobulate_hostgroups
        0       0   2,183,584         16           0          0  /root/rpmbuild/BUILD/nagios-4.3.4/lib/dkhash.c:dkhash_create
        0       0   2,148,440    107,422           0          0  ../xdata/xodtemplate.c:xodtemplate_register_hostgroup_members
        0       0   2,148,440    107,422           0          0  ../common/objects.c:add_host_to_hostgroup
        0       0   2,137,056    133,566           0          0  ../common/objects.c:prepend_object_to_objectlist
        0       0   1,068,528     66,783           0          0  ../common/objects.c:add_object_to_objectlist
        0       0   1,068,528     66,783           0          0  ../xdata/xodtemplate.c:_xodtemplate_add_group_member
        0       0   1,055,197     28,595           0          0  ../common/comments.c:add_comment
[...]

FWIW, I also tried a perf record / perf report of a status.cgi call from the shell. Here's the report to give an idea of where the CPU time is spent - note, however, that this does not show the memory allocations:

  Children      Self  Command     Shared Object      Symbol
+   90.91%     0.00%  status.cgi  libc-2.17.so       [.] __libc_start_main
+   90.91%     0.00%  status.cgi  status.cgi         [.] main
+   73.87%     0.00%  status.cgi  status.cgi         [.] cgi_init
+   39.41%     0.06%  status.cgi  status.cgi         [.] read_all_object_configuration_data
+   39.16%     0.00%  status.cgi  status.cgi         [.] read_object_config_data
+   39.16%     0.00%  status.cgi  status.cgi         [.] xodtemplate_read_config_data
+   34.46%     0.00%  status.cgi  status.cgi         [.] read_all_status_data
+   34.46%     0.00%  status.cgi  status.cgi         [.] read_status_data
-   24.44%    24.44%  status.cgi  libc-2.17.so       [.] __strcmp_sse42
     __libc_start_main
   - main
      - 23.93% cgi_init
         - 12.59% read_all_object_configuration_data
              read_object_config_data
            - xodtemplate_read_config_data
               - 7.02% xodtemplate_process_config_file
                    6.58% __strcmp_sse42
               - 5.26% xodtemplate_register_objects
                  - 3.45% xodtemplate_register_servicegroup_members
                       __strcmp_sse42
                  - 1.69% xodtemplate_register_hostgroup_members
                       __strcmp_sse42
         - 11.34% read_all_status_data
            - read_status_data
                 9.34% __strcmp_sse42
               - 2.01% xsddefault_read_status_data
                  - 1.94% add_service_status
                       add_servicestatus_to_hashlist
                       __strcmp_sse42
+   22.87%     2.01%  status.cgi  status.cgi         [.] xodtemplate_process_config_file
+   18.98%     1.88%  status.cgi  status.cgi         [.] xsddefault_read_status_data
-   14.35%    14.35%  status.cgi  status.cgi         [.] is_host_member_of_hostgroup
     __libc_start_main
   - main
      - 13.22% show_service_status_totals
           is_host_member_of_hostgroup
      - 0.63% show_host_status_totals
           is_host_member_of_hostgroup
+   14.04%     0.13%  status.cgi  status.cgi         [.] show_service_status_totals
+   11.34%    10.28%  status.cgi  status.cgi         [.] mmap_fgets
+    8.02%     0.00%  status.cgi  status.cgi         [.] xodtemplate_register_objects
+    5.83%     5.83%  status.cgi  status.cgi         [.] add_object_to_objectlist
+    5.58%     5.58%  status.cgi  status.cgi         [.] strip
+    5.26%     3.70%  status.cgi  status.cgi         [.] xodtemplate_add_object_property
+    5.20%     5.20%  status.cgi  libc-2.17.so       [.] strtok
+    5.08%     0.63%  status.cgi  status.cgi         [.] mmap_fgets_multiline
+    5.08%     0.13%  status.cgi  status.cgi         [.] xodtemplate_register_servicegroup_members
+    4.76%     2.44%  status.cgi  status.cgi         [.] add_servicestatus_to_hashlist

What I've noticed so far:

   .     .           .         .          .         .   /* calculate length of line we just read */
   .     .           .         .          .         .   len = (int)(x - temp_mmapfile->current_position);
   .     .           .         .          .         .
   .     .           .         .          .         .   /* allocate memory for the new line */
   0     0 249,404,696 9,692,541          0         0   if((buf = (char *)malloc(len + 1)) == NULL)
   .     .           .         .          .         .           return NULL;
   .     .           .         .          .         .
   .     .           .         .          .         .   /* copy string to newly allocated memory and terminate the string */
   .     .           .         .          .         .   memcpy(buf,
   .     .           .         .          .         .          ((char *)(temp_mmapfile->mmap_buf) +
   .     .           .         .          .         .           temp_mmapfile->current_position), len);
   .     .           .         .          .         .   buf[len] = '\x0';

Thoughts? Take a look at the entire reports for much more detailed information:

massif.report.txt xtmemory.report.txt

knweiss commented 6 years ago

BTW: I have a Nagios instance with an 87 MB objects.cache file. Reading and processing this cache file alone (during each and every CGI startup!) takes 3.6s.

knweiss commented 6 years ago

FWIW: My nagios-4.3.4 debug build of status.cgi (CFLAGS with -O0 -ggdb3) on CentOS 7.5 takes 8.3s:

# prepare env vars (mentioned above)
$ time perf record -g -- ./cgi/status.cgi
real    0m8.361s
user    0m8.069s
sys     0m0.285s

I've also created a flame graph to visualize where the CPU time is spent:

$ perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl --title "status.cgi" --width 2500 --height 24 > perf.svg
$ convert perf.svg perf.png

Here's the result: perf.png. Better version: perf.svg.gz (load the SVG in your browser to zoom in interactively.)

How to read a CPU flame graph

knweiss commented 6 years ago

I wonder if a lexer that returns (integer) tokens would help get rid of all these slow strcmp()s in the current "manual parser" (where a major part of the CPU time is spent). Unfortunately, this would be a major change.
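
Something like this rough sketch (the keyword set here is illustrative, not Nagios's actual property list):

#include <stdlib.h>
#include <string.h>

enum token { TOK_UNKNOWN = 0, TOK_ADDRESS, TOK_ALIAS, TOK_HOST_NAME, TOK_USE };

struct keyword {
    const char *name;
    enum token tok;
};

/* must stay sorted by name for bsearch() */
static const struct keyword keywords[] = {
    { "address",   TOK_ADDRESS   },
    { "alias",     TOK_ALIAS     },
    { "host_name", TOK_HOST_NAME },
    { "use",       TOK_USE       },
};

static int cmp_keyword(const void *key, const void *elem)
{
    return strcmp((const char *)key, ((const struct keyword *)elem)->name);
}

enum token lex_directive(const char *word)
{
    const struct keyword *k = bsearch(word, keywords,
            sizeof(keywords) / sizeof(keywords[0]),
            sizeof(keywords[0]), cmp_keyword);
    return k ? k->tok : TOK_UNKNOWN;
}

The parser would then switch on the integer token once per directive (a single O(log n) lookup) instead of walking a chain of strcmp()s.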

hedenface commented 6 years ago

The major improvement for Core 5 is going to be having a database backend (e.g. merging NDO into Core). Once that happens, the plan is to switch all the CGIs to use that backend.

Even in XI these problems exist, because most of XI's reports simply use the CGIs and parse that data - so there's yet another layer on top.

The development cycle is going to look something like this (for me) in order:

  1. Do some quick patches on NRPE and get a 3.2.2 or 3.3 out, depending on the size of the patching.
  2. Work on and release Plugins 2.3.
  3. Finish up the work I've been doing for NDO3 to bring true multithreading.
  4. Start work on Core5 functionality. The main things we're looking at adding are native remote workers (this functionality exists now, but executing it is somewhat cryptic) and database config/object/archive data.

That said - Core5 will likely be a year of work, I bet (and nothing ever goes as smoothly as the planning). So, if there are some quick hacks we can do to get the CGIs working better for everyone, I'm all for that.

I'll take a look at your reports. I also plan on looking through the duplicate tokens and breaking whatever I can out into functions.

hedenface commented 6 years ago

I've got a prototype for the mmap_getline family of functions. Accepting critique:


int get_line(char ** data, unsigned long data_len, char * line)
{
    unsigned long i = 0;    /* unsigned, to match data_len in the loop test */

    if (data == NULL || *data == NULL) {
        line[0] = 0;
        return -1;
    }

    if ((*data)[0] == 0) {
        line[0] = 0;
        return -2;
    }

    /* data_len is unsigned, so "<= 0" can only mean "== 0" */
    if (data_len == 0UL) {
        data_len = (unsigned long) strlen(*data);

        /* make sure it really isn't 0 */
        if (data_len == 0UL) {
            line[0] = 0;
            return -3;
        }
    }

    for (i = 0; i < data_len; i++) {

        line[i] = (*data)[i];

        /* looking for newlines */
        if ((*data)[i] == '\n') {

            (*data) += i + 1;
            line[i] = 0;
            return (int) i;
        }
    }

    /* no trailing newline: terminate the copy and consume the rest,
     * so the caller still sees the final line */
    line[i] = 0;
    (*data) += i;
    return (int) i;
}

It forces the caller to keep track of the original pointer behind **data, but this is trivial. It also forces whatever array goes into *line to have a large enough buffer - but this is trivial too. Most importantly, it removes the many, many malloc()s that happen along the way.
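
To make the calling convention concrete, here's a hypothetical caller sketch (parse_buffer and the 16K buffer size are illustrative assumptions, not part of the prototype):

#include <stdio.h>

int get_line(char **data, unsigned long data_len, char *line);  /* prototype above */

static void parse_buffer(char *buf)
{
    char *cursor = buf;    /* get_line() advances this; buf keeps the original pointer */
    char line[16384];      /* caller supplies a buffer large enough for any line */
    int len;

    /* passing 0 lets get_line() compute the remaining length itself;
     * a real caller would track the remaining bytes to avoid re-scanning */
    while ((len = get_line(&cursor, 0UL, line)) >= 0)
        printf("line (%d bytes): %s\n", len, line);   /* no per-line malloc() */
}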

knweiss commented 6 years ago

@hedenface Sorry for the delay. I'm very busy at the moment. Some comments:

hedenface commented 6 years ago

This was just a prototype, hence the missing \r, \t, etc. handling.

I think setting it to a 16/32K default and making that configurable would be a nice option. That way there's no need for recompilation.
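
A hypothetical sketch of that knob (the "max_line_length" key and these function names are assumptions, not existing cgi.cfg options):

#include <stdlib.h>
#include <string.h>

static unsigned long line_buf_len = 16384UL;   /* 16K default, overridable at runtime */

/* called from the cgi.cfg parser for each key/value pair */
void read_line_length_option(const char *var, const char *val)
{
    if (strcmp(var, "max_line_length") == 0) {
        unsigned long len = strtoul(val, NULL, 10);
        if (len > 0UL)
            line_buf_len = len;
    }
}

/* one allocation per CGI run instead of one per line */
char *alloc_line_buffer(void)
{
    return malloc(line_buf_len + 1);
}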

We're working internally a bit on getting XI 5.5 out the door (with Core 4.4.1), so I'll be focusing mostly on that. This change will be in Core 5.0.0, I bet - or at least something like it will.

I was able to get a CGI loaded using the query handler pipe as opposed to the configuration files - but the QH API is not very robust, so there will need to be some endpoints added there.

hedenface commented 6 years ago

Also thank you for the feedback!

I'm likely going to start fresh on the 5 branch, as it was started by someone who is no longer contributing, and I think it will be easier in the long run to just focus on some of the major improvements we want (from where we are currently) and then add the rest as we go.

aaronpollock commented 6 years ago

More feedback: this just started happening to me today on 4.4.1 on Ubuntu 16.04.5. The Nagios process uses 25% CPU on its own, and every status.cgi execution uses 100% CPU and takes 4 seconds to load. This is on a system with 670 hosts and about 6500 services on a quad-core VM with 14 GB of RAM. Last week, load times were quick enough that you didn't even notice; now Nagios is alerting us constantly for critical CPU, PROC, and zombie processes. I don't know what changed, unless the latest round of updates from Ubuntu changed something - we've been monitoring this number of hosts and services for months without issue.

Update: Spun up a new server on Ubuntu 18.04 this morning. Imported the same config, and I'm at most hitting a load average of 0.5 on a dual-core VM, down from a load average of 3 to 5 on a quad core.