Wrong message order in ndo2db

Coservit commented 8 years ago

Hello,

We've upgraded our Nagios from 3.5 (with ndoutils 1.4b7) to Nagios 4.2.1 with the latest ndoutils code (2.1.1). We use it in a disruption-critical environment.

We encountered a very strange problem where ndo2db suddenly mixes up messages coming from ndomod.o running on our Nagios box. We started to investigate the problem after we noticed ndo2db: mysql_error messages in syslog.

For example, one of the errors:

ndo2db: Error: mysql_query() failed for 'INSERT INTO nagios_servicestatus SET instance_id='102', service_object_id='60810', status_update_time=FROM_UNIXTIME(1478252888), output='DISK IO: Disk 0 C G D read : 0 Mo/s, write : 0 Mo/s : Disk 1 E read : 0 Mo/s, write : 0 Mo/s :', long_output='', perfdata='0_C_G_D_read=0Mo/s 0_C_G_D_write=0Mo/s 1_E_read=0Mo/s 1_E_write=0Mo/s', current_state='0', has_been_checked='1', should_be_scheduled='0', current_check_attempt='1', max_check_attempts='1', last_check=FROM_UNIXTIME(147825274178860), next_check=FROM_UNIXTIME(0), check_type='0', last_state_change=FROM_UNIXTIME(0), last_hard_state_change=FROM_UNIXTIME(0), last_hard_state='0', last_time_ok=FROM_UNIXTIME(0), last_time_warning=FROM_UNIXTIME(0), last_time_unknown=FROM_UNIXTIME(1477551642), last_time_critical=FROM_UNIXTIME(1476991957), state_type='1', last_notification=FROM_UNIXTIME(0), next_notification=FROM_UNIXTIME(0), no_more_notifications='0', notifications_enabled='0', problem_has_been_acknowledged='0', acknowledgement_type='0', current_notification_number='0', passive_checks_enabled='0', active_checks_enabled='1', event_handler_enabled='1', flap_detection_enabled='1', is_flapping='0', percent_state_change='0.000000', latency='0.000000', execution_time='0.452000', scheduled_downtime_depth='0', failure_prediction_enabled='0', process_performance_data='1', obsess_over_service='1', modified_service_attributes='0', event_handler='', check_command='check_traffic!coservit!65541!2000!3000!k!2!!!', normal_check_interval='5.000000', retry_check_interval='1.000000', check_timeperiod_object_id='50192' ON DUPLICATE KEY UPDATE instance_id='102', service_object_id='60810', status_update_time=FROM_UNIXTIME(1478252888), output='DISK IO: Disk 0 C G D read : 0 Mo/s, write : 0 Mo/s : Disk 1 E read : 0 Mo/s, write : 0 Mo/s :', long_output='', perfdata='0_C_G_D_read=0Mo/s 0_C_G_D_write=0Mo/s 1_E_read=0Mo/s 1_E_write=0Mo/s', current_state='0', has_been_checked='1', should_be_scheduled='0', 
current_check_attempt='1', max_ch

ndo2db: mysql_error: 'Column 'last_check' cannot be null'

As you can see, it fails because 147825274178860 is an invalid timestamp: the digits 178860 were inadvertently appended to it.

We have complete ndo2db logs covering this problem, and this is what we see there (extract around the problematic line only):

[1478252888.680710] [001.2] [pid=30720] Queue Message: assive_service_freshness
86=5.00000
109=1.00000
209=1
262=IDSERVICE:0:11723
262=DISPLAYNAME:0:A-VSUpdateAgent-Status
262=HOSTALIAS:0:Pc-Philippe-Agent
262=DOCUMENTATION:0:
262=INSTRUCTIONS:0:
999

213:
1=1202
2=0
3=0
4=1478252888.270383
53=Wpe6yZ5gdDEAifhwCRJQxNnkG
114=qrD2e6kRNnoT4gsQXH19PmcBZ
95=DISK IO: Disk 0 C G D read : 0 Mo/s, write : 0 Mo/s : Disk 1 E read : 0 Mo/s, write : 0 Mo/s :
125=
99=0_C_G_D_read=0Mo/s 0_C_G_D_write=0Mo/s 1_E_read=0Mo/s 1_E_write=0Mo/s
27=0
51=1
115=0
25=1
76=1
61=147825274
[1478252888.680717] [001.2] [pid=30720] Handling: 11=process_passive_service_freshness
[1478252888.680734] [001.2] [pid=30720] Handling: 86=5.00000
[1478252888.680752] [001.2] [pid=30720] Handling: 109=1.00000
[1478252888.680769] [001.2] [pid=30720] Handling: 209=1
[1478252888.680786] [001.2] [pid=30720] Handling: 262=IDSERVICE:0:11723
[1478252888.680804] [001.2] [pid=30720] Handling: 262=DISPLAYNAME:0:A-VSUpdateAgent-Status
[1478252888.680821] [001.2] [pid=30720] Handling: 262=HOSTALIAS:0:Pc-Philippe-Agent
[1478252888.680839] [001.2] [pid=30720] Handling: 262=DOCUMENTATION:0:
[1478252888.680857] [001.2] [pid=30720] Handling: 262=INSTRUCTIONS:0:
[1478252888.680874] [001.2] [pid=30720] Handling: 999
[1478252888.697348] [001.2] [pid=30720] Handling: 
[1478252888.697384] [001.2] [pid=30720] Handling: 
[1478252888.697401] [001.2] [pid=30720] Handling: 213:
[1478252888.697421] [001.2] [pid=30720] Handling: 1=1202
[1478252888.697439] [001.2] [pid=30720] Handling: 2=0
[1478252888.697457] [001.2] [pid=30720] Handling: 3=0
[1478252888.697474] [001.2] [pid=30720] Handling: 4=1478252888.270383
[1478252888.697491] [001.2] [pid=30720] Handling: 53=Wpe6yZ5gdDEAifhwCRJQxNnkG
[1478252888.697530] [001.2] [pid=30720] Handling: 114=qrD2e6kRNnoT4gsQXH19PmcBZ
[1478252888.697548] [001.2] [pid=30720] Handling: 95=DISK IO: Disk 0 C G D read : 0 Mo/s, write : 0 Mo/s : Disk 1 E read : 0 Mo/s, write : 0 Mo/s :
[1478252888.697565] [001.2] [pid=30720] Handling: 125=
[1478252888.697583] [001.2] [pid=30720] Handling: 99=0_C_G_D_read=0Mo/s 0_C_G_D_write=0Mo/s 1_E_read=0Mo/s 1_E_write=0Mo/s
[1478252888.697600] [001.2] [pid=30720] Handling: 27=0
[1478252888.697617] [001.2] [pid=30720] Handling: 51=1
[1478252888.697634] [001.2] [pid=30720] Handling: 115=0
[1478252888.697651] [001.2] [pid=30720] Handling: 25=1
[1478252888.697668] [001.2] [pid=30720] Handling: 76=1
[1478252888.697691] [001.2] [pid=30720] Queue Message: 178860
67=1477551642
64=1476991957
121=1
62=0
84=0
85=0
88=0
101=0
7=0
26=0
97=0
38=1
9=1
47=1
54=0
98=0.00000
71=0.00000
42=0.23100
113=0
45=0
103=1
93=1
80=0
37=
11=check_snmp_win_swap!coservit!85!95
86=5.00000
109=1.00000
209=1
262=IDSERVICE:0:11700
262=DISPLAYNAME:0:MS-WIN-SWAP
262=HOSTALIAS:0:PC-Philippe
262=DOCUMENTATION:0:

[1478252888.697698] [001.2] [pid=30720] Handling: 61=147825274178860
[1478252888.697715] [001.2] [pid=30720] Handling: 67=1477551642
[1478252888.697733] [001.2] [pid=30720] Handling: 64=1476991957
[1478252888.697750] [001.2] [pid=30720] Handling: 121=1
[1478252888.697767] [001.2] [pid=30720] Handling: 62=0
[1478252888.697784] [001.2] [pid=30720] Handling: 84=0
[1478252888.697801] [001.2] [pid=30720] Handling: 85=0
[1478252888.697818] [001.2] [pid=30720] Handling: 88=0
[1478252888.697835] [001.2] [pid=30720] Handling: 101=0
[1478252888.697852] [001.2] [pid=30720] Handling: 7=0
[1478252888.697869] [001.2] [pid=30720] Handling: 26=0
[1478252888.697886] [001.2] [pid=30720] Handling: 97=0
[1478252888.697903] [001.2] [pid=30720] Handling: 38=1
[1478252888.697920] [001.2] [pid=30720] Handling: 9=1
[1478252888.697937] [001.2] [pid=30720] Handling: 47=1
[1478252888.697954] [001.2] [pid=30720] Handling: 54=0
[1478252888.697971] [001.2] [pid=30720] Handling: 98=0.00000
[1478252888.697988] [001.2] [pid=30720] Handling: 71=0.00000
[1478252888.698005] [001.2] [pid=30720] Handling: 42=0.23100
[1478252888.698022] [001.2] [pid=30720] Handling: 113=0
[1478252888.698039] [001.2] [pid=30720] Handling: 45=0
[1478252888.698057] [001.2] [pid=30720] Handling: 103=1
[1478252888.698074] [001.2] [pid=30720] Handling: 93=1
[1478252888.698092] [001.2] [pid=30720] Handling: 80=0
[1478252888.698109] [001.2] [pid=30720] Handling: 37=
[1478252888.698126] [001.2] [pid=30720] Handling: 11=check_snmp_win_swap!coservit!85!95
[1478252888.698143] [001.2] [pid=30720] Handling: 86=5.00000
[1478252888.698160] [001.2] [pid=30720] Handling: 109=1.00000
[1478252888.698177] [001.2] [pid=30720] Handling: 209=1
[1478252888.698194] [001.2] [pid=30720] Handling: 262=IDSERVICE:0:11700
[1478252888.698229] [001.2] [pid=30720] Handling: 262=DISPLAYNAME:0:MS-WIN-SWAP
[1478252888.698250] [001.2] [pid=30720] Handling: 262=HOSTALIAS:0:PC-Philippe
[1478252888.698268] [001.2] [pid=30720] Handling: 262=DOCUMENTATION:0:
[1478252888.698292] [001.2] [pid=30720] Queue Message: 42=0.45200
113=0
45=0
103=1
93=1
80=0
37=
11=check_traffic!coservit!65541!2000!3000!k!2!!!
86=5.00000
109=1.00000
209=1
262=IDSERVICE:0:11703
262=DISPLAYNAME:0:Network_traffic
262=HOSTALIAS:0:PC-Philippe
262=DOCUMENTATION:0:
262=INSTRUCTIONS:0:
999

212:
1=1201
2=0
3=0
4=1478252886.959403
53=nUtQmrPuAKVkdN9q7b8L1Dvse
95=OK - 192.168.238.8: rta 0,393ms, lost 0%
125=
99=rta=0,393ms;1501,000;2000,000;0; pl=0%;40;80;; rtmax=0,577ms;;;; rtmin=0,329ms;;;;
27=0
51=1
115=1
25=1
76=1
58=1478252886
81=1478252946
12=0
63=146650003
[1478252886.972369] [001.2] [pid=30720] Handling: 109=1.00000
[1478252886.972389] [001.2] [pid=30720] Handling: 162=107
[1478252886.972403] [001.2] [pid=30720] Handling: 262=COMPANYNAME:0:SAMIR
[1478252886.972417] [001.2] [pid=30720] Handling: 262=IDHOST:0:1649
[1478252886.972437] [001.2] [pid=30720] Handling: 262=HOST_CATEGORY_NAME:0:Serveur Windows
[1478252886.972457] [001.2] [pid=30720] Handling: 262=HOST_TAGS:0:Hors Contrat,Test_rapport_sig_admin_1,Test Karen 3
[1478252886.972477] [001.2] [pid=30720] Handling: 262=DOCUMENTATION:0:
[1478252886.972497] [001.2] [pid=30720] Handling: 262=INSTRUCTIONS:0:
[1478252886.972517] [001.2] [pid=30720] Handling: 262=DISPLAYNAME:0:Ping
[1478252886.972537] [001.2] [pid=30720] Handling: 999
[1478252886.981367] [001.2] [pid=30720] Handling: 
[1478252886.981400] [001.2] [pid=30720] Handling: 
[1478252886.981418] [001.2] [pid=30720] Handling: 212:
[1478252886.981438] [001.2] [pid=30720] Handling: 1=1201
[1478252886.981456] [001.2] [pid=30720] Handling: 2=0
[1478252886.981473] [001.2] [pid=30720] Handling: 3=0
[1478252886.981490] [001.2] [pid=30720] Handling: 4=1478252886.959403
[1478252886.981508] [001.2] [pid=30720] Handling: 53=nUtQmrPuAKVkdN9q7b8L1Dvse
[1478252886.981526] [001.2] [pid=30720] Handling: 95=OK - 192.168.238.8: rta 0,393ms, lost 0%
[1478252886.981544] [001.2] [pid=30720] Handling: 125=
[1478252886.981561] [001.2] [pid=30720] Handling: 99=rta=0,393ms;1501,000;2000,000;0; pl=0%;40;80;; rtmax=0,577ms;;;; rtmin=0,329ms;;;;
[1478252886.981579] [001.2] [pid=30720] Handling: 27=0
[1478252886.981596] [001.2] [pid=30720] Handling: 51=1
[1478252886.981613] [001.2] [pid=30720] Handling: 115=1
[1478252886.981630] [001.2] [pid=30720] Handling: 25=1
[1478252886.981647] [001.2] [pid=30720] Handling: 76=1
[1478252886.981664] [001.2] [pid=30720] Handling: 58=1478252886
[1478252886.981681] [001.2] [pid=30720] Handling: 81=1478252946
[1478252886.981698] [001.2] [pid=30720] Handling: 12=0
[1478252886.981719] [001.2] [pid=30720] Queue Message: 6
57=1466500036
56=0
69=1478252886
65=0
68=0
121=1
59=0
82=0
85=0
88=0
101=0
7=0
26=0
96=0
38=1
8=1
47=1
54=0
98=0.00000
71=0.00000
42=0.00294
113=0
45=0
103=1
91=1
78=0
37=
11=check_host_alive_icmp!1501!40!2000!80!4!56
86=1.00000
109=1.00000
162=107
262=COMPANYNAME:0:SAMIR
262=IDHOST:0:1649
262=HOST_CATEGORY_NAME:0:Serveur Window
[1478252886.981727] [001.2] [pid=30720] Handling: 63=1466500036
[1478252886.981744] [001.2] [pid=30720] Handling: 57=1466500036
[1478252886.981761] [001.2] [pid=30720] Handling: 56=0
[1478252886.981778] [001.2] [pid=30720] Handling: 69=1478252886
[1478252886.981795] [001.2] [pid=30720] Handling: 65=0
[1478252886.981812] [001.2] [pid=30720] Handling: 68=0
[1478252886.981829] [001.2] [pid=30720] Handling: 121=1
[1478252886.981847] [001.2] [pid=30720] Handling: 59=0
[1478252886.981864] [001.2] [pid=30720] Handling: 82=0
[1478252886.981881] [001.2] [pid=30720] Handling: 85=0
[1478252886.981898] [001.2] [pid=30720] Handling: 88=0
[1478252886.981915] [001.2] [pid=30720] Handling: 101=0
[1478252886.981949] [001.2] [pid=30720] Handling: 7=0
[1478252886.981966] [001.2] [pid=30720] Handling: 26=0
[1478252886.981984] [001.2] [pid=30720] Handling: 96=0
[1478252886.982001] [001.2] [pid=30720] Handling: 38=1
[1478252886.982018] [001.2] [pid=30720] Handling: 8=1
[1478252886.982035] [001.2] [pid=30720] Handling: 47=1
[1478252886.982052] [001.2] [pid=30720] Handling: 54=0
[1478252886.982068] [001.2] [pid=30720] Handling: 98=0.00000
[1478252886.982085] [001.2] [pid=30720] Handling: 71=0.00000
[1478252886.982102] [001.2] [pid=30720] Handling: 42=0.00294
[1478252886.982119] [001.2] [pid=30720] Handling: 113=0
[1478252886.982136] [001.2] [pid=30720] Handling: 45=0
[1478252886.982153] [001.2] [pid=30720] Handling: 103=1
[1478252886.982170] [001.2] [pid=30720] Handling: 91=1
[1478252886.982187] [001.2] [pid=30720] Handling: 78=0
[1478252886.982203] [001.2] [pid=30720] Handling: 37=
[1478252886.982247] [001.2] [pid=30720] Handling: 11=check_host_alive_icmp!1501!40!2000!80!4!56
[1478252886.982267] [001.2] [pid=30720] Handling: 86=1.00000
[1478252886.982284] [001.2] [pid=30720] Handling: 109=1.00000
[1478252886.982301] [001.2] [pid=30720] Handling: 162=107
[1478252886.982313] [001.2] [pid=30720] Handling: 262=COMPANYNAME:0:SAMIR
[1478252886.982325] [001.2] [pid=30720] Handling: 262=IDHOST:0:1649
[1478252886.985556] [001.2] [pid=30720] Queue Message: s
262=HOST_TAGS:0:Hors Contrat,Test_rapport_sig_admin_1,Test Karen 3
262=DOCUMENTATION:0:
262=INSTRUCTIONS:0:
262=DISPLAYNAME:0:Ping
999

The problematic line is:

[1478252888.697698] [001.2] [pid=30720] Handling: 61=147825274178860

Nearly all packets around this line are mixed up.
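To illustrate how a value like this can arise, here is a minimal sketch of a line reassembler that appends each queue message to a pending buffer and extracts complete lines. It is not the ndo2db implementation; the names line_buf, feed and pop_line are hypothetical and exist only for this example. In the extract above, one queue message ends with the partial line "61=147825274" and the next one begins with "178860", so a reassembler of this kind would glue the two fragments together.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified line reassembler (NOT the ndo2db code):
 * message chunks are appended to a pending buffer, and complete
 * lines are popped from the front of that buffer. */
struct line_buf {
    char pending[256];   /* must start out zero-filled */
};

/* Append one message chunk; excess input is dropped if the buffer is full. */
void feed(struct line_buf *lb, const char *chunk) {
    size_t used = strlen(lb->pending);
    size_t room = sizeof(lb->pending) - used - 1;
    strncat(lb->pending, chunk, room);
}

/* Copy the first complete line (without the newline) into out.
 * Returns 1 if a line was available, 0 if only a partial line is buffered. */
int pop_line(struct line_buf *lb, char *out, size_t out_len) {
    char *nl = strchr(lb->pending, '\n');
    if (nl == NULL)
        return 0;

    size_t len = (size_t)(nl - lb->pending);
    if (len >= out_len)
        len = out_len - 1;
    memcpy(out, lb->pending, len);
    out[len] = '\0';

    /* Shift the remainder (including its terminator) to the front. */
    memmove(lb->pending, nl + 1, strlen(nl + 1) + 1);
    return 1;
}
```

Feeding this sketch "76=1\n61=147825274" followed by "178860\n67=1477551642\n" yields the lines "76=1" and then "61=147825274178860", which matches the corrupted value in the log. The second fragment belongs to a different message, so the splice itself is the symptom, not the cause.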

I investigated the whole file and everything is fine until the following exchange:

[1478252886.985577] [001.2] [pid=30720] Handling: 262=HOST_CATEGORY_NAME:0:Serveur Windows
[1478252886.985602] [001.2] [pid=30720] Handling: 262=HOST_TAGS:0:Hors Contrat,Test_rapport_sig_admin_1,Test Karen 3
[1478252886.985623] [001.2] [pid=30720] Handling: 262=DOCUMENTATION:0:
[1478252886.985643] [001.2] [pid=30720] Handling: 262=INSTRUCTIONS:0:
[1478252886.985663] [001.2] [pid=30720] Handling: 262=DISPLAYNAME:0:Ping
[1478252886.985683] [001.2] [pid=30720] Handling: 999
[1478252886.993857] [001.2] [pid=30720] Handling: 
[1478252887.662883] [001.2] [pid=30720] Queue Message: 

HELLO
PROTOCOL: 2
AGENT: NDOMOD
AGENTVERSION: 2.1.1
STARTTIME: 1478252887
DISPOSITION: REALTIME
CONNECTION: TCPSOCKET
CONNECTTYPE: INITIAL
INSTANCENAME: 55
STARTDATADUMP

200:
1=104
2=0
3=0
4=1478252887.636646
105=Nagios
107=4.2.1
104=09-06-2016
102=16032
999

414:
269=408
1=MS-MSSQL-database-free-dbname
2=a-check-VSUpdateAgent-Status
3=a-check-win-cpu
4=a-check-win-diskIO
5=a-check-win-diskspace
6=a-check-win-eventlog
7=a-check-win-ram
8=a-check-win-service-not-started
9=a-check-win-swap
10=a-check-w
[1478252887.662912] [001.2] [pid=30720] Handling: 
[1478252887.662940] [001.2] [pid=30720] Handling: 
[1478252887.662959] [001.2] [pid=30720] Handling: HELLO
[1478252887.662987] [001.2] [pid=30720] Handling: PROTOCOL: 2
[1478252887.663010] [001.2] [pid=30720] Handling: AGENT: NDOMOD
[1478252887.663032] [001.2] [pid=30720] Handling: AGENTVERSION: 2.1.1
[1478252887.663052] [001.2] [pid=30720] Handling: STARTTIME: 1478252887
[1478252887.663071] [001.2] [pid=30720] Handling: DISPOSITION: REALTIME
[1478252887.663089] [001.2] [pid=30720] Handling: CONNECTION: TCPSOCKET
[1478252887.663108] [001.2] [pid=30720] Handling: CONNECTTYPE: INITIAL
[1478252887.663126] [001.2] [pid=30720] Handling: INSTANCENAME: 55
[1478252887.663145] [001.2] [pid=30720] Handling: STARTDATADUMP
[1478252887.663163] [001.2] [pid=30720] Handling: 
[1478252887.663180] [001.2] [pid=30720] Handling: 
[1478252887.663198] [001.2] [pid=30720] Handling: 
[1478252887.663214] [001.2] [pid=30720] Handling: 200:
[1478252887.663232] [001.2] [pid=30720] Handling: 1=104
[1478252887.663251] [001.2] [pid=30720] Handling: 2=0
[1478252887.663268] [001.2] [pid=30720] Handling: 3=0
[1478252887.663286] [001.2] [pid=30720] Handling: 4=1478252887.636646
[1478252887.663303] [001.2] [pid=30720] Handling: 105=Nagios
[1478252887.663321] [001.2] [pid=30720] Handling: 107=4.2.1
[1478252887.663339] [001.2] [pid=30720] Handling: 104=09-06-2016
[1478252887.663362] [001.2] [pid=30720] Handling: 102=16032
[1478252887.663397] [001.2] [pid=30720] Handling: 999
[1478252888.136390] [001.2] [pid=30720] Handling: 
[1478252888.136428] [001.2] [pid=30720] Handling: 
[1478252888.136446] [001.2] [pid=30720] Handling: 414:
[1478252888.136469] [001.2] [pid=30720] Handling: 269=408
[1478252888.136482] [001.2] [pid=30720] Handling: 1=MS-MSSQL-database-free-dbname
[1478252888.136494] [001.2] [pid=30720] Handling: 2=a-check-VSUpdateAgent-Status
[1478252888.136512] [001.2] [pid=30720] Handling: 3=a-check-win-cpu
[1478252888.136529] [001.2] [pid=30720] Handling: 4=a-check-win-diskIO
[1478252888.136547] [001.2] [pid=30720] Handling: 5=a-check-win-diskspace
[1478252888.136559] [001.2] [pid=30720] Handling: 6=a-check-win-eventlog
[1478252888.136577] [001.2] [pid=30720] Handling: 7=a-check-win-ram
[1478252888.136595] [001.2] [pid=30720] Handling: 8=a-check-win-service-not-started
[1478252888.136612] [001.2] [pid=30720] Handling: 9=a-check-win-swap
[1478252888.136635] [001.2] [pid=30720] Queue Message: =0
212=1
9=1
206=1
205=1
225=0
93=1
45=0
186=
187=
126=
179=
180=
266=0
262=IDSERVICE:0:11886
262=DISPLAYNAME:0:check_vsp_process_health_VS_CommandProcessing
262=HOSTALIAS:0:VSP PREPROD
262=DOCUMENTATION:0:
262=INSTRUCTIONS:0:
999

After that everything is mixed up. The HELLO sequence you can see in the listing above comes from a 'service nagios reload' on the box where ndomod.o is running.

I tried to follow the code and debug it, and I wrote a special 'nagios simulator' (a small app in C++) that attaches directly to ndomod.o, but I found no explanation for this message mix-up.

I am out of ideas for now, but maybe you have an idea or experience with a similar problem? Or maybe somebody else has seen the same thing and found a solution?

My latest suspicion is that after the Nagios (4.2.1) reload there was more than one 'nagios with ndomod.o' sub-process running on the Nagios box (one that failed to be killed during the reload, one created after it), and that for some reason they both wrote to the same port and the same ndo2db process. Unfortunately the box was restarted in the meantime, so I can't confirm that hypothesis right now.

Right now I am waiting for another occurrence of the mysql_error to check whether multiple Nagios processes with ndomod.o are running and whether that can be the cause.

Can you please tell me if you have another idea for how to investigate the problem?

Just for reference, we never saw anything like this with ndoutils 1.4b7 and Nagios 3.5.

I am willing to assist with any information you need, any debugging I can do, or any logs I can give you (this small exercise produced 1 GB of logs).

Thanks in advance,

Daniel

Coservit commented 7 years ago

I found the culprit behind the mixed-up messages.

The problem is the call to ftok inside get_queue_id. With the current parameters (".", 9504 + pid), ftok returns the same key for all pids that share the same least significant byte.

ftok takes only the last byte of the id parameter (9504 + pid in our case) into account. By unfortunate coincidence, two active processes (23635 and 17747) tried to use the same IPC queue with the same key (1929423995).

I wrote a small program that only calls ftok to generate the key for pid = 23635 and pid = 17747. The result is:

key for 23635: 1929423995
key for 17747: 1929423995

So, as you can see, the two numbers collide because their last bytes are equal:

hex(9504+23635) = 0x8173
hex(9504+17747) = 0x6a73

Only the 0x73 is taken into account, and since the first parameter is always simply ".", we have a collision.
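A minimal sketch of such a test program's core is shown below. The helper names key_for_pid, significant_byte and pids_collide are hypothetical; only the ftok(".", 9504 + pid) call mirrors the parameters quoted above. It relies on the POSIX rule that only the low-order 8 bits of the id argument are significant.

```c
#include <sys/types.h>
#include <sys/ipc.h>

/* Base value added to the pid, as quoted in the description above. */
#define QUEUE_ID_BASE 9504

/* Key derived with the same parameters as the original call:
 * ftok(".", 9504 + pid). Returns (key_t)-1 on failure. */
key_t key_for_pid(int pid) {
    return ftok(".", QUEUE_ID_BASE + pid);
}

/* The only part of the id argument that ftok is required to use. */
int significant_byte(int pid) {
    return (QUEUE_ID_BASE + pid) & 0xff;
}

/* Returns 1 if both pids yield the same valid key, 0 otherwise. */
int pids_collide(int pid_a, int pid_b) {
    key_t a = key_for_pid(pid_a);
    key_t b = key_for_pid(pid_b);
    return a != (key_t)-1 && a == b;
}
```

For the two pids from this report, significant_byte returns 0x73 in both cases, so pids_collide(23635, 17747) is expected to return 1 when both calls run in the same working directory. The numeric key itself depends on the directory's inode and device, so it will generally differ from 1929423995 on other machines.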

I changed the code to create a file in the /tmp folder with the pid in its name, for example /tmp/ndo2db.24655.ipc.id, and to pass that file to ftok instead of ".", thus increasing the entropy of the key.
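To see why a distinct file helps, it is useful to look at how the key is put together. The sketch below assumes the glibc layout (low 8 bits of the id, low 8 bits of the file's device number, low 16 bits of its inode number); POSIX does not specify this composition, and the names key_parts, compose_key and split_key are hypothetical.

```c
#include <stdint.h>

/* Fields of a System V IPC key under the glibc ftok layout.
 * This layout is an assumption; other C libraries may differ. */
struct key_parts {
    unsigned proj;    /* low 8 bits of the id argument */
    unsigned dev;     /* low 8 bits of the file's st_dev */
    unsigned inode;   /* low 16 bits of the file's st_ino */
};

/* Build a key the way glibc's ftok is assumed to build it. */
uint32_t compose_key(unsigned proj_id, unsigned dev, unsigned long inode) {
    return ((uint32_t)(proj_id & 0xff) << 24)
         | ((uint32_t)(dev & 0xff) << 16)
         | (uint32_t)(inode & 0xffff);
}

/* Split a key back into its three fields. */
struct key_parts split_key(uint32_t key) {
    struct key_parts p;
    p.proj = (key >> 24) & 0xff;
    p.dev = (key >> 16) & 0xff;
    p.inode = key & 0xffff;
    return p;
}
```

Under that assumption, the reported key 1929423995 is 0x7300ac7b: the top byte is the shared 0x73, and the remaining bits come from the device and inode of ".", which are the same for every process started in that directory. A per-pid file contributes its own inode bits, so two processes with the same low id byte no longer get the same key unless the low 16 bits of their inodes also happen to match.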

Here is the new code of get_queue_id:

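/* Path of the per-process key file; created on first use, removed in del_queue. */
static char *queue_idfile_path = NULL;

/* Derive the IPC queue from a per-pid key file instead of ".".
   'id' is the pid passed by the caller. queue_id and NDO_QUEUE_ID are
   assumed to be the existing file-level variable and macro.
   asprintf needs _GNU_SOURCE. Returns the queue id, or -1 on error. */
int get_queue_id(int id) {

    int idfile;

    if (queue_idfile_path == NULL)
    {
        /* asprintf leaves the pointer undefined on failure, so reset it */
        if (asprintf(&queue_idfile_path, "/tmp/ndo2db.%d.ipc.id", id) < 0) {
            queue_idfile_path = NULL;
            return -1;
        }

        /* ftok needs an existing file, so create it if it is missing */
        idfile = open(queue_idfile_path, O_RDWR | O_CREAT, S_IWUSR | S_IRUSR);
        if (idfile < 0) {
            free(queue_idfile_path);
            queue_idfile_path = NULL;
        } else {
            close(idfile);
        }
    }

    if (queue_idfile_path == NULL)
        return -1;

    key_t key = ftok(queue_idfile_path, NDO_QUEUE_ID+id);

    if (key == -1 || (queue_id = msgget(key, IPC_CREAT | 0600)) < 0) {
        syslog(LOG_ERR,"Error: queue init error.\n");
        /* do not hand back a stale queue_id when ftok itself failed */
        return -1;
    }

    return queue_id;
}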

Removal of the ipc.id file is done in del_queue:

void del_queue() {
.
.
.

    if (queue_idfile_path != NULL)
    {
        unlink(queue_idfile_path);
        free(queue_idfile_path);
        queue_idfile_path = NULL;
    }

}

The added per-pid file should ensure that each process gets its own IPC key.

Or maybe we should consider adding support for POSIX queues (mq_open, etc.)? I am not entirely sure that we can use POSIX queues in the code, for compatibility reasons. Possibly they could be enabled with a compilation macro set through ./configure?

jfrickson commented 7 years ago

Good find! Can you do a pull request? That way you get better credit.

Coservit commented 7 years ago

Sorry for the delay. I created PR #29. It also contains a fix for a compilation error when --enable-ssl is passed to ./configure.

hedenface commented 7 years ago

I'm closing the comment since the real action is in the PR. Thanks!

hedenface commented 7 years ago

I'm going to say this, and only this:

NDO has been approved for a major overhaul (can you say multithreaded neb module with no weird abstraction, anyone?). I'll be getting started on this pretty soon.

I merged the PR #29 so that it is in there for reference for the rewrite. I didn't test it, personally. So if someone can confirm it works on say gentoo or cent or rhel, that'd be cool.

NagiosEnterprises / ndoutils

Wrong message order in ndo2db #24