Closed hhoffstaette closed 8 years ago
This error is caused when the OS has a net.ipv4.tcp_fin_timeout < the plugin's polling interval:
My server has net.ipv4.tcp_fin_timeout configured to a lower-than-default value, and with a collectd polling interval larger than that value I see errors. Polling faster works repeatably, without errors.
Switching to close_socket=1 by default avoids this problem in the first place, and IMHO should be the default.
I can confirm this. I am also running Gentoo.
increasing the net.ipv4.tcp_fin_timeout seems not to work for me on my arch linux arm
cat /proc/sys/net/ipv4/tcp_fin_timeout
70
my poll intervall is 60s and i can run the poll only one time at start:
Jun 07 11:26:56 systemd[1]: Starting System statistics collection daemon...
Jun 07 11:26:56 systemd[1]: Started System statistics collection daemon.
Jun 07 11:26:57 collectd[23760]: Initialization complete, entering read-loop.
Jun 07 11:27:57 collectd[23760]: apcups plugin: Reading from socket failed: Bad file descriptor
Jun 07 11:27:57 collectd[23760]: read-function of plugin `apcups' failed. Will suspend it for 120.000 seconds.
I also try to increase the poll interval and load the apcups plugin with an interval of 30s...same behavior:
Jun 07 11:33:27 systemd[1]: Started System statistics collection daemon.
Jun 07 11:33:27 collectd[24011]: Initialization complete, entering read-loop.
Jun 07 11:33:57 collectd[24011]: apcups plugin: Reading from socket failed: Bad file descriptor
Jun 07 11:33:57 collectd[24011]: read-function of plugin `apcups' failed. Will suspend it for 60.000 seconds.
At the moment this is a kind of one shot plugin.
EDIT: this is interesting, i leave it running (cooking for the kids) and found this after a couple minutes:
Jun 07 11:39:57 collectd[24011]: apcups plugin: There have been 5 retries in the first 10 iterations. Will close the socket in future iterations.
Jun 07 11:39:57 collectd[24011]: apcups plugin: Reading from socket failed: Bad file descriptor
Jun 07 11:39:57 collectd[24011]: read-function of plugin `apcups' failed. Will suspend it for 60.000 seconds.
After that i haven't any errors in the log anymore and the metrics are collected.
@MorphBonehunter No idea why this behaves differently in your case, however with so many variations in yet consistely wrong behaviour it seems to me the sanest course of action is to:
Hi everybody, thanks for reporting this problem. I've implemented the PersistentConnection option which will make is possible for you to configure the open/read/close method if that works better for you. It's still weird that the apcups plugin is having this problem and other plugins work fine. Maybe apcupsd doesn't close the socket fully so that the plugin runs into a timeout?
@hhoffstaette The retry / backoff logic is common to all read plugins and not specific to the apcups plugin. It is very useful when plugins don't work for periods of time and retrying is costly. This is common for (unclean) network partitions, for example.
@octo I understand the (well meant!) reasons for the persistent connection in many plugins. My point was that the default setting does not work reliably and nobody really knows why. Whether it's the parser (just saw that it's now fixed - thanks!) and/or possibly different behaviour by different versions of apcupsd on different models or distros is not really the issue - doing the right thing by default is. Anyone who wants to further tune the behaviour should be able to do so, provided it doesn't break things - no argument there. :)
@hhoffstaette well, changing the default of the new PersistentConnection wouldn't break backwards compatibility by our definition, so we can definitely talk about that.
I would still like to understand what the problem is. Can anybody provide a .pcap file of broken / aborted communication between collectd and apcupsd? // ping @MorphBonehunter
I no longer run collectd and therefore cannot provide a pcap dump, but have a suspicion that fixing the token parser might have helped. It's all very ugly, but if you want to follow along:
I suspect that either the token parser getting confused and/or the NULL terminator are the reason why apcupsd always seems to close the socket after every request.
Ok, i've captured a "full cycle" of the communication with apcupsd. I've started an tcpdump, start collectd, wait until the log says "Will close the socket in future iterations." and end the tcpdump a little bit later (so there should be 2-3 intervals which collected with closing the socket). The log is here to compare the times:
Jun 8 19:50:06 collectd[17383]: Initialization complete, entering read-loop.
Jun 8 19:51:06 collectd[17383]: apcups plugin: Reading from socket failed: Bad file descriptor
Jun 8 19:51:06 collectd[17383]: read-function of plugin `apcups' failed. Will suspend it for 120.000 seconds.
Jun 8 19:54:06 collectd[17383]: apcups plugin: Reading from socket failed: Bad file descriptor
Jun 8 19:54:06 collectd[17383]: read-function of plugin `apcups' failed. Will suspend it for 120.000 seconds.
Jun 8 19:57:06 collectd[17383]: apcups plugin: Reading from socket failed: Bad file descriptor
Jun 8 19:57:06 collectd[17383]: read-function of plugin `apcups' failed. Will suspend it for 120.000 seconds.
Jun 8 20:00:06 collectd[17383]: apcups plugin: Reading from socket failed: Bad file descriptor
Jun 8 20:00:06 collectd[17383]: read-function of plugin `apcups' failed. Will suspend it for 120.000 seconds.
Jun 8 20:03:06 collectd[17383]: apcups plugin: There have been 5 retries in the first 10 iterations. Will close the socket in future iterations.
Jun 8 20:03:06 collectd[17383]: apcups plugin: Reading from socket failed: Bad file descriptor
Jun 8 20:03:06 collectd[17383]: read-function of plugin `apcups' failed. Will suspend it for 120.000 seconds.
You could find the dump file here: https://drive.google.com/file/d/0BxXpdkbILl3SRVNkMU8zZjFYYnc/view?usp=sharing
I've take a look at the dump and if i'm not wrong the problem is, that apcupsd sends an FIN after the status was delivered to collectd but collectd doesn't catch that and try to reuse the connection after the specified interval which is answered with RST from apcupsd.
Thanks @MorphBonehunter!
After looking at the dump I'm actually not so sure fb075e15e3400afdb691b88f1e1a4a38578aaba9 is doing the right thing: I don't think we read the trailing two nul bytes, aborting the read loop prematurely.
Hi!
I don't think we read the trailing two nul bytes, aborting the read loop prematurely.
Sure, confirm this, this patch breaks things!
As the socket has unread data, net_recv() call returns 0 due to packet_size == 0 in net_recv().
This leads to apc_query_server() also return 0, and default values are sent by apcups_read().
So, dispatched values are flapping between real data and default values.
As result, plugin is unuseable unless 'PersistentConnection false' is configured.
Please see debug logs attached:
root@flash:/opt/collectd# ./run option = PIDFile; value = /var/run/collectd-xen.pid; option = Hostname; value = test; option = FQDNLookup; value = false; option = PIDFile; value = /var/run/collectd-xen.pid; Configfile: Ignoring
PIDFILE' option because command-line option -P' take precedence. Created new plugin context. [2016-03-07 03:05:37] plugin_load: plugin "network" successfully loaded. [2016-03-07 03:05:37] plugin_load: plugin "apcups" successfully loaded. [2016-03-07 03:05:37] Done parsing
/opt/collectd/share/collectd/types.db'
[2016-03-07 03:05:37] interval_g = 10.000;
[2016-03-07 03:05:37] timeout_g = 2;
[2016-03-07 03:05:37] hostname_g = test;
[2016-03-07 03:05:37] Initialization complete, entering read-loop.
[2016-03-07 03:05:37] plugin_read_thread: Handling apcups'. [2016-03-07 03:05:37] Created new plugin context. [2016-03-07 03:05:37] apcups plugin: Done opening a socket 3 net_recv =
APC : 001,053,1353
';net_recv = DATE : 2016-03-07 03:05:35 +0600 ';net_recv =
HOSTNAME : flash
';net_recv = VERSION : 3.14.10 (13 September 2011) debian ';net_recv =
UPSNAME : UPS_IDEN
';net_recv = CABLE : Custom Cable Smart ';net_recv =
DRIVER : APC Smart UPS (any)
';net_recv = UPSMODE : Stand Alone ';net_recv =
STARTTIME: 2015-10-22 16:42:07 +0600
';net_recv = MODEL : SMART-UPS 1000 ';net_recv =
STATUS : ONLINE
';net_recv = LINEV : 226.2 Volts '; Found property: name = LINEV; value = 226.200000; net_recv =
LOADPCT : 21.8 Percent Load Capacity
'; Found property: name = LOADPCT; value = 21.800000;
net_recv = BCHARGE : 100.0 Percent '; Found property: name = BCHARGE; value = 100.000000; net_recv =
TIMELEFT : 103.0 Minutes
'; Found property: name = TIMELEFT; value = 103.000000;
net_recv = MBATTCHG : 15 Percent ';net_recv =
MINTIMEL : 10 Minutes
';net_recv = MAXTIME : 1800 Seconds ';net_recv =
MAXLINEV : 226.2 Volts
';net_recv = MINLINEV : 224.9 Volts ';net_recv =
OUTPUTV : 226.2 Volts
'; Found property: name = OUTPUTV; value = 226.200000;
net_recv = SENSE : High ';net_recv =
DWAKE : 180 Seconds
';net_recv = DSHUTD : 020 Seconds ';net_recv =
DLOWBATT : 05 Minutes
';net_recv = LOTRANS : 196.0 Volts ';net_recv =
HITRANS : 253.0 Volts
';net_recv = RETPCT : 015.0 Percent ';net_recv =
ITEMP : 35.5 C Internal
'; Found property: name = ITEMP; value = 35.500000;
net_recv = ALARMDEL : 30 seconds ';net_recv =
BATTV : 27.6 Volts
'; Found property: name = BATTV; value = 27.600000;
net_recv = LINEFREQ : 50.0 Hz '; Found property: name = LINEFREQ; value = 50.000000; net_recv =
LASTXFER : Automatic or explicit self test
';net_recv = NUMXFERS : 14 ';net_recv =
XONBATT : 2016-02-03 04:19:49 +0600
';net_recv = TONBATT : 0 seconds ';net_recv =
CUMONBATT: 1682 seconds
';net_recv = XOFFBATT : 2016-02-03 04:19:50 +0600 ';net_recv =
LASTSTEST: 2015-11-03 10:12:58 +0600
';net_recv = SELFTEST : NO ';net_recv =
STESTI : 336
';net_recv = STATFLAG : 0x07000008 Status Flag ';net_recv =
DIPSW : 0x00 Dip Switch
';net_recv = REG1 : 0x00 Register 1 ';net_recv =
REG2 : 0x00 Register 2
';net_recv = REG3 : 0x00 Register 3 ';net_recv =
MANDATE : 10/04/00
';net_recv = SERIALNO : QS0041213583 ';net_recv =
BATTDATE : 15/05/13
';net_recv = NOMOUTV : 230 Volts ';net_recv =
NOMBATTV : 24.0 Volts
';net_recv = EXTBATTS : 0 ';net_recv =
FIRMWARE : 60.11.I
';net_recv = END APC : 2016-03-07 03:05:37 +0600 ';[2016-03-07 03:05:37] plugin_read_thread: read-function of the
apcups' plugin took 0.642169 seconds.
[2016-03-07 03:05:37] Created new plugin context.
[2016-03-07 03:05:37] plugin_dispatch_values: time = 1457298337.376; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = voltage; type_instance = input;
[2016-03-07 03:05:37] Created new plugin context.
[2016-03-07 03:05:37] plugin_read_thread: Effective interval of the apcups' plugin is 10.000 seconds. [2016-03-07 03:05:37] plugin_read_thread: Next read of the
apcups' plugin at 1457298346.733.
[2016-03-07 03:05:37] Created new plugin context.
[2016-03-07 03:05:37] Created new plugin context.
[2016-03-07 03:05:37] plugin_dispatch_values: time = 1457298337.376; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = charge; type_instance = ;
[2016-03-07 03:05:37] uc_insert: Added test/apcups/voltage-input to the cache.
[2016-03-07 03:05:37] plugin_dispatch_values: time = 1457298337.376; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = voltage; type_instance = output;
[2016-03-07 03:05:37] plugin: plugin_write: Writing values via network.
[2016-03-07 03:05:37] plugin_dispatch_values: time = 1457298337.376; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = timeleft; type_instance = ;
[2016-03-07 03:05:37] uc_insert: Added test/apcups/timeleft to the cache.
[2016-03-07 03:05:37] plugin: plugin_write: Writing values via network.
[2016-03-07 03:05:37] plugin_dispatch_values: time = 1457298337.376; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = temperature; type_instance = ;
[2016-03-07 03:05:37] uc_insert: Added test/apcups/temperature to the cache.
[2016-03-07 03:05:37] plugin: plugin_write: Writing values via network.
[2016-03-07 03:05:37] plugin_dispatch_values: time = 1457298337.376; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = frequency; type_instance = input;
[2016-03-07 03:05:37] uc_insert: Added test/apcups/frequency-input to the cache.
[2016-03-07 03:05:37] plugin: plugin_write: Writing values via network.
[2016-03-07 03:05:37] Created new plugin context.
[2016-03-07 03:05:37] plugin_dispatch_values: time = 1457298337.376; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = percent; type_instance = load;
[2016-03-07 03:05:37] uc_insert: Added test/apcups/percent-load to the cache.
[2016-03-07 03:05:37] uc_insert: Added test/apcups/voltage-output to the cache.
[2016-03-07 03:05:37] plugin: plugin_write: Writing values via network.
[2016-03-07 03:05:37] uc_insert: Added test/apcups/charge to the cache.
[2016-03-07 03:05:37] plugin: plugin_write: Writing values via network.
[2016-03-07 03:05:37] plugin: plugin_write: Writing values via network.
[2016-03-07 03:05:37] plugin_dispatch_values: time = 1457298337.376; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = voltage; type_instance = battery;
[2016-03-07 03:05:37] uc_insert: Added test/apcups/voltage-battery to the cache.
[2016-03-07 03:05:37] plugin: plugin_write: Writing values via network.
/// DEFAULT VALUES DISPATCHED
[2016-03-07 03:05:47] plugin_read_thread: Handling apcups'. [2016-03-07 03:05:47] plugin_read_thread: read-function of the
apcups' plugin took 0.050284 seconds.
[2016-03-07 03:05:47] plugin_dispatch_values: time = 1457298346.783; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = voltage; type_instance = battery;
[2016-03-07 03:05:47] uc_update: test/apcups/voltage-battery: ds[0] = -1.000000
[2016-03-07 03:05:47] plugin: plugin_write: Writing values via network.
[2016-03-07 03:05:47] plugin_dispatch_values: time = 1457298346.783; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = percent; type_instance = load;
[2016-03-07 03:05:47] plugin_dispatch_values: time = 1457298346.783; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = timeleft; type_instance = ;
[2016-03-07 03:05:47] plugin_dispatch_values: time = 1457298346.783; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = voltage; type_instance = input;
[2016-03-07 03:05:47] plugin_dispatch_values: time = 1457298346.783; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = voltage; type_instance = output;
[2016-03-07 03:05:47] plugin_read_thread: Effective interval of the apcups' plugin is 10.000 seconds. [2016-03-07 03:05:47] plugin_read_thread: Next read of the
apcups' plugin at 1457298356.733.
[2016-03-07 03:05:47] plugin_dispatch_values: time = 1457298346.783; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = charge; type_instance = ;
[2016-03-07 03:05:47] uc_update: test/apcups/percent-load: ds[0] = -1.000000
[2016-03-07 03:05:47] plugin: plugin_write: Writing values via network.
[2016-03-07 03:05:47] plugin_dispatch_values: time = 1457298346.783; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = temperature; type_instance = ;
[2016-03-07 03:05:47] uc_update: test/apcups/temperature: ds[0] = -300.000000
[2016-03-07 03:05:47] plugin: plugin_write: Writing values via network.
[2016-03-07 03:05:47] plugin_dispatch_values: time = 1457298346.783; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = frequency; type_instance = input;
[2016-03-07 03:05:47] uc_update: test/apcups/frequency-input: ds[0] = -1.000000
[2016-03-07 03:05:47] plugin: plugin_write: Writing values via network.
[2016-03-07 03:05:47] uc_update: test/apcups/timeleft: ds[0] = nan
[2016-03-07 03:05:47] plugin: plugin_write: Writing values via network.
[2016-03-07 03:05:47] uc_update: test/apcups/voltage-input: ds[0] = -1.000000
[2016-03-07 03:05:47] plugin: plugin_write: Writing values via network.
[2016-03-07 03:05:47] uc_update: test/apcups/voltage-output: ds[0] = -1.000000
[2016-03-07 03:05:47] plugin: plugin_write: Writing values via network.
[2016-03-07 03:05:47] uc_update: test/apcups/charge: ds[0] = -1.000000
[2016-03-07 03:05:47] plugin: plugin_write: Writing values via network.
/// REAL DATA AGAIN
[2016-03-07 03:05:57] plugin_read_thread: Handling apcups'. net_recv =
APC : 001,053,1353
';net_recv = DATE : 2016-03-07 03:05:35 +0600 ';net_recv =
HOSTNAME : flash
';net_recv = VERSION : 3.14.10 (13 September 2011) debian ';net_recv =
UPSNAME : UPS_IDEN
';net_recv = CABLE : Custom Cable Smart ';net_recv =
DRIVER : APC Smart UPS (any)
';net_recv = UPSMODE : Stand Alone ';net_recv =
STARTTIME: 2015-10-22 16:42:07 +0600
';net_recv = MODEL : SMART-UPS 1000 ';net_recv =
STATUS : ONLINE
';net_recv = LINEV : 226.2 Volts '; Found property: name = LINEV; value = 226.200000; net_recv =
LOADPCT : 21.8 Percent Load Capacity
'; Found property: name = LOADPCT; value = 21.800000;
net_recv = BCHARGE : 100.0 Percent '; Found property: name = BCHARGE; value = 100.000000; net_recv =
TIMELEFT : 103.0 Minutes
'; Found property: name = TIMELEFT; value = 103.000000;
net_recv = MBATTCHG : 15 Percent ';net_recv =
MINTIMEL : 10 Minutes
';net_recv = MAXTIME : 1800 Seconds ';net_recv =
MAXLINEV : 226.2 Volts
';net_recv = MINLINEV : 224.9 Volts ';net_recv =
OUTPUTV : 226.2 Volts
'; Found property: name = OUTPUTV; value = 226.200000;
net_recv = SENSE : High ';net_recv =
DWAKE : 180 Seconds
';net_recv = DSHUTD : 020 Seconds ';net_recv =
DLOWBATT : 05 Minutes
';net_recv = LOTRANS : 196.0 Volts ';net_recv =
HITRANS : 253.0 Volts
';net_recv = RETPCT : 015.0 Percent ';net_recv =
ITEMP : 35.5 C Internal
'; Found property: name = ITEMP; value = 35.500000;
net_recv = ALARMDEL : 30 seconds ';net_recv =
BATTV : 27.6 Volts
'; Found property: name = BATTV; value = 27.600000;
net_recv = LINEFREQ : 50.0 Hz '; Found property: name = LINEFREQ; value = 50.000000; net_recv =
LASTXFER : Automatic or explicit self test
';net_recv = NUMXFERS : 14 ';net_recv =
XONBATT : 2016-02-03 04:19:49 +0600
';net_recv = TONBATT : 0 seconds ';net_recv =
CUMONBATT: 1682 seconds
';net_recv = XOFFBATT : 2016-02-03 04:19:50 +0600 ';net_recv =
LASTSTEST: 2015-11-03 10:12:58 +0600
';net_recv = SELFTEST : NO ';net_recv =
STESTI : 336
';net_recv = STATFLAG : 0x07000008 Status Flag ';net_recv =
DIPSW : 0x00 Dip Switch
';net_recv = REG1 : 0x00 Register 1 ';net_recv =
REG2 : 0x00 Register 2
';net_recv = REG3 : 0x00 Register 3 ';net_recv =
MANDATE : 10/04/00
';net_recv = SERIALNO : QS0041213583 ';net_recv =
BATTDATE : 15/05/13
';net_recv = NOMOUTV : 230 Volts ';net_recv =
NOMBATTV : 24.0 Volts
';net_recv = EXTBATTS : 0 ';net_recv =
FIRMWARE : 60.11.I
';net_recv = END APC : 2016-03-07 03:05:46 +0600 ';[2016-03-07 03:05:57] plugin_read_thread: read-function of the
apcups' plugin took 0.050556 seconds.
[2016-03-07 03:05:57] plugin_dispatch_values: time = 1457298356.783; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = voltage; type_instance = input;
[2016-03-07 03:05:57] uc_update: test/apcups/voltage-input: ds[0] = 226.200000
[2016-03-07 03:05:57] plugin_dispatch_values: time = 1457298356.783; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = voltage; type_instance = output;
[2016-03-07 03:05:57] uc_update: test/apcups/voltage-output: ds[0] = 226.200000
[2016-03-07 03:05:57] plugin: plugin_write: Writing values via network.
[2016-03-07 03:05:57] plugin_dispatch_values: time = 1457298356.783; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = charge; type_instance = ;
[2016-03-07 03:05:57] uc_update: test/apcups/charge: ds[0] = 100.000000
[2016-03-07 03:05:57] plugin: plugin_write: Writing values via network.
[2016-03-07 03:05:57] plugin_dispatch_values: time = 1457298356.783; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = timeleft; type_instance = ;
[2016-03-07 03:05:57] plugin_dispatch_values: time = 1457298356.783; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = temperature; type_instance = ;
[2016-03-07 03:05:57] plugin_dispatch_values: time = 1457298356.783; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = percent; type_instance = load;
[2016-03-07 03:05:57] plugin_dispatch_values: time = 1457298356.783; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = voltage; type_instance = battery;
[2016-03-07 03:05:57] plugin: plugin_write: Writing values via network.
[2016-03-07 03:05:57] uc_update: test/apcups/timeleft: ds[0] = 6180.000000
[2016-03-07 03:05:57] plugin: plugin_write: Writing values via network.
[2016-03-07 03:05:57] plugin_dispatch_values: time = 1457298356.783; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = frequency; type_instance = input;
[2016-03-07 03:05:57] plugin_read_thread: Effective interval of the apcups' plugin is 10.000 seconds. [2016-03-07 03:05:57] plugin_read_thread: Next read of the
apcups' plugin at 1457298366.733.
[2016-03-07 03:05:57] uc_update: test/apcups/temperature: ds[0] = 35.500000
[2016-03-07 03:05:57] plugin: plugin_write: Writing values via network.
[2016-03-07 03:05:57] uc_update: test/apcups/percent-load: ds[0] = 21.800000
[2016-03-07 03:05:57] plugin: plugin_write: Writing values via network.
[2016-03-07 03:05:57] uc_update: test/apcups/voltage-battery: ds[0] = 27.600000
[2016-03-07 03:05:57] plugin: plugin_write: Writing values via network.
[2016-03-07 03:05:57] uc_update: test/apcups/frequency-input: ds[0] = 50.000000
[2016-03-07 03:05:57] plugin: plugin_write: Writing values via network.
^C[2016-03-07 03:05:59] Exiting normally.
[2016-03-07 03:05:59] collectd: Stopping 5 read threads.
[2016-03-07 03:05:59] plugin: stop_read_threads: Signalling read_cond' [2016-03-07 03:05:59] collectd: Stopping 5 write threads. [2016-03-07 03:05:59] plugin: stop_write_threads: Signalling
write_cond'
`
Apcupsd NIS code is open, I'm looking into code of apcupsd-3.14.13 from https://sourceforge.net/projects/apcupsd/files/apcupsd%20-%20Stable/3.14.13/
src/apcnis.c:226 handle_client_request() is the function which handles connections from collectd plugin. There is an infinite loop of handling requests.
src/lib/apclibnis.c:61 read_nbytes() is the function which reads data from socket. It does read with hard-coded timeout of 15 sec. There is no other reasons to close socket by server unless connection is terminated by client or wrong command received.
The fb075e15e3400afdb691b88f1e1a4a38578aaba9 needs to be reverted. Result of net_recv() == 0 is the only reason to stop socket reading. Reverting that patch solves the problem with flapping when persistent connection is configured.
By default collectd has 'Interval 10'. When running collectd with 'Interval 20' (more than 15 sec) apcupsd closes connection by timeout before collectd makes its next request. As result, 'Bad file descriptor' message appears in log.
@hhoffstaette (reporter of this issue) has interval 60 seconds configured, as we can see from his log provided. (120 seconds from the line "read-function of plugin `apcups' failed. Will suspend it for 120.000 seconds." is the 60 seconds interval multiplied twice). That is the real reason of the reported problem.
I will offer a patch that will change the default value of 'PersistentConnection', depending on the configured read interval.
Created #1701 as a solution for this issue.
I believe @rpv-tomsk's patches from #1701 fixed this problem. Please test, and complain here if you still encounter problems !
I'm seeing a weird error with the apcups plugin on otherwise perfectly working Gentoo. When I activate the plugin - with or without explicit configuration - it complains:
The apcupsd daemon is of course running, accessible and working fine; apcaccess and the CGIs work correctly, as root or regular user.
In order to debug this I rigged up a simple in-tree standalone executable that just links against the apcups plugin's .o (and required friends) and runs the socket code. The result: everything works!
The result, with the APCMAIN macro enabled and run against my live UPS, shows:
As you can see the values are being read properly. What I find curious is the last line ("END APC : ...") seems to be mistreated by the tokenizer and might be the reason for the ensuing confusion, despite the fact that only select key/value paris are considered. However, even repeated calls to apc_read() in a loop don't result in socket errors or crashes, so reusing the socket or socket buffer overruns are apparently not the issue.
Any ideas what might be going on here?