collectd / collectd

The system statistics collection daemon. Please send Pull Requests here!
http://collectd.org
Other
3.19k stars 1.23k forks source link

apcups plugin: socket errors when poll time is 'too large' #617

Closed hhoffstaette closed 8 years ago

hhoffstaette commented 10 years ago

I'm seeing a weird error with the apcups plugin on otherwise perfectly working Gentoo. When I activate the plugin - with or without explicit configuration - it complains:

[2014-05-19 10:31:22] [error] apcups plugin: Reading from socket failed: Bad file descriptor
[2014-05-19 10:31:22] [notice] read-function of plugin `apcups' failed. Will suspend it for 120.000 seconds.

The apcupsd daemon is of course running, accessible and working fine; apcaccess and the CGIs work correctly, as root or regular user.

In order to debug this I rigged up a simple in-tree standalone executable that just links against the apcups plugin's .o (and required friends) and runs the socket code. The result: everything works!

// un-static-fied in apcups.c & call to apc_submit() disabled
extern int apcups_config(const char* key, const char* value);
extern int apcups_read(void);

int main(int argc, char** argv)
{
    apcups_config("host", "localhost");
    apcups_config("port", "3551");
    return apcups_read();
}

The result, with the APCMAIN macro enabled and run against my live UPS, shows:

holger>./testapc                                                                                                      
net_recv = `APC      : 001,034,0864
';
  Found property: name = APC; value = 1.000000;
net_recv = `DATE     : 2014-05-19 15:35:10 +0200  
';
  Found property: name = DATE; value = 2014.000000;
  Found property: name = 15; value = 35.000000;
net_recv = `HOSTNAME : tux
';
  Found property: name = HOSTNAME; value = 0.000000;
net_recv = `VERSION  : 3.14.10 (13 September 2011) gentoo
';
  Found property: name = VERSION; value = 3.140000;
net_recv = `UPSNAME  : tux
';
  Found property: name = UPSNAME; value = 0.000000;
net_recv = `CABLE    : USB Cable
';
  Found property: name = CABLE; value = 0.000000;
net_recv = `DRIVER   : USB UPS Driver
';
  Found property: name = DRIVER; value = 0.000000;
net_recv = `UPSMODE  : Stand Alone
';
  Found property: name = UPSMODE; value = 0.000000;
net_recv = `STARTTIME: 2014-05-18 21:46:08 +0200  
';
  Found property: name = STARTTIME; value = 2014.000000;
  Found property: name = 21; value = 46.000000;
net_recv = `MODEL    : Back-UPS ES 700G 
';
  Found property: name = MODEL; value = 0.000000;
net_recv = `STATUS   : ONLINE 
';
  Found property: name = STATUS; value = 0.000000;
net_recv = `LINEV    : 236.0 Volts
';
  Found property: name = LINEV; value = 236.000000;
net_recv = `LOADPCT  :  13.0 Percent Load Capacity
';
  Found property: name = LOADPCT; value = 13.000000;
net_recv = `BCHARGE  : 100.0 Percent
';
  Found property: name = BCHARGE; value = 100.000000;
net_recv = `TIMELEFT :  29.6 Minutes
';
  Found property: name = TIMELEFT; value = 29.600000;
net_recv = `MBATTCHG : 0 Percent
';
  Found property: name = MBATTCHG; value = 0.000000;
net_recv = `MINTIMEL : 0 Minutes
';
  Found property: name = MINTIMEL; value = 0.000000;
net_recv = `MAXTIME  : 0 Seconds
';
  Found property: name = MAXTIME; value = 0.000000;
net_recv = `SENSE    : Medium
';
  Found property: name = SENSE; value = 0.000000;
net_recv = `LOTRANS  : 180.0 Volts
';
  Found property: name = LOTRANS; value = 180.000000;
net_recv = `HITRANS  : 266.0 Volts
';
  Found property: name = HITRANS; value = 266.000000;
net_recv = `ALARMDEL : 30 seconds
';
  Found property: name = ALARMDEL; value = 30.000000;
net_recv = `BATTV    : 13.7 Volts
';
  Found property: name = BATTV; value = 13.700000;
net_recv = `LASTXFER : Automatic or explicit self test
';
  Found property: name = LASTXFER; value = 0.000000;
net_recv = `NUMXFERS : 0
';
  Found property: name = NUMXFERS; value = 0.000000;
net_recv = `TONBATT  : 0 seconds
';
  Found property: name = TONBATT; value = 0.000000;
net_recv = `CUMONBATT: 0 seconds
';
  Found property: name = CUMONBATT; value = 0.000000;
net_recv = `XOFFBATT : N/A
';
  Found property: name = XOFFBATT; value = 0.000000;
net_recv = `STATFLAG : 0x07000008 Status Flag
';
  Found property: name = STATFLAG; value = 117440520.000000;
net_recv = `SERIALNO : 5B1143T07350  
';
  Found property: name = SERIALNO; value = 5.000000;
net_recv = `BATTDATE : 2012-04-15
';
  Found property: name = BATTDATE; value = 2012.000000;
net_recv = `NOMINV   : 230 Volts
';
  Found property: name = NOMINV; value = 230.000000;
net_recv = `NOMBATTV :  12.0 Volts
';
  Found property: name = NOMBATTV; value = 12.000000;
net_recv = `FIRMWARE : 871.O2 .I USB FW:O2
';
  Found property: name = FIRMWARE; value = 871.000000;
  Found property: name = .I USB FW; value = 0.000000;
net_recv = `END APC  : 2014-05-19 15:35:30 +0200  
';
  Found property: name = END; value = 0.000000;
  Found property: name =  ; value = 2014.000000;
  Found property: name = 15; value = 35.000000;

As you can see the values are being read properly. What I find curious is the last line ("END APC : ...") seems to be mistreated by the tokenizer and might be the reason for the ensuing confusion, despite the fact that only select key/value paris are considered. However, even repeated calls to apc_read() in a loop don't result in socket errors or crashes, so reusing the socket or socket buffer overruns are apparently not the issue.

Any ideas what might be going on here?

hhoffstaette commented 10 years ago

This error is caused when the OS has a net.ipv4.tcp_fin_timeout < the plugin's polling interval:

My server has net.ipv4.tcp_fin_timeout configured to a lower-than-default value, and with a collectd polling interval larger than that value I see errors. Polling faster works repeatably, without errors.

Switching to close_socket=1 by default avoids this problem in the first place, and IMHO should be the default.

ConiKost commented 10 years ago

I can confirm this. I am also running Gentoo.

MorphBonehunter commented 9 years ago

increasing the net.ipv4.tcp_fin_timeout seems not to work for me on my arch linux arm

cat /proc/sys/net/ipv4/tcp_fin_timeout
70

my poll intervall is 60s and i can run the poll only one time at start:

Jun 07 11:26:56 systemd[1]: Starting System statistics collection daemon...
Jun 07 11:26:56 systemd[1]: Started System statistics collection daemon.
Jun 07 11:26:57 collectd[23760]: Initialization complete, entering read-loop.
Jun 07 11:27:57 collectd[23760]: apcups plugin: Reading from socket failed: Bad file descriptor
Jun 07 11:27:57 collectd[23760]: read-function of plugin `apcups' failed. Will suspend it for 120.000 seconds.

I also try to increase the poll interval and load the apcups plugin with an interval of 30s...same behavior:

Jun 07 11:33:27 systemd[1]: Started System statistics collection daemon.
Jun 07 11:33:27 collectd[24011]: Initialization complete, entering read-loop.
Jun 07 11:33:57 collectd[24011]: apcups plugin: Reading from socket failed: Bad file descriptor
Jun 07 11:33:57 collectd[24011]: read-function of plugin `apcups' failed. Will suspend it for 60.000 seconds.

At the moment this is a kind of one shot plugin.

EDIT: this is interesting, i leave it running (cooking for the kids) and found this after a couple minutes:

Jun 07 11:39:57 collectd[24011]: apcups plugin: There have been 5 retries in the first 10 iterations. Will close the socket in future iterations.
Jun 07 11:39:57 collectd[24011]: apcups plugin: Reading from socket failed: Bad file descriptor
Jun 07 11:39:57 collectd[24011]: read-function of plugin `apcups' failed. Will suspend it for 60.000 seconds.

After that i haven't any errors in the log anymore and the metrics are collected.

hhoffstaette commented 9 years ago

@MorphBonehunter No idea why this behaves differently in your case, however with so many variations in yet consistely wrong behaviour it seems to me the sanest course of action is to:

octo commented 9 years ago

Hi everybody, thanks for reporting this problem. I've implemented the PersistentConnection option which will make is possible for you to configure the open/read/close method if that works better for you. It's still weird that the apcups plugin is having this problem and other plugins work fine. Maybe apcupsd doesn't close the socket fully so that the plugin runs into a timeout?

@hhoffstaette The retry / backoff logic is common to all read plugins and not specific to the apcups plugin. It is very useful when plugins don't work for periods of time and retrying is costly. This is common for (unclean) network partitions, for example.

hhoffstaette commented 9 years ago

@octo I understand the (well meant!) reasons for the persistent connection in many plugins. My point was that the default setting does not work reliably and nobody really knows why. Whether it's the parser (just saw that it's now fixed - thanks!) and/or possibly different behaviour by different versions of apcupsd on different models or distros is not really the issue - doing the right thing by default is. Anyone who wants to further tune the behaviour should be able to do so, provided it doesn't break things - no argument there. :)

octo commented 9 years ago

@hhoffstaette well, changing the default of the new PersistentConnection wouldn't break backwards compatibility by our definition, so we can definitely talk about that.

I would still like to understand what the problem is. Can anybody provide a .pcap file of broken / aborted communication between collectd and apcupsd? // ping @MorphBonehunter

hhoffstaette commented 9 years ago

I no longer run collectd and therefore cannot provide a pcap dump, but have a suspicion that fixing the token parser might have helped. It's all very ugly, but if you want to follow along:

I suspect that either the token parser getting confused and/or the NULL terminator are the reason why apcupsd always seems to close the socket after every request.

MorphBonehunter commented 9 years ago

Ok, i've captured a "full cycle" of the communication with apcupsd. I've started an tcpdump, start collectd, wait until the log says "Will close the socket in future iterations." and end the tcpdump a little bit later (so there should be 2-3 intervals which collected with closing the socket). The log is here to compare the times:

Jun  8 19:50:06 collectd[17383]: Initialization complete, entering read-loop.
Jun  8 19:51:06 collectd[17383]: apcups plugin: Reading from socket failed: Bad file descriptor
Jun  8 19:51:06 collectd[17383]: read-function of plugin `apcups' failed. Will suspend it for 120.000 seconds.
Jun  8 19:54:06 collectd[17383]: apcups plugin: Reading from socket failed: Bad file descriptor
Jun  8 19:54:06 collectd[17383]: read-function of plugin `apcups' failed. Will suspend it for 120.000 seconds.
Jun  8 19:57:06 collectd[17383]: apcups plugin: Reading from socket failed: Bad file descriptor
Jun  8 19:57:06 collectd[17383]: read-function of plugin `apcups' failed. Will suspend it for 120.000 seconds.
Jun  8 20:00:06 collectd[17383]: apcups plugin: Reading from socket failed: Bad file descriptor
Jun  8 20:00:06 collectd[17383]: read-function of plugin `apcups' failed. Will suspend it for 120.000 seconds.
Jun  8 20:03:06 collectd[17383]: apcups plugin: There have been 5 retries in the first 10 iterations. Will close the socket in future iterations.
Jun  8 20:03:06 collectd[17383]: apcups plugin: Reading from socket failed: Bad file descriptor
Jun  8 20:03:06 collectd[17383]: read-function of plugin `apcups' failed. Will suspend it for 120.000 seconds.

You could find the dump file here: https://drive.google.com/file/d/0BxXpdkbILl3SRVNkMU8zZjFYYnc/view?usp=sharing

I've take a look at the dump and if i'm not wrong the problem is, that apcupsd sends an FIN after the status was delivered to collectd but collectd doesn't catch that and try to reuse the connection after the specified interval which is answered with RST from apcupsd.

octo commented 9 years ago

Thanks @MorphBonehunter!

After looking at the dump I'm actually not so sure fb075e15e3400afdb691b88f1e1a4a38578aaba9 is doing the right thing: I don't think we read the trailing two nul bytes, aborting the read loop prematurely.

rpv-tomsk commented 8 years ago

Hi!

I don't think we read the trailing two nul bytes, aborting the read loop prematurely.

Sure, confirm this, this patch breaks things!

As the socket has unread data, net_recv() call returns 0 due to packet_size == 0 in net_recv().

This leads to apc_query_server() also return 0, and default values are sent by apcups_read().

So, dispatched values are flapping between real data and default values.

As result, plugin is unuseable unless 'PersistentConnection false' is configured.

Please see debug logs attached:

root@flash:/opt/collectd# ./run option = PIDFile; value = /var/run/collectd-xen.pid; option = Hostname; value = test; option = FQDNLookup; value = false; option = PIDFile; value = /var/run/collectd-xen.pid; Configfile: IgnoringPIDFILE' option because command-line option -P' take precedence. Created new plugin context. [2016-03-07 03:05:37] plugin_load: plugin "network" successfully loaded. [2016-03-07 03:05:37] plugin_load: plugin "apcups" successfully loaded. [2016-03-07 03:05:37] Done parsing/opt/collectd/share/collectd/types.db' [2016-03-07 03:05:37] interval_g = 10.000; [2016-03-07 03:05:37] timeout_g = 2; [2016-03-07 03:05:37] hostname_g = test; [2016-03-07 03:05:37] Initialization complete, entering read-loop. [2016-03-07 03:05:37] plugin_read_thread: Handling apcups'. [2016-03-07 03:05:37] Created new plugin context. [2016-03-07 03:05:37] apcups plugin: Done opening a socket 3 net_recv =APC : 001,053,1353 ';net_recv = DATE : 2016-03-07 03:05:35 +0600 ';net_recv =HOSTNAME : flash ';net_recv = VERSION : 3.14.10 (13 September 2011) debian ';net_recv =UPSNAME : UPS_IDEN ';net_recv = CABLE : Custom Cable Smart ';net_recv =DRIVER : APC Smart UPS (any) ';net_recv = UPSMODE : Stand Alone ';net_recv =STARTTIME: 2015-10-22 16:42:07 +0600 ';net_recv = MODEL : SMART-UPS 1000 ';net_recv =STATUS : ONLINE ';net_recv = LINEV : 226.2 Volts '; Found property: name = LINEV; value = 226.200000; net_recv =LOADPCT : 21.8 Percent Load Capacity '; Found property: name = LOADPCT; value = 21.800000; net_recv = BCHARGE : 100.0 Percent '; Found property: name = BCHARGE; value = 100.000000; net_recv =TIMELEFT : 103.0 Minutes '; Found property: name = TIMELEFT; value = 103.000000; net_recv = MBATTCHG : 15 Percent ';net_recv =MINTIMEL : 10 Minutes ';net_recv = MAXTIME : 1800 Seconds ';net_recv =MAXLINEV : 226.2 Volts ';net_recv = MINLINEV : 224.9 Volts ';net_recv =OUTPUTV : 226.2 Volts '; Found property: name = OUTPUTV; value = 226.200000; net_recv = SENSE : High ';net_recv =DWAKE : 180 Seconds ';net_recv = DSHUTD : 020 Seconds ';net_recv =DLOWBATT : 05 Minutes ';net_recv = LOTRANS : 196.0 Volts ';net_recv =HITRANS : 253.0 Volts ';net_recv = RETPCT : 015.0 Percent ';net_recv =ITEMP : 35.5 C Internal '; Found property: name = ITEMP; value = 35.500000; net_recv = ALARMDEL : 30 seconds ';net_recv =BATTV : 27.6 Volts '; Found property: name = BATTV; value = 27.600000; net_recv = LINEFREQ : 50.0 Hz '; Found property: name = LINEFREQ; value = 50.000000; net_recv =LASTXFER : Automatic or explicit self test ';net_recv = NUMXFERS : 14 ';net_recv =XONBATT : 2016-02-03 04:19:49 +0600 ';net_recv = TONBATT : 0 seconds ';net_recv =CUMONBATT: 1682 seconds ';net_recv = XOFFBATT : 2016-02-03 04:19:50 +0600 ';net_recv =LASTSTEST: 2015-11-03 10:12:58 +0600 ';net_recv = SELFTEST : NO ';net_recv =STESTI : 336 ';net_recv = STATFLAG : 0x07000008 Status Flag ';net_recv =DIPSW : 0x00 Dip Switch ';net_recv = REG1 : 0x00 Register 1 ';net_recv =REG2 : 0x00 Register 2 ';net_recv = REG3 : 0x00 Register 3 ';net_recv =MANDATE : 10/04/00 ';net_recv = SERIALNO : QS0041213583 ';net_recv =BATTDATE : 15/05/13 ';net_recv = NOMOUTV : 230 Volts ';net_recv =NOMBATTV : 24.0 Volts ';net_recv = EXTBATTS : 0 ';net_recv =FIRMWARE : 60.11.I ';net_recv = END APC : 2016-03-07 03:05:37 +0600 ';[2016-03-07 03:05:37] plugin_read_thread: read-function of theapcups' plugin took 0.642169 seconds. [2016-03-07 03:05:37] Created new plugin context. [2016-03-07 03:05:37] plugin_dispatch_values: time = 1457298337.376; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = voltage; type_instance = input; [2016-03-07 03:05:37] Created new plugin context. [2016-03-07 03:05:37] plugin_read_thread: Effective interval of the apcups' plugin is 10.000 seconds. [2016-03-07 03:05:37] plugin_read_thread: Next read of theapcups' plugin at 1457298346.733. [2016-03-07 03:05:37] Created new plugin context. [2016-03-07 03:05:37] Created new plugin context. [2016-03-07 03:05:37] plugin_dispatch_values: time = 1457298337.376; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = charge; type_instance = ; [2016-03-07 03:05:37] uc_insert: Added test/apcups/voltage-input to the cache. [2016-03-07 03:05:37] plugin_dispatch_values: time = 1457298337.376; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = voltage; type_instance = output; [2016-03-07 03:05:37] plugin: plugin_write: Writing values via network. [2016-03-07 03:05:37] plugin_dispatch_values: time = 1457298337.376; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = timeleft; type_instance = ; [2016-03-07 03:05:37] uc_insert: Added test/apcups/timeleft to the cache. [2016-03-07 03:05:37] plugin: plugin_write: Writing values via network. [2016-03-07 03:05:37] plugin_dispatch_values: time = 1457298337.376; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = temperature; type_instance = ; [2016-03-07 03:05:37] uc_insert: Added test/apcups/temperature to the cache. [2016-03-07 03:05:37] plugin: plugin_write: Writing values via network. [2016-03-07 03:05:37] plugin_dispatch_values: time = 1457298337.376; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = frequency; type_instance = input; [2016-03-07 03:05:37] uc_insert: Added test/apcups/frequency-input to the cache. [2016-03-07 03:05:37] plugin: plugin_write: Writing values via network. [2016-03-07 03:05:37] Created new plugin context. [2016-03-07 03:05:37] plugin_dispatch_values: time = 1457298337.376; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = percent; type_instance = load; [2016-03-07 03:05:37] uc_insert: Added test/apcups/percent-load to the cache. [2016-03-07 03:05:37] uc_insert: Added test/apcups/voltage-output to the cache. [2016-03-07 03:05:37] plugin: plugin_write: Writing values via network. [2016-03-07 03:05:37] uc_insert: Added test/apcups/charge to the cache. [2016-03-07 03:05:37] plugin: plugin_write: Writing values via network. [2016-03-07 03:05:37] plugin: plugin_write: Writing values via network. [2016-03-07 03:05:37] plugin_dispatch_values: time = 1457298337.376; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = voltage; type_instance = battery; [2016-03-07 03:05:37] uc_insert: Added test/apcups/voltage-battery to the cache. [2016-03-07 03:05:37] plugin: plugin_write: Writing values via network.

/// DEFAULT VALUES DISPATCHED

[2016-03-07 03:05:47] plugin_read_thread: Handling apcups'. [2016-03-07 03:05:47] plugin_read_thread: read-function of theapcups' plugin took 0.050284 seconds. [2016-03-07 03:05:47] plugin_dispatch_values: time = 1457298346.783; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = voltage; type_instance = battery; [2016-03-07 03:05:47] uc_update: test/apcups/voltage-battery: ds[0] = -1.000000 [2016-03-07 03:05:47] plugin: plugin_write: Writing values via network. [2016-03-07 03:05:47] plugin_dispatch_values: time = 1457298346.783; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = percent; type_instance = load; [2016-03-07 03:05:47] plugin_dispatch_values: time = 1457298346.783; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = timeleft; type_instance = ; [2016-03-07 03:05:47] plugin_dispatch_values: time = 1457298346.783; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = voltage; type_instance = input; [2016-03-07 03:05:47] plugin_dispatch_values: time = 1457298346.783; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = voltage; type_instance = output; [2016-03-07 03:05:47] plugin_read_thread: Effective interval of the apcups' plugin is 10.000 seconds. [2016-03-07 03:05:47] plugin_read_thread: Next read of theapcups' plugin at 1457298356.733. [2016-03-07 03:05:47] plugin_dispatch_values: time = 1457298346.783; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = charge; type_instance = ; [2016-03-07 03:05:47] uc_update: test/apcups/percent-load: ds[0] = -1.000000 [2016-03-07 03:05:47] plugin: plugin_write: Writing values via network. [2016-03-07 03:05:47] plugin_dispatch_values: time = 1457298346.783; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = temperature; type_instance = ; [2016-03-07 03:05:47] uc_update: test/apcups/temperature: ds[0] = -300.000000 [2016-03-07 03:05:47] plugin: plugin_write: Writing values via network. [2016-03-07 03:05:47] plugin_dispatch_values: time = 1457298346.783; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = frequency; type_instance = input; [2016-03-07 03:05:47] uc_update: test/apcups/frequency-input: ds[0] = -1.000000 [2016-03-07 03:05:47] plugin: plugin_write: Writing values via network. [2016-03-07 03:05:47] uc_update: test/apcups/timeleft: ds[0] = nan [2016-03-07 03:05:47] plugin: plugin_write: Writing values via network. [2016-03-07 03:05:47] uc_update: test/apcups/voltage-input: ds[0] = -1.000000 [2016-03-07 03:05:47] plugin: plugin_write: Writing values via network. [2016-03-07 03:05:47] uc_update: test/apcups/voltage-output: ds[0] = -1.000000 [2016-03-07 03:05:47] plugin: plugin_write: Writing values via network. [2016-03-07 03:05:47] uc_update: test/apcups/charge: ds[0] = -1.000000 [2016-03-07 03:05:47] plugin: plugin_write: Writing values via network.

/// REAL DATA AGAIN

[2016-03-07 03:05:57] plugin_read_thread: Handling apcups'. net_recv =APC : 001,053,1353 ';net_recv = DATE : 2016-03-07 03:05:35 +0600 ';net_recv =HOSTNAME : flash ';net_recv = VERSION : 3.14.10 (13 September 2011) debian ';net_recv =UPSNAME : UPS_IDEN ';net_recv = CABLE : Custom Cable Smart ';net_recv =DRIVER : APC Smart UPS (any) ';net_recv = UPSMODE : Stand Alone ';net_recv =STARTTIME: 2015-10-22 16:42:07 +0600 ';net_recv = MODEL : SMART-UPS 1000 ';net_recv =STATUS : ONLINE ';net_recv = LINEV : 226.2 Volts '; Found property: name = LINEV; value = 226.200000; net_recv =LOADPCT : 21.8 Percent Load Capacity '; Found property: name = LOADPCT; value = 21.800000; net_recv = BCHARGE : 100.0 Percent '; Found property: name = BCHARGE; value = 100.000000; net_recv =TIMELEFT : 103.0 Minutes '; Found property: name = TIMELEFT; value = 103.000000; net_recv = MBATTCHG : 15 Percent ';net_recv =MINTIMEL : 10 Minutes ';net_recv = MAXTIME : 1800 Seconds ';net_recv =MAXLINEV : 226.2 Volts ';net_recv = MINLINEV : 224.9 Volts ';net_recv =OUTPUTV : 226.2 Volts '; Found property: name = OUTPUTV; value = 226.200000; net_recv = SENSE : High ';net_recv =DWAKE : 180 Seconds ';net_recv = DSHUTD : 020 Seconds ';net_recv =DLOWBATT : 05 Minutes ';net_recv = LOTRANS : 196.0 Volts ';net_recv =HITRANS : 253.0 Volts ';net_recv = RETPCT : 015.0 Percent ';net_recv =ITEMP : 35.5 C Internal '; Found property: name = ITEMP; value = 35.500000; net_recv = ALARMDEL : 30 seconds ';net_recv =BATTV : 27.6 Volts '; Found property: name = BATTV; value = 27.600000; net_recv = LINEFREQ : 50.0 Hz '; Found property: name = LINEFREQ; value = 50.000000; net_recv =LASTXFER : Automatic or explicit self test ';net_recv = NUMXFERS : 14 ';net_recv =XONBATT : 2016-02-03 04:19:49 +0600 ';net_recv = TONBATT : 0 seconds ';net_recv =CUMONBATT: 1682 seconds ';net_recv = XOFFBATT : 2016-02-03 04:19:50 +0600 ';net_recv =LASTSTEST: 2015-11-03 10:12:58 +0600 ';net_recv = SELFTEST : NO ';net_recv =STESTI : 336 ';net_recv = STATFLAG : 0x07000008 Status Flag ';net_recv =DIPSW : 0x00 Dip Switch ';net_recv = REG1 : 0x00 Register 1 ';net_recv =REG2 : 0x00 Register 2 ';net_recv = REG3 : 0x00 Register 3 ';net_recv =MANDATE : 10/04/00 ';net_recv = SERIALNO : QS0041213583 ';net_recv =BATTDATE : 15/05/13 ';net_recv = NOMOUTV : 230 Volts ';net_recv =NOMBATTV : 24.0 Volts ';net_recv = EXTBATTS : 0 ';net_recv =FIRMWARE : 60.11.I ';net_recv = END APC : 2016-03-07 03:05:46 +0600 ';[2016-03-07 03:05:57] plugin_read_thread: read-function of theapcups' plugin took 0.050556 seconds. [2016-03-07 03:05:57] plugin_dispatch_values: time = 1457298356.783; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = voltage; type_instance = input; [2016-03-07 03:05:57] uc_update: test/apcups/voltage-input: ds[0] = 226.200000 [2016-03-07 03:05:57] plugin_dispatch_values: time = 1457298356.783; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = voltage; type_instance = output; [2016-03-07 03:05:57] uc_update: test/apcups/voltage-output: ds[0] = 226.200000 [2016-03-07 03:05:57] plugin: plugin_write: Writing values via network. [2016-03-07 03:05:57] plugin_dispatch_values: time = 1457298356.783; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = charge; type_instance = ; [2016-03-07 03:05:57] uc_update: test/apcups/charge: ds[0] = 100.000000 [2016-03-07 03:05:57] plugin: plugin_write: Writing values via network. [2016-03-07 03:05:57] plugin_dispatch_values: time = 1457298356.783; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = timeleft; type_instance = ; [2016-03-07 03:05:57] plugin_dispatch_values: time = 1457298356.783; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = temperature; type_instance = ; [2016-03-07 03:05:57] plugin_dispatch_values: time = 1457298356.783; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = percent; type_instance = load; [2016-03-07 03:05:57] plugin_dispatch_values: time = 1457298356.783; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = voltage; type_instance = battery; [2016-03-07 03:05:57] plugin: plugin_write: Writing values via network. [2016-03-07 03:05:57] uc_update: test/apcups/timeleft: ds[0] = 6180.000000 [2016-03-07 03:05:57] plugin: plugin_write: Writing values via network. [2016-03-07 03:05:57] plugin_dispatch_values: time = 1457298356.783; interval = 10.000; host = test; plugin = apcups; plugin_instance = ; type = frequency; type_instance = input; [2016-03-07 03:05:57] plugin_read_thread: Effective interval of the apcups' plugin is 10.000 seconds. [2016-03-07 03:05:57] plugin_read_thread: Next read of theapcups' plugin at 1457298366.733. [2016-03-07 03:05:57] uc_update: test/apcups/temperature: ds[0] = 35.500000 [2016-03-07 03:05:57] plugin: plugin_write: Writing values via network. [2016-03-07 03:05:57] uc_update: test/apcups/percent-load: ds[0] = 21.800000 [2016-03-07 03:05:57] plugin: plugin_write: Writing values via network. [2016-03-07 03:05:57] uc_update: test/apcups/voltage-battery: ds[0] = 27.600000 [2016-03-07 03:05:57] plugin: plugin_write: Writing values via network. [2016-03-07 03:05:57] uc_update: test/apcups/frequency-input: ds[0] = 50.000000 [2016-03-07 03:05:57] plugin: plugin_write: Writing values via network. ^C[2016-03-07 03:05:59] Exiting normally. [2016-03-07 03:05:59] collectd: Stopping 5 read threads. [2016-03-07 03:05:59] plugin: stop_read_threads: Signalling read_cond' [2016-03-07 03:05:59] collectd: Stopping 5 write threads. [2016-03-07 03:05:59] plugin: stop_write_threads: Signallingwrite_cond' `

rpv-tomsk commented 8 years ago

Apcupsd NIS code is open, I'm looking into code of apcupsd-3.14.13 from https://sourceforge.net/projects/apcupsd/files/apcupsd%20-%20Stable/3.14.13/

src/apcnis.c:226 handle_client_request() is the function which handles connections from collectd plugin. There is an infinite loop of handling requests.

src/lib/apclibnis.c:61 read_nbytes() is the function which reads data from socket. It does read with hard-coded timeout of 15 sec. There is no other reasons to close socket by server unless connection is terminated by client or wrong command received.

The fb075e15e3400afdb691b88f1e1a4a38578aaba9 needs to be reverted. Result of net_recv() == 0 is the only reason to stop socket reading. Reverting that patch solves the problem with flapping when persistent connection is configured.

By default collectd has 'Interval 10'. When running collectd with 'Interval 20' (more than 15 sec) apcupsd closes connection by timeout before collectd makes its next request. As result, 'Bad file descriptor' message appears in log.

@hhoffstaette (reporter of this issue) has interval 60 seconds configured, as we can see from his log provided. (120 seconds from the line "read-function of plugin `apcups' failed. Will suspend it for 120.000 seconds." is the 60 seconds interval multiplied twice). That is the real reason of the reported problem.

I will offer a patch that will change the default value of 'PersistentConnection', depending on the configured read interval.

rpv-tomsk commented 8 years ago

Created #1701 as a solution for this issue.

mfournier commented 8 years ago

I believe @rpv-tomsk's patches from #1701 fixed this problem. Please test, and complain here if you still encounter problems !