SteScho / manubulon-snmp

Set of Icinga/Nagios plugins to check hosts and hardware with the SNMP protocol.
GNU General Public License v2.0

SNMP retries not respected due to global timeout #76

Open meni2029 opened 4 years ago

meni2029 commented 4 years ago

Hello, I'm looking at check_snmp_storage.pl, but this most likely applies to most if not all of the plugins. From the code, I see that the script is forced to exit once the timeout (given as an argument, default 5 seconds) expires:

if (defined($o_timeout)) {
    verb("Alarm in $o_timeout seconds");
    alarm($o_timeout);
}

$SIG{'ALRM'} = sub {
    print "No answer from host $o_host:$o_port\n";
    exit $ERRORS{"UNKNOWN"};
};

The SNMP session uses the same timeout value and a retries value of 10:

        ($session, $error) = Net::SNMP->session(
            -hostname  => $o_host,
            -version   => 2,
            -community => $o_community,
            -port      => $o_port,
            -retries   => 10,
            -timeout   => $o_timeout,
            -domain    => $o_domain
        );

As I understand it, the retries cannot be honored: the script's alarm and the SNMP session share the same timeout, so the script is forced to exit as soon as the first SNMP attempt times out.

Am I right?
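A minimal sketch of the timing arithmetic behind this question, assuming each Net::SNMP attempt waits up to `-timeout` seconds and that `-retries` counts additional attempts after the first one. The helper names `worst_case_request_time` and `attempts_within_alarm` are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;

# Worst-case seconds a single SNMP request can block: the first attempt
# plus every retry may each wait the full timeout (assumption stated above).
sub worst_case_request_time {
    my ($timeout, $retries) = @_;
    return $timeout * ($retries + 1);
}

# How many attempts can run to their timeout before the alarm ends the script.
sub attempts_within_alarm {
    my ($timeout, $retries, $alarm) = @_;
    my $max_attempts = $retries + 1;
    my $fit          = int($alarm / $timeout);
    return $fit < $max_attempts ? $fit : $max_attempts;
}

my ($timeout, $retries) = (5, 10);    # values from the snippets above

printf "worst case for one request: %d s\n",
    worst_case_request_time($timeout, $retries);
printf "attempts that fit into a %d s alarm: %d\n",
    $timeout, attempts_within_alarm($timeout, $retries, $timeout);
```

Under these assumptions, an alarm equal to the per-attempt timeout leaves room for the first attempt only, while honoring all 10 retries for a single request would need 11 times the timeout.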

Expected Behavior

The script doesn't end before the SNMP retries have been executed.

Current Behavior

No SNMP retries are executed, because the script ends when the first attempt reaches the SNMP timeout.

Possible Solution

One solution would be to calculate a global timeout as $o_timeout * 10:

if (defined($o_timeout)) {
    my $global_timeout = $o_timeout * 10;
    verb("Alarm in $global_timeout seconds");
    alarm($global_timeout);
}

$SIG{'ALRM'} = sub {
    print "No answer from host $o_host:$o_port\n";
    exit $ERRORS{"UNKNOWN"};
};

Context

On one monitored Linux host we are getting "No answer from host ip:161" from time to time.

SteScho commented 4 years ago

Hi

For your context: just increase the timeout or adjust the check intervals and the retry count in your monitoring system.

In general: I don't know why the retry option is even set for this check; it is missing in others. And yes, in terms of time, the check only makes one attempt. I think that is enough: checks should finish quickly, and I would rather not wait for the retries at this point.

meni2029 commented 4 years ago

Hi @SteScho, thanks for your prompt reply.

In general: My point about SNMP retries is that one run of the check can involve more than 10 SNMP queries (depending on the number of storage partitions), and if one of them gets lost, the whole check fails with a timeout. On the other hand, I agree that checks should finish quickly, and 10 SNMP retries is certainly too many.

Our context: In the end I found out that our issue is not lost SNMP queries but a storage partition that is intermittently missing. When check_snmp_storage.pl runs, the partition disappears between the get of the index table and the get of the storage values --> expected OIDs missing --> timeout. As a workaround we filtered out the offending partition, as it is not an important one anyway.

You may close this issue.
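The race described above (an index is listed, then its values are gone) could in principle be handled by skipping incomplete entries. The following is only a sketch, not the plugin's code: `is_missing` and `split_by_availability` are hypothetical helpers, the result hash stands in for what a get request might return, and treating the strings `noSuchObject` and `noSuchInstance` as "missing" is an assumption about how the SNMP library reports absent OIDs. The two column OIDs are hrStorageSize and hrStorageUsed, used here purely as sample data.

```perl
use strict;
use warnings;

# True if a value carries no data (undefined, or an SNMPv2c exception
# string; the exact strings are an assumption of this sketch).
sub is_missing {
    my ($value) = @_;
    return 1 unless defined $value;
    return $value =~ /\A(?:noSuchObject|noSuchInstance)\z/ ? 1 : 0;
}

# Split table indexes into those with every column present and those
# where at least one column vanished between the two queries.
sub split_by_availability {
    my ($indexes, $result, @columns) = @_;
    my (@ok, @gone);
    for my $idx (@$indexes) {
        my $missing = grep { is_missing($result->{"$_.$idx"}) } @columns;
        if   ($missing) { push @gone, $idx }
        else            { push @ok,   $idx }
    }
    return (\@ok, \@gone);
}

my @columns = ('1.3.6.1.2.1.25.2.3.1.5', '1.3.6.1.2.1.25.2.3.1.6');
my %result  = (
    '1.3.6.1.2.1.25.2.3.1.5.1'  => 1000,
    '1.3.6.1.2.1.25.2.3.1.6.1'  => 400,
    '1.3.6.1.2.1.25.2.3.1.5.36' => 'noSuchInstance',   # partition vanished
);

my ($ok, $gone) = split_by_availability([1, 36], \%result, @columns);
print "usable indexes: @$ok\n";
print "skipped indexes: @$gone\n";
```

Whether silently skipping a vanished partition is acceptable depends on the check; reporting the skipped indexes in the plugin output would keep the behavior visible.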

SteScho commented 4 years ago

Hi.

How often do you run into that situation? If it helps, it would be conceivable to add an option for the repetitions that is set to 1 by default. That fits my view that the check should be quick, but it helps you in your special situation. And of course it may be helpful to others, too.

On the other hand, this is a feature that is missing so far. The default of 1 does not change the check behavior, but the additional option can add value for cases like yours. Sounds good.

So feel free to create a PR, and I will merge it.
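One possible shape for such an option, as a sketch only. It assumes a new, hypothetical `--attempts` option that defaults to 1 (read here as "one attempt in total", which is one interpretation of the default discussed above), maps it to a `-retries` value of attempts minus one, and scales the global alarm to cover all attempts of a single request. The helper names are invented for this example, and the existing plugins may parse options differently.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Parse the timeout and the hypothetical --attempts option.
sub parse_options {
    my @args = @_;
    my %opt = (timeout => 5, attempts => 1);
    GetOptionsFromArray(
        \@args,
        't|timeout=i' => \$opt{timeout},
        'attempts=i'  => \$opt{attempts},
    ) or die "Invalid options\n";
    die "--attempts must be at least 1\n" if $opt{attempts} < 1;
    return %opt;
}

# Derive the SNMP retries value and a matching global alarm.
sub session_settings {
    my (%opt) = @_;
    return (
        retries => $opt{attempts} - 1,                 # retries after the first try
        alarm   => $opt{timeout} * $opt{attempts},     # room for every attempt
    );
}

my %opt = parse_options(@ARGV);
my %set = session_settings(%opt);
printf "SNMP retries: %d, global alarm: %d s\n", $set{retries}, $set{alarm};
```

With the defaults this gives 0 retries and an alarm equal to the timeout, so the check still makes a single attempt. Note that the alarm here covers only one request; a run that issues several requests could still hit it if more than one of them needs retries.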