atc0005 / go-nagios

Shared Golang package for Nagios plugins
MIT License
8 stars 3 forks source link

(*nagios.ExitState).ReturnCheckResults() unintentionally masks or "swallows" panics #49

Closed atc0005 closed 3 years ago

atc0005 commented 3 years ago

While fleshing out the structure on a new Nagios plugin, I made a mistake which would have normally blown up in my face pretty quickly. Instead, the app silently exited with a successful error code. That threw me for a loop.

After more time spent debugging than should have been needed, I figured out that ReturnCheckResults was being called and since no explicit error conditions were set, the app silently exited.

Relevant code bits:

https://github.com/atc0005/go-nagios/blob/44ceaf7f64a2342b8e046ee4283d652a1072b9a3/nagios.go#L133

At this particular point ServiceOutput wasn't set, nor was BrandingCallback or anything else that would cause this package to emit output.

I'm going to need to give this some thought, but the workaround I'm employing for the moment in the new plugin is to replace this:

func main() {

    // Set initial "state" as valid, adjust as we go.
    var nagiosExitState = nagios.ExitState{
        LastError:      nil,
        ExitStatusCode: nagios.StateOKExitCode,
    }

    // defer this from the start so it is the last deferred function to run
    defer nagiosExitState.ReturnCheckResults()

    // ...

with this (or some variation of it):

func main() {

    // Set initial "state" as valid, adjust as we go.
    var nagiosExitState = nagios.ExitState{
        LastError:      nil,
        ExitStatusCode: nagios.StateOKExitCode,
    }

    // defer this from the start so it is the last deferred function to run
    defer func() {
        // block ReturnCheckResults (v0.5.1 of pkg) from "swallowing"
        // potential panics
        if err := recover(); err != nil {
            panic(err)
        }
        nagiosExitState.ReturnCheckResults()
    }()

    // ...
atc0005 commented 3 years ago

One thing that comes to mind is adding this small bit:

if err := recover(); err != nil {
    panic(err)
}

at either the beginning of the method or near the end before os.Exit(es.ExitStatusCode). If the latter, this allows the Nagios plugin to emit specific error/status output that could be useful in diagnosing the problem along with the panic output, whereas if we only emit the panic output we won't see those details.

You could also argue that handling panics is an client-specific concern, but since the current version of this package advocates deferring ReturnCheckResults() first, we're (potentially) robbing users of the ability to more easily catch/handle potential panics.

So, we could recommend deferring ReturnCheckResults() second after deferring an anonymous function call to catch panics, but that overcomplicates the usage of this package. So, I'm left with the idea that there are two viable options:

atc0005 commented 3 years ago

One thing that comes to mind is adding this small bit:

if err := recover(); err != nil {
  panic(err)
}

at either the beginning of the method or near the end before os.Exit(es.ExitStatusCode). If the latter, this allows the Nagios plugin to emit specific error/status output that could be useful in diagnosing the problem along with the panic output, whereas if we only emit the panic output we won't see those details.

Regardless, I'll need to make sure that the return code is significant enough to raise attention and that the panic event is clearly noted. While the plugin developer may be able to figure out the panic cause easily enough, the sysadmin using the plugin isn't likely to be concerned with the internals. If something is causing a panic, they'll need to know about it so that it can be reported to the plugin developer.

This said, let's assume that the plugin using this package detects a valid problem and attempts to report that. As part of that report attempt a panic is triggered. What is the best way to report both problems?

atc0005 commented 3 years ago

One approach is to feather the panic as an additional "error" entry in that output block in the Long Service Output section. Another approach is to generate a new section and clearly mark the output as a problem that should be reported to the plugin author.