atc0005 / go-nagios

Shared Golang package for Nagios plugins
MIT License

Warning and Critical Threshold Handling #178

Closed: infraweavers closed this 1 year ago

infraweavers commented 1 year ago

Implementing a mechanism to alert based on warning and critical thresholds passed from the plugin, without having to reimplement the Nagios threshold format in every plugin.

Based on the implementation within the Monitoring::Plugin CPAN module (https://metacpan.org/pod/Monitoring::Plugin) and the documentation (https://www.monitoring-plugins.org/doc/guidelines.html#THRESHOLDFORMAT).

As discussed in: https://github.com/atc0005/go-nagios/issues/176

This is as we currently use it; we've inlined it into /internal for the time being.

A toy example of how it can be used, taken from the tests:

var plugin = Plugin{
    ExitStatusCode: StateOKExitCode,
}
plugin.ServiceOutput = "CHECK-NT-REPLACEMENT"

perfdata := PerformanceData{
    Label:             "perfdata label",
    Value:             "41.0",
    UnitOfMeasurement: "C",
    Warn:              "5:30", // warn when the value falls outside 5..30
    Crit:              "0:40", // go critical when the value falls outside 0..40
}
plugin.AddPerfData(false, perfdata)
plugin.EvaluateThreshold(perfdata)

So if the temperature is outside of the acceptable range, it'll go warning or critical as expected; here the value 41.0 falls outside both the warning range 5:30 and the critical range 0:40, so the plugin exits critical.
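For anyone skimming, the threshold specs above follow the range format from the guidelines linked earlier. A minimal sketch of the semantics (my own illustration, not the parser in this PR; the "@" inside-range and "~" negative-infinity modifiers are omitted for brevity):

package main

import (
    "fmt"
    "math"
    "strconv"
    "strings"
)

// alertOutsideRange reports whether value falls outside the threshold
// range spec, which per the guidelines means an alert should fire.
// Covers the common forms "max" (0:max), "min:" (min:infinity) and
// "min:max".
func alertOutsideRange(spec string, value float64) (bool, error) {
    lo, hi := 0.0, math.Inf(1)
    var err error
    switch {
    case !strings.Contains(spec, ":"):
        // "10" means alert if value < 0 or value > 10.
        if hi, err = strconv.ParseFloat(spec, 64); err != nil {
            return false, err
        }
    case strings.HasSuffix(spec, ":"):
        // "10:" means alert if value < 10.
        if lo, err = strconv.ParseFloat(strings.TrimSuffix(spec, ":"), 64); err != nil {
            return false, err
        }
    default:
        // "5:30" means alert if value < 5 or value > 30.
        parts := strings.SplitN(spec, ":", 2)
        if lo, err = strconv.ParseFloat(parts[0], 64); err != nil {
            return false, err
        }
        if hi, err = strconv.ParseFloat(parts[1], 64); err != nil {
            return false, err
        }
    }
    return value < lo || value > hi, nil
}

func main() {
    warn, _ := alertOutsideRange("5:30", 41.0)
    crit, _ := alertOutsideRange("0:40", 41.0)
    fmt.Println(warn, crit) // true true: 41.0 breaches both ranges, so critical wins
}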

atc0005 commented 1 year ago

@infraweavers Thanks for taking the time to submit this PR.

I've got some other items on my plate, but I hope to dig into this before long.

atc0005 commented 1 year ago

Oops, sorry!

@infraweavers I made some CI adjustments to fix some false assumptions on my part (regarding when some workflows run, requirements, etc.); a workflow that looks for needed go mod tidy and go mod vendor operations wasn't triggering. I added those changes in #179.

I had some trouble getting some other jobs to trigger; to troubleshoot, I accepted the option the GitHub PR UI gave me to rebase using the latest changes from this repo. I didn't think that through, and it looks like the action modified the perfdata-alerting branch in your fork associated with your PR (this access was granted by default as part of submitting the PR).

If we merge this PR, we can have you rebase/force-push from your end so that the PR branch is as you originally created it (so that I'm not incorrectly credited for your work).

CI failures

This all started when I was trying to troubleshoot errors like these from golangci-lint:

example_constants_test.go:28:10: undeclared name: `nagios` (typecheck)
    os.Exit(nagios.StateOKExitCode)
            ^
example_constants_test.go:34:3: undeclared name: `nagios` (typecheck)
        nagios.StateOKLabel,
        ^
example_constants_test.go:35:3: undeclared name: `nagios` (typecheck)
        nagios.CheckOutputEOL,
        ^

It looks like the inclusion of the stretchr/testify package in the added tests tripped up some of the CI checks (which assume that all dependencies are vendored).

Running go mod tidy && go mod vendor should sort that issue and allow the rest of the CI checks to run. I can perform this step and push to your fork's branch if you like. We'd still have you rebase/force-push to your branch later to resolve incorrect attribution.

Thinking out loud

Some observations and some "thinking out loud" comments below. Please correct my thinking on any points I've missed or gotten wrong. Since this approach (setting exit state based on perfdata values/threshold range evaluation) is new to me, I expect I'm not fully seeing all of the pieces yet.

I'm not really familiar/comfortable with the stretchr/testify package, but I think I understand what the tests are doing.

Cherry-picking an arbitrary test from the collection:

    t.Run("Plugin should return exit code OK when value is within accetptable range", func(t *testing.T) {
        var plugin = Plugin{
            ExitStatusCode: StateOKExitCode,
        }
        plugin.ServiceOutput = "CHECK-NT-REPLACEMENT"

        perfdata := PerformanceData{
            Label:             "perfdata label",
            Value:             "18.0",
            UnitOfMeasurement: "C",
            Warn:              "5:30",
            Crit:              "0:40",
        }
        plugin.AddPerfData(false, perfdata)
        plugin.EvaluateThreshold(perfdata)

        assert.Equal(t, StateOKExitCode, plugin.ExitStatusCode)
    })

I see that this is an internal test (it assumes access to all values in the package). Not a problem, just an observation. I have the tests currently broken out into "unexported" and "exported" test files, which I've only recently learned doesn't follow the convention of using *_internal_test.go and *_test.go naming patterns. At some point I'll likely rename the unexported_test.go file to nagios_internal_test.go, and the exported_test.go file to nagios_test.go. Still thinking that through.

If I go that direction, the new tests would likely be in the nagios_test.go file, which would import the nagios package and prefix package "resources" with nagios. Also not a problem.
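To illustrate, the cherry-picked test above would then read something like this in the external test file (a sketch only, showing the package split and the nagios. prefixing):

// nagios_test.go: external tests live in the nagios_test package and
// exercise only the exported API of the nagios package.
package nagios_test

import (
    "testing"

    "github.com/atc0005/go-nagios"
    "github.com/stretchr/testify/assert"
)

func TestPluginOKWhenValueWithinAcceptableRange(t *testing.T) {
    plugin := nagios.Plugin{
        ExitStatusCode: nagios.StateOKExitCode,
    }
    plugin.ServiceOutput = "CHECK-NT-REPLACEMENT"

    perfdata := nagios.PerformanceData{
        Label:             "perfdata label",
        Value:             "18.0",
        UnitOfMeasurement: "C",
        Warn:              "5:30",
        Crit:              "0:40",
    }
    plugin.AddPerfData(false, perfdata)
    plugin.EvaluateThreshold(perfdata)

    assert.Equal(t, nagios.StateOKExitCode, plugin.ExitStatusCode)
}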

When I look at the (*Plugin).EvaluateThreshold function, I see that it can return an error but never does; the function skips setting up a pointer to a Range value (and eventually modifying the ExitStatusCode field) unless a value is provided for either the Crit or Warn field of the PerformanceData value. I've not given enough thought to whether an error should be returned.

Here:

        plugin.AddPerfData(false, perfdata)
        plugin.EvaluateThreshold(perfdata)

Both method calls operate on the same perfdata value.

The first line attempts to add PerformanceData, but without error checking. Not saying it's strictly needed, just observing. If there were errors in the provided value, and the current validation were sufficient to catch them (admittedly I only stubbed it out), then the EvaluateThreshold method could (presumably) safely operate on the Plugin.perfData field instead of the same input perfdata value, which at this point hasn't been validated and could have basic validation problems. This might all just be to keep the test case easier to follow, but I wanted to note it for discussion.
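For discussion's sake, a sketch of what the error-checked version of those two lines might look like in a test, assuming both methods keep their error returns (whether EvaluateThreshold should then read from the validated Plugin.perfData field rather than the raw input is the open question above):

// Check the error returns instead of discarding them.
if err := plugin.AddPerfData(false, perfdata); err != nil {
    t.Fatalf("failed to add perfdata: %v", err)
}
if err := plugin.EvaluateThreshold(perfdata); err != nil {
    t.Fatalf("failed to evaluate thresholds: %v", err)
}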

Overall, while having support for plugin threshold range logic would be very beneficial, I'm not familiar with it myself. Having a solid test collection (which you've worked to provide) and some production use of that functionality that we can reference will help provide confidence that it is working as intended.

Setting exit state

@infraweavers: Implementing a mechanism to alert based on warning and critical thresholds passed from the plugin without having to reimplement the nagios threshold format in every plugin.

This makes a lot of sense to me and it's something I wanted to tackle at some point.

I didn't consider using a PerformanceData value as a container of sorts to hold given threshold values; most of my plugins were written before I added support for performance data metrics, so I opted for a collections-based approach (for plugins which collect values for bulk review):

// ServiceStater represents a type that is capable of evaluating its overall
// state.
type ServiceStater interface {
    IsCriticalState() bool
    IsWarningState() bool
    IsOKState() bool
}

// ServiceState accepts a type capable of evaluating its status and uses those
// results to map to a compatible ServiceState value.
func ServiceState(val ServiceStater) nagios.ServiceState {
    var stateLabel string
    var stateExitCode int

    switch {
    case val.IsCriticalState():
        stateLabel = nagios.StateCRITICALLabel
        stateExitCode = nagios.StateCRITICALExitCode
    case val.IsWarningState():
        stateLabel = nagios.StateWARNINGLabel
        stateExitCode = nagios.StateWARNINGExitCode
    case val.IsOKState():
        stateLabel = nagios.StateOKLabel
        stateExitCode = nagios.StateOKExitCode
    default:
        stateLabel = nagios.StateUNKNOWNLabel
        stateExitCode = nagios.StateUNKNOWNExitCode
    }

    return nagios.ServiceState{
        Label:    stateLabel,
        ExitCode: stateExitCode,
    }
}

Not sure if there is an opportunity to combine the two ideas, or if the approach I used is too specific to bulk collections of things. Noting this here in case it sparks any ideas.
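For example (a purely hypothetical type, just to make the interface concrete), a temperature reading with warning/critical ceilings could satisfy ServiceStater and reuse the same mapping:

// tempReading is a hypothetical value-plus-thresholds type; it
// satisfies the ServiceStater interface defined above.
type tempReading struct {
    value   float64
    warnMax float64
    critMax float64
}

func (tr tempReading) IsCriticalState() bool { return tr.value > tr.critMax }

func (tr tempReading) IsWarningState() bool {
    return !tr.IsCriticalState() && tr.value > tr.warnMax
}

func (tr tempReading) IsOKState() bool { return tr.value <= tr.warnMax }

Mapping it then gives the expected result:

state := ServiceState(tempReading{value: 41.0, warnMax: 30, critMax: 40})
// state.ExitCode == nagios.StateCRITICALExitCode, state.Label == "CRITICAL"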

infraweavers commented 1 year ago

As far as I can see, our branch on the fork is as it was so I don't think we have any trouble there?

Do you need/want us to do anything to fix anything specific or whatever?

We're not really bothered about attribution in terms of the git commits etc.; if you want to copy and paste the code, kick it about, make it more suitable, and commit it as yourself, that's fine with us.

atc0005 commented 1 year ago

@infraweavers

Thanks for the follow-up.

As far as I can see, our branch on the fork is as it was so I don't think we have any trouble there?

The code should be the same; it's just the attribution that changed, due to my rebasing the branch (without realizing what I was doing).

Do you need/want us to do anything to fix anything specific or whatever?

We're not really bothered about attribution in terms of the git commits etc.; if you want to copy and paste the code, kick it about, make it more suitable, and commit it as yourself, that's fine with us.

To unstick CI, one of us can run go mod tidy && go mod vendor and either rebase or push a new commit to the PR branch.

I currently follow the practice of vendoring all dependencies for projects that I maintain, so the shared CI workflows that I use expect to find vendored copies of dependencies. I can do that here and push to your branch, if you want, to allow CI to run as expected and surface any linting issues that may be present.

infraweavers commented 1 year ago

To unstick CI, one of us can run go mod tidy && go mod vendor and either rebase or push a new commit to the PR branch.

Done and pushed into the same branch.

atc0005 commented 1 year ago

@infraweavers Thanks.

Please let me know if you want to take a crack at the CI linting results.

I can work in linting fixes as a follow-up PR to this one if you need to focus on other things.

infraweavers commented 1 year ago

Probably got as much of the tidying up sorted as I can easily do.

atc0005 commented 1 year ago

@infraweavers: Probably got as much of the tidying up sorted as I can easily do.

Thanks.

I hope to get some time in the next few days to look at this again.

When I do, would you rather I:

- push the follow-up changes directly to your PR branch, or
- prepare the changes and (once ready) apply them immediately after merging this PR?

infraweavers commented 1 year ago

I think "prepare the changes and (once ready) apply them immediately after merging this PR" probably makes the most sense? seems like it's the easiest option

atc0005 commented 1 year ago

@infraweavers: I think "prepare the changes and (once ready) apply them immediately after merging this PR" probably makes the most sense? Seems like it's the easiest option.

Fair enough. Just wanted to make sure that you didn't have any further changes you'd like to make.

As part of preparing the follow-up PR I'm working on the copyright header to include. I see that you have Codeweavers Ltd in the https://github.com/infraweavers/monitoring-agent/blob/master/LICENSE file. Would you like that same attribution here or should I use the Author value from the Git commits on this branch?

Are there any other attributions that I should reference? I see that you listed https://metacpan.org/dist/Monitoring-Plugin as an inspiration. I don't have experience adapting GPL or Artistic Licensed code for use with an MIT license; if you only looked at the existing code for inspiration, I don't know enough to determine whether that is sufficient to trigger clauses/requirements of the GPL or Artistic License, so I'm just making sure.

Once you set up an import of this library, will you please let me know? I'd like to list any dependent projects in the README file as a reference for anyone interested in seeing real-world use cases. I'm not sure when I would use the new functionality, so linking to one of your projects would be a useful reference. I may also base a new example test on that reference.

Thanks for all of your work on this.

infraweavers commented 1 year ago

Fair enough. Just wanted to make sure that you didn't have any further changes you'd like to make.

All done I think at this point.

As part of preparing the follow-up PR I'm working on the copyright header to include. I see that you have Codeweavers Ltd in the https://github.com/infraweavers/monitoring-agent/blob/master/LICENSE file. Would you like that same attribution here or should I use the Author value from the Git commits on this branch?

Yeah, Codeweavers Ltd is fine.

Are there any other attributions that I should reference? I see that you listed https://metacpan.org/dist/Monitoring-Plugin as an inspiration. I don't have experience adapting GPL or Artistic Licensed code for use with an MIT license; if you only looked at the existing code for inspiration, I don't know enough to determine whether that is sufficient to trigger clauses/requirements of the GPL or Artistic License, so I'm just making sure.

We looked at it for the mechanism; however, as the language is different (Go vs. Perl), it's not the same code because it can't be.

Once you set up an import of this library, will you please let me know? I'd like to list any dependent projects in the README file as a reference for anyone interested in seeing real-world use cases. I'm not sure when I would use the new functionality, so linking to one of your projects would be a useful reference. I may also base a new example test on that reference.

Yeah, can do; it'll be: https://github.com/infraweavers/monitoring-agent-check-nt-replacement

atc0005 commented 1 year ago

Are there any other attributions that I should reference? I see that you listed https://metacpan.org/dist/Monitoring-Plugin as an inspiration. I don't have experience adapting GPL or Artistic Licensed code for use with an MIT license; if you only looked at the existing code for inspiration, I don't know enough to determine whether that is sufficient to trigger clauses/requirements of the GPL or Artistic License, so I'm just making sure.

We looked at it for the mechanism; however, as the language is different (Go vs. Perl), it's not the same code because it can't be.

Understood. Licensing is a weak area for me, so just wanted to be sure. Thanks for checking me on that.

Once you set up an import of this library, will you please let me know? I'd like to list any dependent projects in the README file as a reference for anyone interested in seeing real-world use cases. I'm not sure when I would use the new functionality, so linking to one of your projects would be a useful reference. I may also base a new example test on that reference.

Yeah, can do; it'll be: https://github.com/infraweavers/monitoring-agent-check-nt-replacement

Thanks, I'll include that.