chaos / powerman

cluster power control
GNU General Public License v2.0
43 stars 19 forks source link

powerman: support new setresult directive #168

Closed chu11 closed 7 months ago

chu11 commented 7 months ago

Problem: In many cases, there is no way for a power operation (i.e. power on, but not power status) to inform powerman that an error has occurred in the operation. The user will always get a "Command completed successfully" output and exit status 0 after issuing a power operation.

Solution: Support a new "setresult" statement that can inform powerman that a power operation did not succeed. A regex can be used to determine what output is expected of a successful power operation. If any are not successful, powerman can subsequently inform the user an error has occurred, leading to a "Command completed with errors" message and exit status 1.

Some example uses:

script on_all {
    send "on\n"
    foreachnode {
        expect "([^\n:]+): ([^\n]+\n)"
        setresult $1 $2 success="^ok\n"
    }
    expect "redfishpower> "
}
script on {
    send "on %s\n"
    expect "([^\n:]+): ([^\n]+\n)"
    setresult $1 $2 success="^ok\n"
    expect "redfishpower> "
}

Side note, PR #167 and this are part of the general "cray EX" PR series, but they can be reviewed independently of each other)

chu11 commented 7 months ago

This one is not a redfish thing so it might actually be nice to see if we could use the more generic vpcd instead of redfsihpower for the testing. Maybe just use this feature in the vpcd.dev and add in a way to simulate failure with a command line option.

Yeah, I did look at vcpd and didn't fully grok its output format, so just defaulted to redfishpower. Lemme not be so lazy and look to put it into vcpd.

Edit: actually we'd need some vpc dev files to cover singlet, range, and all variants.

garlick commented 7 months ago

Edit: actually we'd need some vpc dev files to cover singlet, range, and all variants.

Any reason not to add that to the main vpc.dev?

chu11 commented 7 months ago

Any reason not to add that to the main vpc.dev?

i don't think powerman can cover all 3. skimming code real quick, if "ranged" is available, singlet will never be covered.

garlick commented 7 months ago

Tell me if I'm off base but could we change the powerman logic so if a device script defines singlet and ranged, singlet is preferred when there is one plug? If it would avoid needing to copy and slightly change device scripts in test, it might be a useful change?

chu11 commented 7 months ago

now that you mention, it that would be a minor and probably useful change.

chu11 commented 7 months ago

Tell me if I'm off base but could we change the powerman logic so if a device script defines singlet and ranged, singlet is preferred when there is one plug? If it would avoid needing to copy and slightly change device scripts in test, it might be a useful change?

Can you think of a super obvious way this can be tested? Because we go off of the power control output, there's no obvious way to discern if "all" vs "ranged" were used (or if 1 node was specified, "singlet" vs "ranged").

Edit: off the top of my head, all I could think of doing is have the "on_all" vs "on_ranged" vs "on" device scripts call different commands. But that seems like a lot of work for that testing.

garlick commented 7 months ago

Maybe have vpc dump a count of the number of times each script was called to stderr and capture it in sharness?

chu11 commented 7 months ago

re-pushed, this is now on top of PR #170

redid all of the tests under vpc, and found a mem-leak to boot.

@garlick could you skim the test changes? Since they are quite different / new.

garlick commented 7 months ago

This turned out really nice! I added a couple of comments.