Closed chu11 closed 8 months ago
addendum: pm --on cmm2,blade[16-23],nodes[33-47] might be the more common way this is done, just putting the chassis and all its children on the same line. An option like --recursive or --parents-first may not be necessary / commonly expected.
edit: potential implementation notes post #81: if an ancestor of a child is off or powering on and command == on/cycle, check if an on/cycle is currently active on that ancestor. After the on completes, process waiters again.
additional note / question, how to deal with timeouts. The normal timeout assumes uhhh "1 round trip" of power control ... need to just increase it for large hierarchies? But then real errors will have a long timeout ... ehhh
Should this issue be closed and any discussion of how to implement #81 be moved to #81? Or am I misunderstanding what this is about?
I considered the question of "recursive" power on / off different than #81, as that issue I think only pertained to checking if the power operation is possible given the power status of the parents.
Sorry, you said
An option like --recursive or --parents-first may not be necessary / commonly expected.
so what are we talking about here? Is this about making powerman aware of hierarchies after we get redfishpower handling a cray chassis correctly?
sorry, my wording was bad.
issue 1) do not allow power control operations to nodes that cannot be power controlled due to parents/ancestors being off. Due to lack of "hierarchy" awareness, power control within these hierarchies is very difficult. This is what I feel #81 is about.
issue 2) should redfishpower "handle parent dependencies" when powering on targets? i.e. with pm --on cmm,blades[0-7],Node[0-15], should redfishpower know to power on cmm first, blades second, and nodes third? Or should (in this example) users get errors for anything that is presently off when the command was issued? If behavior is dependent on a command line option, that info needs to be passed from powerman client to redfishpower.
i feel this issue is about issue 2
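To make the two behaviors in issue 2 concrete, here is a rough sketch. It assumes a hypothetical child -> parent map (PARENT) and plain "on"/"off" status strings; none of these names exist in powerman or redfishpower. on_order() is the "cmm first, blades second, nodes third" sequencing, and blocked() is the "just report errors" alternative.

```python
# Hypothetical hierarchy: child -> parent (None means no parent).
PARENT = {"cmm": None, "blade0": "cmm", "node0": "blade0", "node1": "blade0"}

def ancestors(target, parent=PARENT):
    """Return the ancestors of target, nearest first."""
    out = []
    p = parent[target]
    while p is not None:
        out.append(p)
        p = parent[p]
    return out

def on_order(targets, parent=PARENT):
    """Sort targets so every parent comes before its children (by depth)."""
    return sorted(targets, key=lambda t: (len(ancestors(t, parent)), t))

def blocked(targets, status, parent=PARENT):
    """Targets that would get an error right now: some ancestor is off."""
    return sorted(
        t for t in targets
        if any(status[a] == "off" for a in ancestors(t, parent))
    )
```

With everything off, on_order() puts cmm ahead of blade0 and the nodes, while blocked() flags blade0 and the nodes as not controllable yet.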
My bad, I am forgetting stuff that we already discussed.
Maybe we should refocus this issue description and title on redfishpower then. We don't have a plan on the table for making powerman hierarchy aware, only redfishpower for now. (I mean we can have an aspirational issue for powerman but I don't see that as on the critical path right now).
On the topic of 2) should redfishpower handle parent dependencies, yes? However, maybe it makes sense to implement this in stages where we get errors now, and power sequencing later? Then we just make the powerman timeout really big and let redfishpower manage the individual timeouts?
I pinged the admins about it, it's definitely a lower priority as the common case will actually be pm --on cmm, pm --on blades, pm --on nodes
The main things I think about it when supporting it:
the major cons for not doing it
to me, the timeout is the big deal, b/c on occasion we would expect something bad to happen, and now the powerman timeout could be like 5 minutes or something.
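One way to picture the "big powerman timeout, individual timeouts inside redfishpower" idea is the toy simulation below. It is only a sketch: the level lists, the durations and the 60 second per-operation timeout are made-up numbers, and the function names are hypothetical. The point is that the overall budget grows with the number of levels, but a real failure still surfaces after one per-operation timeout instead of after the whole budget.

```python
def overall_budget(levels, per_op_timeout):
    """Worst-case total time: one per-operation timeout per hierarchy level."""
    return len(levels) * per_op_timeout

def run_levels(levels, per_op_timeout, duration):
    """Simulate powering on level by level.

    levels:   list of lists of targets, parents first
    duration: target -> seconds its power operation takes (simulated)
    Returns (elapsed_seconds, failed_targets).
    """
    elapsed = 0
    for level in levels:
        failed = sorted(t for t in level if duration[t] > per_op_timeout)
        if failed:
            # Give up on this level after one per-operation timeout.
            return elapsed + per_op_timeout, failed
        # Targets within a level run in parallel, so the slowest one dominates.
        elapsed += max(duration[t] for t in level)
    return elapsed, []
```

For three levels and a 60 second per-operation timeout the overall budget is 180 seconds, yet a blade that never responds is reported as soon as its own 60 seconds run out.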
Being semi-smart about it is a good start! I'm fine if we do the simple things first and build some confidence that we're improving life for the admins.
Began pondering this more and began having concerns of raciness of behavior, which I think is not a good experience for users.
Given some assumptions/caveats, it may not be as hard to do as I originally thought.
on case
pm --on cmm,blade0,nodes[0-1]
blade0 and nodes[0-1] can't turn on unless cmm is turned on first. I don't think it is too hard to add support for this, as it is similar to base support I already have: instead of waiting for a stat to return on, wait for on to return.
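A minimal sketch of the "wait for on to return" idea, as a pure simulation. The PARENT map and the function name are hypothetical, and a real implementation would be event driven rather than looping like this. Each target waits while its parent is off but part of the same request; once the parent's on completes, the waiters are processed again.

```python
# Hypothetical hierarchy: child -> parent (None means no parent).
PARENT = {"cmm": None, "blade0": "cmm", "node0": "blade0", "node1": "blade0"}

def power_on_with_waiters(targets, status, parent=PARENT):
    """Simulate 'on' where children wait for a parent's 'on' to complete.

    status is updated in place. Returns (done, failed), where done is the
    order in which 'on' completed.
    """
    requested = set(targets)
    waiting = list(targets)
    done, failed = [], []
    while waiting:
        still_waiting = []
        for t in waiting:
            p = parent[t]
            if p is None or status[p] == "on":
                status[t] = "on"          # pretend the 'on' completed
                done.append(t)
            elif p in requested and p not in failed:
                still_waiting.append(t)   # parent 'on' is pending, keep waiting
            else:
                failed.append(t)          # parent is off and nobody is turning it on
        waiting = still_waiting
    return done, failed
```

Requesting only node0 while everything is off fails right away, which is the #81 style error; requesting the whole chain completes parents before children regardless of the order on the command line.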
off case
pm --off cmm,blade0,nodes[0-1]
this case is trickier. under my prototype blade0 and nodes[0-1] will do a status check on cmm before doing their off. But since cmm doesn't have a parent dependency, the "off" can happen right away. The status check of the children races against the parent "off" (note that even if the "off" is not quite done yet, children can get a status of "unknown" ...).
An easy solution might be to simply power off only the parent, and assume all children are automatically off as a result.
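The "only power off the parent" idea could look roughly like this. It is a sketch with a hypothetical PARENT map and function name, and it bakes in the assumption from above that children go off when an ancestor goes off. Any target with an ancestor in the same request is dropped, so no child status check can race against the parent's off.

```python
# Hypothetical hierarchy: child -> parent (None means no parent).
PARENT = {"cmm": None, "blade0": "cmm", "node0": "blade0", "node1": "blade0"}

def prune_off_targets(targets, parent=PARENT):
    """Keep only targets that have no ancestor in the same request."""
    requested = set(targets)
    keep = []
    for t in targets:
        p = parent[t]
        while p is not None and p not in requested:
            p = parent[p]
        if p is None:  # walked to the top without meeting a requested ancestor
            keep.append(t)
    return keep
```

So pm --off cmm,blade0,nodes[0-1] would reduce to an off of cmm only, while an off of just the nodes is left alone.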
cycle case
pm --cycle cmm,blade0,nodes[0-1]
if cycle is implemented as off followed by on, then rules above apply and all should be good, discounting the need for a super big timeout.
But if cycle is implemented natively, then uhhh i dunno. All status checks will be racy b/c no way to know if "on" is before or after the cycle, and children could get "unknown" status for awhile too.
Perhaps cycle is required to be off followed by on when parents are involved.
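If cycle is forced to be off followed by on, the plan for a request could be built like this. Sketch only: PARENT, ancestors() and cycle_plan() are hypothetical names, and it relies on the assumption above that powering off a parent takes its children down too. The off phase touches only the topmost requested targets, and the on phase goes parents first.

```python
# Hypothetical hierarchy: child -> parent (None means no parent).
PARENT = {"cmm": None, "blade0": "cmm", "node0": "blade0", "node1": "blade0"}

def ancestors(target, parent=PARENT):
    """Return the ancestors of target, nearest first."""
    out, p = [], parent[target]
    while p is not None:
        out.append(p)
        p = parent[p]
    return out

def cycle_plan(targets, parent=PARENT):
    """Return a list of (operation, target) steps: offs first, then ons."""
    requested = set(targets)
    # Off phase: only targets with no requested ancestor.
    tops = [t for t in targets if not requested.intersection(ancestors(t, parent))]
    # On phase: every requested target, ordered by depth so parents come first.
    ordered = sorted(targets, key=lambda t: (len(ancestors(t, parent)), t))
    return [("off", t) for t in tops] + [("on", t) for t in ordered]
```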
i'm going to close ... as I get further on this, i think we can collapse this into #81
A follow on to #81 would be to allow users to "recursively" power on everything a parent "owns" or to handle "parents" powering on as well. Hypothetically
pm --on --recursive chassis0
would power on the chassis first, then power on the blades/stuffs beneath it, and additional layers below that if it were necessary.
pm --on --parents node87
would turn on any parents necessary to turn on node87.
Admins say not the highest priority, as they can script recursive / parent stuff. Detecting / reporting errors of parents is what is more critical for the scripting.
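For illustration, the two hypothetical options amount to two different target expansions. This is a sketch: --recursive and --parents are not existing pm options, and the PARENT map (including which blade node87 sits under) is invented for the example.

```python
from collections import deque

# Hypothetical hierarchy: child -> parent (None means no parent).
PARENT = {
    "chassis0": None,
    "blade0": "chassis0", "blade1": "chassis0",
    "node86": "blade0", "node87": "blade1",
}

def expand_recursive(target, parent=PARENT):
    """target followed by everything beneath it, parents before children."""
    order, queue = [], deque([target])
    while queue:
        t = queue.popleft()
        order.append(t)
        queue.extend(sorted(c for c, p in parent.items() if p == t))
    return order

def expand_parents(target, parent=PARENT):
    """All ancestors of target, topmost first, then target itself."""
    chain = [target]
    while parent[chain[-1]] is not None:
        chain.append(parent[chain[-1]])
    return chain[::-1]
```

Either list could then be powered on in order, which is also roughly what the admins would script by hand.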