Spec for obmd integration

zenhack commented 6 years ago

This is a proposal for the HIL side of the work being discussed in #417. @okrieg, thanks for the input re: resource management.

//cc @naved001, @Izhmash

zenhack commented 6 years ago

Quoting Naved Ansari (2017-11-29 13:39:23)

1. If we have only one obmd server. Then would admins register the
   node's ipmi details directly using obmd, and then in HIL they
   register only the obmd node label? I am assuming HIL would know
   from the config file wherever the obmd is running?

Yeah, this is exactly what I intended. I tweaked the wording to make it more explicit.

2. Just spitballing here: if we want to support multiple obmd servers,
   then maybe we could think of obmd servers as switches in HIL. We
   register an obmd server with its address and admin token. Nodes'
   ipmi details will be fed into the obmd and then our nodes become
   "ports" on the obmd server.

This might not be a bad idea. My initial plan was to just have the full URL to a node & the admin token be part of node_register, so HIL isn't even really aware of whether two nodes are managed by the same obmd. This is nice in some regards, because there's less coupling, but you're also duplicating information.
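To make the trade-off concrete, here is a minimal sketch of the "full URL plus admin token per node" option. The field names `obmd_uri` and `obmd_admin_token`, the class `NodeObmInfo`, and the helper `group_by_obmd` are hypothetical, chosen only for illustration; nothing here is HIL's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class NodeObmInfo:
    """Per-node obmd details, as they might be passed to node_register.

    Both field names are assumptions made for this sketch.
    """
    obmd_uri: str          # full URL of the node on some obmd server
    obmd_admin_token: str  # admin token for that obmd server


def group_by_obmd(nodes):
    """Group node labels by the host part of their obmd URL.

    HIL itself would not need this; it only shows that the per-node
    scheme stores the same server address (and token) once per node.
    """
    groups = defaultdict(list)
    for label, info in nodes.items():
        groups[urlsplit(info.obmd_uri).netloc].append(label)
    return dict(groups)
```

With this layout, two nodes behind the same obmd each carry their own copy of the server address and token, which is the duplication mentioned above; the "obmd as switch" idea would instead store those once per server.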

3. How will end users interact with obmd? Will we be changing all of
   HIL's ipmi related APIs (power cycle, power off etc)?

Per the summary: "All other obm related calls will be changed to simply proxy the corresponding calls on obmd."

So e.g. power_cycle would just check the authorization requirements and then make a power_cycle call to obmd. Note that except for the console stuff, obmd's api is a copy of HIL's, so except for the bit about obms possibly being disabled, only the console API changes -- from the user's perspective, the rest of the API is the same.
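A rough sketch of that proxying, assuming a node record with hypothetical keys `project`, `obmd_uri`, and `obmd_token`, and assuming the obmd operation lives at `<obmd_uri>/<op>` (the real URL layout may differ). The HTTP call is injected as `send` so the sketch runs without a network.

```python
from urllib.parse import quote


class AuthorizationError(Exception):
    """The caller's project does not own the node."""


class ObmDisabledError(Exception):
    """The node's obm has not been enabled, so there is no token to use."""


def proxy_obm_call(node, project, op, send):
    """Check authorization in HIL, then forward `op` to obmd.

    `node` is a dict with the hypothetical keys described above.
    `send` is any callable taking (method, url, token) that performs
    the request and returns its result.
    """
    if node["project"] != project:
        raise AuthorizationError("node is not owned by project " + project)
    if node["obmd_token"] is None:
        raise ObmDisabledError("obm is disabled for this node")
    # Assumed URL layout: the operation name is appended to the node URL.
    url = node["obmd_uri"].rstrip("/") + "/" + quote(op, safe="")
    return send("POST", url, node["obmd_token"])
```

From the user's side, `power_cycle` keeps its current shape; only the body of the HIL call changes from driving ipmitool to forwarding the request.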

4. Database: Will the obmd have its own database? or will it talk to
   the same database that HIL uses (different table perhaps?)

It has its own database. Right now it's hard-coded as SQLite, but it would be trivial to add e.g. postgres support, which probably makes sense. In that case, we could still use the same database server, just using a different database.

If you have thoughts on how to clarify the descriptions, let me know.

zenhack commented 6 years ago

@naved001, tangentially, could you test the refactor branch of obmd against an actual server? I think it should solve the race conditions we were seeing before.

zenhack commented 6 years ago

Ack, hold off on that, unrelated bug that needs working out.

zenhack commented 6 years ago

@naved001, ok, bug fixed and the stuff has been pushed to master. Please test.

naved001 commented 6 years ago

cool, will test it tomorrow.

zenhack commented 6 years ago

Quoting Ian Ballou (2017-12-01 14:56:34)

Why are you replacing the token during enable_obm rather than telling the user that the obm is already enabled? Would there be a case when a token could be invalidated and need replacing?

obmd doesn't persist tokens to disk, so if it crashes they're gone. More generally, having it be idempotent means a user doesn't really need to keep track of what state it's in, they can just call enable/disable if they're not sure.
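The idempotent behaviour can be illustrated with a toy in-memory model. This is a sketch of the semantics only, not obmd's or HIL's code; the class `TokenStore` and its method names are made up for the example.

```python
import secrets


class TokenStore:
    """Toy model of per-node tokens that live only in memory."""

    def __init__(self):
        # Nothing is written to disk, so a restart starts out empty.
        self._tokens = {}

    def enable_obm(self, node):
        """Issue a fresh token, replacing any existing one for `node`."""
        token = secrets.token_hex(16)
        self._tokens[node] = token
        return token

    def disable_obm(self, node):
        """Drop the node's token; a no-op if it is already disabled."""
        self._tokens.pop(node, None)

    def is_valid(self, node, token):
        """Return True only for the node's current token."""
        current = self._tokens.get(node)
        return current is not None and secrets.compare_digest(current, token)
```

Calling `enable_obm` twice is safe: the second call just invalidates the first token. A caller that has lost track of the state, for instance after a crash wiped the tokens, can call enable or disable without first querying anything.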

naved001 commented 6 years ago

> ok, bug fixed and the stuff has been pushed to master. Please test.

I've been busy setting up some infra stuff; will test it soon and will leave a more careful review on this one.

zenhack commented 6 years ago

Ok, thanks for keeping me in the loop.

naved001 commented 6 years ago

@zenhack What I described earlier is still there.

Here's something that might be helpful about SOL. I have tested this on Cisco nodes in the Kaizen cluster.

1. To cleanly terminate an SOL session, send it the terminate sequence described here, which is ~. by default. The tilde can be changed to another character (since ~. can also kill an ssh session) with ipmitool's -e option when activating SOL: after ipmitool -e ! sol activate, typing !. will kill the SOL session.
2. Since users can't send keystrokes to the console, could the code send this sequence itself when a user hits ^C? That would cleanly kill the session, and we wouldn't have dangling SOL sessions causing trouble.
3. I didn't test the other calls (power_off, cycle); those are commented out in your code. Any reason?
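As a sketch of the SOL suggestion, the helpers below build an ipmitool command line with a non-default escape character and write the matching terminate sequence to the session's input. The function names are hypothetical. The flags used (`-I lanplus`, `-H`, `-U`, `-E` to read the password from `IPMI_PASSWORD`, and the global `-e` escape option) are from ipmitool's man page. Whether ipmitool accepts the sequence over a pipe rather than a tty is an assumption that would need testing on real hardware.

```python
def sol_activate_argv(host, user, escape_char="!"):
    """Return the argv for an SOL session with a custom escape character."""
    if len(escape_char) != 1:
        raise ValueError("escape_char must be a single character")
    return [
        "ipmitool", "-I", "lanplus",
        "-H", host, "-U", user,
        "-E",               # password comes from the IPMI_PASSWORD env var
        "-e", escape_char,  # replaces the default '~' escape character
        "sol", "activate",
    ]


def terminate_sequence(escape_char="!"):
    """Bytes that ask ipmitool to end the SOL session (escape char + '.')."""
    return (escape_char + ".").encode("ascii")


def terminate_sol(session_stdin, escape_char="!"):
    """Write the terminate sequence to the session's stdin-like object."""
    session_stdin.write(terminate_sequence(escape_char))
    session_stdin.flush()
```

A console handler could call `terminate_sol` from its cleanup path (for example when the client disconnects or hits ^C) instead of issuing a separate deactivate call.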

Also, @knikolla thinks it would be worth meeting in person to discuss the OBM daemon and the fact that a core HIL component is being written in Go. Do you think you could send out a calendar invite to @okrieg and/or @pjd-nu to see if we could all meet and discuss this? Thanks!

naved001 commented 6 years ago

Other than that, I am mostly fine with the specs proposed here.

One question, not that important though: we have a dry run mode for when we don't want to make actual calls to ipmi. How would this work here? Do we still call obmd but not ipmitool, or not call obmd at all? (We could, of course, just use the dummy driver.)

naved001 commented 6 years ago

Also, one more thing. Is there a reason why the driver cannot turn on SOL if it's disabled? It's only a one-time thing, and I guess HIL admins could do it: ipmitool sol set enabled true

zenhack commented 6 years ago

I'll try swapping out the call to disable with injecting those characters. If that doesn't work it might make sense to sit down together and debug it in person; the latency is killing us here.

Re: the other calls, the only reason they're commented out is that I hadn't gotten around to tweaking them to fit back into the refactored (drivertized) version, but doing so will be trivial, and I should just go ahead and do it.

Re: dry run, my instinct is to just not talk to OBMd, but I don't feel strongly.

Re: The language issue. Frankly, I'm really annoyed at this being brought up SIX MONTHS after I wrote the prototype and solicited feedback. I @-mentioned @knikolla (and others) at least 3 times, and he just didn't respond. It is not reasonable to just ignore requests for feedback and then bring up a concern like this, this late into the process. Also, @knikolla has his own GitHub account; why is the message coming through you?

We can have that discussion, but I also want to talk about some of the flaws in our development process, because this is really not acceptable.

naved001 commented 6 years ago

> why is the message coming through you?

Well, he was around me when I was typing that out. @knikolla can you address the rest of the comments?

CCI-MOC / hil

Spec for obmd integration #915