coreos / container-linux-update-operator

A Kubernetes operator to manage updates of Container Linux by CoreOS
Apache License 2.0
209 stars 49 forks source link

Feature/reboot window #170

Closed mkoerperriag closed 6 years ago

mkoerperriag commented 6 years ago

Implemented support for configuring a reboot window, using timeutil/periodic from locksmith. Would be glad if you would review/merge this.

coreosbot commented 6 years ago

Can one of the admins verify this patch?

sdemos commented 6 years ago

Thanks for the pr! Things have been a little busy this week, so I haven't had the time to take a look at this as much as I would like to.

I do have one quick question though before I look closer at it. Why did you decide to bump the waiting time period in the agent? Logging an error every 24 hours is a little messy, but it just loops around and watches again anyway. It already has the behavior of restarting this watching period while it's waiting for update_engine to request a reboot anyway, so I don't see the need for an unrelated change here, especially since the periodic library already doesn't allow periods longer than 24 hours.

Also, sorry the jenkins build failed before, it was acting up earlier this week, I reran the job manually and it passed.

mkoerperriag commented 6 years ago

I decided to bump it, because the reboot window may be on a weekly basis. If you think its unnecessary, I'm completely comfortable with it.

sdemos commented 6 years ago

yeah, I think we should just leave it the way it is for now, it needs fixing up in other ways anyway.

Everything else in this pr seems okay to me. @dghubble do you have any comments on it?

dghubble commented 6 years ago

No explicit problems. I'm rather unhappy to add features given the state of the repo and lack of tests.

dghubble commented 6 years ago

Would be nicer to add docs akin to those locksmith provides. https://github.com/coreos/locksmith#reboot-windows

pablokbs commented 6 years ago

@mkoerperriag Hello, I'm trying your changes but I can't make it work, should I add the window annotations to the nodes with the agent? As annotations?

kubectl annotate nodes ip-10-80-25-82.ec2.internal "container-linux-update.v1.coreos.com/reboot-window-start=Mon 22:00" --overwrite
kubectl annotate nodes ip-10-80-25-82.ec2.internal "container-linux-update.v1.coreos.com/reboot-window-lenght=10m" --overwrite

I added that while being outside that window, and the node got rebooted anyway, I don't see anything in the logs:

Operator

container-linux-update-operator-f8b4469d5-hbk2b update-operator I0226 22:44:46.409525       1 operator.go:504] Found 1 nodes that need a reboot
container-linux-update-operator-f8b4469d5-hbk2b update-operator I0226 22:45:27.012342       1 operator.go:537] Found 0 rebooted nodes
container-linux-update-operator-f8b4469d5-hbk2b update-operator I0226 22:45:27.809136       1 operator.go:479] Found node "ip-10-80-25-82.ec2.internal" still rebooting, waiting
container-linux-update-operator-f8b4469d5-hbk2b update-operator I0226 22:45:27.809171       1 operator.go:481] Found 1 (of max 1) rebooting nodes; waiting for completion

Agent

I0226 22:42:54.487015       1 main.go:45] /bin/update-agent running
I0226 22:42:54.487084       1 agent.go:84] Setting info labels
I0226 22:42:54.512340       1 agent.go:95] Setting annotations map[string]string{"container-linux-update.v1.coreos.com/reboot-in-progress":"false", "container-linux-update.v1.coreos.com/reboot-needed":"false"}
I0226 22:44:04.817898       1 agent.go:106] Marking node as schedulable
I0226 22:44:04.828945       1 agent.go:116] Waiting for ok-to-reboot from controller...
I0226 22:44:04.829010       1 agent.go:239] Beginning to watch update_engine status
I0226 22:44:04.829623       1 agent.go:194] Updating status
I0226 22:45:27.620963       1 agent.go:130] Setting annotations map[string]string{"container-linux-update.v1.coreos.com/reboot-in-progress":"true"}
I0226 22:45:27.632410       1 agent.go:142] Marking node as unschedulable
I0226 22:45:27.640500       1 agent.go:147] Getting pod list for deletion
I0226 22:45:27.661686       1 agent.go:156] Deleting 0 pods
I0226 22:45:27.661706       1 agent.go:180] Node drained, rebooting
sdemos commented 6 years ago

@pablokbs based on the implementation, the configuration is done through command-line flags, not annotations, so you have to modify your operator deployment with the new flags to take advantage of this behavior. Environment variables of the form UPDATE_OPERATOR_REBOOT_WINDOW_START and UPDATE_OPERATOR_REBOOT_WINDOW_LENGTH should also work, also set in the operator deployment.

@mkoerperriag sorry, I dropped the ball a little bit on this one. Can you add the documentation that @dghubble mentioned in his comments describing how the reboot windows work? I would place them under the doc/ folder in their own markdown file. Basing them on the locksmith docs (https://github.com/coreos/locksmith#reboot-windows) would be a good idea.

mkoerperriag commented 6 years ago

Thanks for your feedback so far. @sdemos I will add the docs in the next couple of days.

coreosbot commented 6 years ago

Can one of the admins verify this patch?

coreosbot commented 6 years ago

Can one of the admins verify this patch?

coreosbot commented 6 years ago

Can one of the admins verify this patch?

coreosbot commented 6 years ago

Can one of the admins verify this patch?

coreosbot commented 6 years ago

Can one of the admins verify this patch?

coreosbot commented 6 years ago

Can one of the admins verify this patch?

sdemos commented 6 years ago

I'm sorry that it's taken so long for me to get back to this! Totally my fault. Also, sorry about the bot spam, we have been having some problems with our jenkins github pr plugin.

I'm going to test this on my cluster today and make sure everything works as expected.