coreos / container-linux-update-operator

A Kubernetes operator to manage updates of Container Linux by CoreOS

segfault when updating / stuck updating #184

Open johanneswuerbach opened 6 years ago

johanneswuerbach commented 6 years ago

Recently, CLUO (v0.7.0) seems to have gotten stuck and continuously tried to update the same node.

CoreOS: 1800.5.0
Kubernetes: v1.9.9
Cloud: AWS us-east-1, kops 1.10

Agent logs:

I0827 23:00:28.134375       1 agent.go:184] Node drained, rebooting
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x1238536]
 goroutine 33 [running]:
github.com/coreos/container-linux-update-operator/pkg/updateengine.(*Client).ReceiveStatuses(0xc4203c6660, 0xc420052480, 0xc420052300)
    /go/src/github.com/coreos/container-linux-update-operator/pkg/updateengine/client.go:99 +0x186
created by github.com/coreos/container-linux-update-operator/pkg/agent.(*Klocksmith).watchUpdateStatus
    /go/src/github.com/coreos/container-linux-update-operator/pkg/agent/agent.go:251 +0x102

Controller logs:

I0827 23:03:26.319148       1 operator.go:449] Found node "ip-10-100-24-49.ec2.internal" still rebooting, waiting
I0827 23:03:26.319172       1 operator.go:451] Found 1 (of max 1) rebooting nodes; waiting for completion
I0827 23:03:59.455065       1 operator.go:507] Found 0 rebooted nodes
I0827 23:03:59.719801       1 operator.go:449] Found node "ip-10-100-24-49.ec2.internal" still rebooting, waiting
I0827 23:03:59.720003       1 operator.go:451] Found 1 (of max 1) rebooting nodes; waiting for completion
I0827 23:04:32.719449       1 operator.go:507] Found 0 rebooted nodes
I0827 23:04:33.119047       1 operator.go:449] Found node "ip-10-100-24-49.ec2.internal" still rebooting, waiting
I0827 23:04:33.119072       1 operator.go:451] Found 1 (of max 1) rebooting nodes; waiting for completion
I0827 23:05:06.520970       1 operator.go:507] Found 0 rebooted nodes
I0827 23:05:06.918956       1 operator.go:449] Found node "ip-10-100-24-49.ec2.internal" still rebooting, waiting
I0827 23:05:06.918976       1 operator.go:451] Found 1 (of max 1) rebooting nodes; waiting for completion
I0827 23:05:39.920518       1 operator.go:507] Found 0 rebooted nodes
I0827 23:05:40.320071       1 operator.go:449] Found node "ip-10-100-24-49.ec2.internal" still rebooting, waiting
I0827 23:05:40.320094       1 operator.go:451] Found 1 (of max 1) rebooting nodes; waiting for completion
I0827 23:06:13.719273       1 operator.go:507] Found 1 rebooted nodes
I0827 23:06:14.519760       1 operator.go:449] Found node "ip-10-100-24-49.ec2.internal" still rebooting, waiting
I0827 23:06:14.519909       1 operator.go:451] Found 1 (of max 1) rebooting nodes; waiting for completion
johanneswuerbach commented 6 years ago

Looks like downgrading to v0.6.0 has solved the issue for us.

sdemos commented 6 years ago

The panic you are running into looks the same as the one in #93, which is odd because, according to that issue, it should have been fixed in v0.7.0. I'll have to try to reproduce it; I don't remember much about it.

As for the node not updating, the panic shouldn't have anything to do with it. The panic happens because the D-Bus channel gets closed underneath the watch function when the system goes down for the reboot.
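Roughly, the failure mode looks like the following. This is a minimal, self-contained sketch, not the actual pkg/updateengine/client.go code; the Status type and the watch and main functions are made up for illustration, assuming the watch loop dereferences the signal it receives from the D-Bus channel. Once that channel is closed at shutdown, every receive yields nil, and the dereference is exactly this kind of SIGSEGV.

// Sketch only: illustrates a receive from a closed signal channel
// returning nil and the unguarded dereference panicking.
package main

import "fmt"

// Status stands in for the update_engine status signal payload
// (hypothetical type for illustration).
type Status struct {
	Body []interface{}
}

func watch(ch <-chan *Status, stop <-chan struct{}) {
	for {
		select {
		case st := <-ch:
			if st == nil {
				// Channel closed underneath us because the system is
				// going down; without this guard, st.Body would be a
				// nil pointer dereference and panic with SIGSEGV.
				return
			}
			fmt.Println("status:", st.Body)
		case <-stop:
			return
		}
	}
}

func main() {
	ch := make(chan *Status, 1)
	ch <- &Status{Body: []interface{}{"UPDATE_STATUS_UPDATED_NEED_REBOOT"}}
	close(ch)      // simulate the D-Bus connection closing during reboot
	watch(ch, nil) // a nil stop channel never fires in the select
}

Guarding the nil receive (or returning as soon as the channel closes), as in the sketch above, would silence the panic, though at that point the node is about to reboot anyway.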

Can you post the operator deployment from the failing cluster? Do you have a reboot window or any pre- or post-reboot hooks configured? It might also help to get some of the debug logs, which you can do by adding the flag -v 4 to the operator deployment.