simonvanderveldt opened this issue 7 years ago
I'd want to know which version of fleet it is, CoreOS version too, etc. Anyway I guess it would probably be worth tuning agent_ttl. See also https://github.com/coreos/fleet/issues/1334.
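For reference, agent_ttl is set in fleet's configuration; a minimal sketch, assuming the usual /etc/fleet/fleet.conf path and using 30s purely as an illustration, not a recommendation:

```
# /etc/fleet/fleet.conf -- illustrative value only
# agent_ttl: roughly, how long an agent may fail to communicate with the
# registry before fleet considers the machine dead and reschedules its units.
agent_ttl="30s"
```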
@dongsupark sorry for the missing info, I've added it to the initial message.
We've already had to tune agent_ttl because we ran into a lot of problems when it was at its default value. It's now at 30s.
Would fleet exceeding this TTL cause it to stop units?
The exact same thing happened yesterday morning and again units got stopped. The main question we have is this: once the machines in the cluster reconnect to the masters, fleet on the machines where units were previously scheduled stops those units before they have (fully) started on the other machine the unit was moved to.
See below for an example:
```
November 6th 2016, 07:20:22.626 tst-1-master-10.1.21.52 INFO engine.go:256: Unscheduled Job(common@2.service) from Machine(59636889a9ed4f3ba191642fd545978e)
November 6th 2016, 07:20:22.633 tst-1-master-10.1.21.52 INFO reconciler.go:161: EngineReconciler completed task: {Type: UnscheduleUnit, JobName: common@2.service, MachineID: 59636889a9ed4f3ba191642fd545978e, Reason: "target Machine(59636889a9ed4f3ba191642fd545978e) went away"}
November 6th 2016, 07:20:23.092 tst-1-master-10.1.21.52 INFO engine.go:271: Scheduled Unit(common@2.service) to Machine(41a360cf46ef4a4fb674a3fae3af099e)
November 6th 2016, 07:20:23.103 tst-1-master-10.1.21.52 INFO reconciler.go:161: EngineReconciler completed task: {Type: AttemptScheduleUnit, JobName: common@2.service, MachineID: 41a360cf46ef4a4fb674a3fae3af099e, Reason: "target state launched and unit not scheduled"}
November 6th 2016, 07:20:23.900 tst-1-worker-10.1.21.87 INFO manager.go:246: Writing systemd unit common@2.service (1846b)
November 6th 2016, 07:20:24.035 tst-1-worker-10.1.21.88 INFO manager.go:138: Triggered systemd unit common@2.service stop: job=1192125
November 6th 2016, 07:20:24.077 tst-1-worker-10.1.21.88 INFO manager.go:259: Removing systemd unit common@2.service
November 6th 2016, 07:20:25.426 tst-1-worker-10.1.21.87 INFO manager.go:127: Triggered systemd unit common@2.service start: job=1195770
November 6th 2016, 07:20:25.497 tst-1-worker-10.1.21.87 INFO reconcile.go:330: AgentReconciler completed task: type=LoadUnit job=common@2.service reason="unit scheduled here but not loaded"
November 6th 2016, 07:20:26.361 tst-1-worker-10.1.21.87 INFO reconcile.go:330: AgentReconciler completed task: type=StartUnit job=common@2.service reason="unit currently loaded but desired state is launched"
November 6th 2016, 07:20:32.133 tst-1-worker-10.1.21.88 ERROR unit_state.go:204: Failed to destroy UnitState(common@2.service) in Registry: context deadline exceeded
November 6th 2016, 07:20:32.197 tst-1-worker-10.1.21.88 INFO reconcile.go:330: AgentReconciler completed task: type=UnloadUnit job=common@2.service reason="unit loaded but not scheduled here"
```
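If it helps to trace this hand-off, it can be followed with the usual fleetctl/journalctl commands on the two workers; a sketch, with the unit name and time window taken from the excerpt above:

```
# Where does fleet currently think the unit is scheduled, and in what state?
fleetctl list-units | grep common@2
fleetctl list-unit-files | grep common@2

# On the old worker (10.1.21.88) and the new one (10.1.21.87), compare the
# stop/start timestamps around the reschedule.
journalctl -u fleet.service --since "2016-11-06 07:20:00" --until "2016-11-06 07:21:00"
journalctl -u common@2.service --since "2016-11-06 07:20:00" --until "2016-11-06 07:21:00"
```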
Looking into the log and the code, my guess is as follows.
First, you could also tune etcd_request_timeout to reduce errors like "Engine leadership lost, renewal failed: context deadline exceeded". Of course it's still hard to figure out the real reason for the lease renewal failure, but such tuning could be a workaround.
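Such a tuning could look roughly like this; a minimal sketch, again assuming the standard /etc/fleet/fleet.conf, with the value (in seconds) chosen purely as an example:

```
# /etc/fleet/fleet.conf -- illustrative value only
# etcd_request_timeout: how long fleet waits for an etcd request to complete
# before failing it with "context deadline exceeded".
etcd_request_timeout=5.0
```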
Second, at first I could not understand how the unscheduling tasks followed the lease renewal failure, but after reading the v0.11.5 code I think I do now. This issue might be fixed by https://github.com/coreos/fleet/pull/1496, which has already been merged and is available since v0.12. Before that PR, a monitor failure resulted in a complete shutdown and restart of the entire server; after that PR, the shutdown procedure is handled gracefully. Of course I'm not sure the PR really fixes this issue, as I'm not familiar with the 0.11.x code base. In any case, please try upgrading to v0.12 or newer.
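As a quick sanity check after upgrading, something like this should confirm the versions actually in use on a node (a sketch; the exact output format varies between releases):

```
# fleetctl client version; it should match the fleetd shipped in the CoreOS image
fleetctl version

# etcd2 version on the node
etcd2 --version
```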
I believe that fleet should no longer stop units if it loses its connection to the cluster, but that's what seems to have happened to us. We run a cluster of 17 machines, of which we've dedicated 3 to master duty.
We're running CoreOS 899.13.0 (because we had stability issues with the 1000 series). It uses the following versions of fleetd and etcd2:
It started with a single non-master node having etcd2 connectivity issues.
This is something that eventually all regular nodes showed.
We then got a new etcd2 leader election.
Then the following repeats 300+ times from the other two master nodes that weren't disconnected.
And finally we see the following in fleet.
It seems like the reconciler was triggered, though IMHO it shouldn't have been. What could be the cause of this?