coreos / fleet

fleet ties together systemd and etcd into a distributed init system
Apache License 2.0

Fleet fails to start units after restart #1090

Closed · yaronr closed this issue 9 years ago

yaronr commented 9 years ago

Hi

3-node CoreOS Beta channel cluster. One node on CoreOS 522.5 (the problematic one), two on 522.4.

Last night, one of my nodes decided to upgrade its CoreOS version. cool. This morning I find that a few of the services that should run on this node are inactive/dead. For this issue's sake, I will use two of the services: https://gist.github.com/yaronr/62e70a897a5560a8cc63

weave.service                       1cf0847f.../10.0.4.65   active    running
weave.service                       57c5b6a6.../10.0.5.237  active    running
weave.service                       a3a566ba.../10.0.0.168  active    running
zookeeper-weave-sidekick@1.service  1cf0847f.../10.0.4.65   active    running
zookeeper-weave-sidekick@2.service  a3a566ba.../10.0.0.168  active    running
zookeeper-weave-sidekick@3.service  57c5b6a6.../10.0.5.237  inactive  dead
zookeeper@1.service                 1cf0847f.../10.0.4.65   active    running
zookeeper@2.service                 a3a566ba.../10.0.0.168  active    running
zookeeper@3.service                 57c5b6a6.../10.0.5.237  inactive  dead

registry.service is actually started by systemd and not via fleet (cloud-init), but it's also up:

core@ip-10-0-5-237 ~ $ systemctl | grep registry
registry.service  loaded active running  Custom Docker Registry

I tried digging a bit deeper:

core@ip-10-0-5-237 ~ $ systemctl status zookeeper@3.service
● zookeeper@3.service - Zookeeper 3
   Loaded: loaded (/run/fleet/units/zookeeper@3.service; linked-runtime)
   Active: inactive (dead)

Jan 13 05:20:44 ip-10-0-5-237.ec2.internal systemd[1]: Stopping Zookeeper 3...
Jan 13 05:20:44 ip-10-0-5-237.ec2.internal docker[9047]: zoo3
Jan 13 05:20:44 ip-10-0-5-237.ec2.internal systemd[1]: Stopped Zookeeper 3.

core@ip-10-0-5-237 ~ $ systemctl status zookeeper-weave-sidekick@3.service
● zookeeper-weave-sidekick@3.service - zookeeper-weave-sidekick-3 service
   Loaded: loaded (/run/fleet/units/zookeeper-weave-sidekick@3.service; linked-runtime)
   Active: inactive (dead)

Interestingly, fleetctl list-unit-files gives:

zookeeper@3.service  090d52d  launched  launched  57c5b6a6.../10.0.5.237

even though list-units shows it as inactive/dead.

Ok, so I try: fleetctl start zookeeper@3.service

Nothing changes; the systemctl status output is the same (and there are no new logs).

sudo systemctl restart zookeeper@3.service does the trick: both the unit and the sidekick are started.

fleetctl shows it as 'running/active'

Question: could this be related to the Requires= dependency on a non-fleet unit? (Even though that unit IS running, it's a plain systemd unit, not a fleet one.)
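
(An aside for reproducibility: one quick way to test that theory, using stock systemctl calls rather than anything fleet-specific, is to ask systemd which dependencies the unit has and what state they are in; a dead Requires=/BindsTo= target would explain the unit staying down.)

# show the unit's dependency tree, with the state of each entry
systemctl list-dependencies zookeeper@3.service
# print just the dependency-related properties of the unit
systemctl show zookeeper@3.service -p Requires -p BindsTo -p ActiveState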

yaronr commented 9 years ago

Note: another couple of units (also a unit + sidekick pair) failed the same way, and they have no dependency on registry.service or any other non-fleet-controlled unit, so I guess that's one less variable in the equation.

yaronr commented 9 years ago

Ok, one additional piece of information: I have another unit that's not starting, even after calling fleetctl start. Below is the unit's gist. etcd is up, and mesos-master@1.service is up (%i = 1):

[Unit]
Description=%p discovery Container
Wants=etcd.service
After=etcd.service
After=mesos-master@%i.service
BindsTo=mesos-master@%i.service

[Service]
Restart=always
RestartSec=5
ExecStart=/bin/bash ....

[X-Fleet]
MachineOf=mesos-master@%i.service
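
As a sanity check on the X-Fleet constraint (my own suggestion, assuming standard fleetctl output): MachineOf= co-locates this unit with its target, so both should report the same machine ID in fleetctl list-units.

# the MachineOf= target must itself be scheduled and active; both units
# should show the same machine ID ('discovery' here is a stand-in for the
# actual unit name, guessed from the Description line)
fleetctl list-units | grep mesos-master
fleetctl list-units | grep discovery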

bcwaldon commented 9 years ago
  1. The reason your fleetctl start zookeeper@3.service appears to do nothing is a combination of https://github.com/coreos/fleet/issues/745 and https://github.com/coreos/fleet/issues/1025. You've told fleet that zookeeper@3.service should be launched somewhere, and it is, so a subsequent fleetctl start is a NOP (see the sketch after this list).
  2. The random failures on machine startup could be due to https://github.com/coreos/fleet/issues/997: fleet may attempt to start your sidekick unit(s) first, and they fail due to dependency issues. After they fail, fleet won't try to start them again (https://github.com/coreos/fleet/issues/998).
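
Building on point 1, a minimal sketch of how to force fleet to re-run an already-"launched" unit (my reading of the NOP behavior above, not an official recipe) is to cycle the unit's desired state through fleet instead of calling start again:

# `fleetctl start` is a NOP for a unit fleet already considers launched,
# so reset the desired state with a stop before starting again
fleetctl stop zookeeper@3.service
fleetctl start zookeeper@3.service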

Hopefully this information helps you figure out what's going on here.

yaronr commented 9 years ago

@bcwaldon thanks for your attention.

I have another case:

Wants=etcd.service
After=etcd.service

BindsTo=wordpress.service
After=wordpress.service

Restart=always

Getting:

-- Reboot --
Feb 10 15:01:16 ip-10-0-0-171.ec2.internal systemd[1]: Cannot add dependency job for unit wordpress-sidekick.service, ignoring: Unit wordpress-sidekick.service failed to load: No such file or directory.

Is it the same thing? (Note, I don't know how quickly after the reboot this happened)

bcwaldon commented 9 years ago

Yes, this is likely related, if the wordpress-sidekick.service unit is started before wordpress.service makes it to the filesystem.
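
One possible mitigation for that race (a hedged sketch, not a fix prescribed in this thread): register both units with the cluster up front, then start the main unit before its sidekick, so the BindsTo= target is on disk by the time the sidekick loads.

# register both unit files with the cluster first
fleetctl submit wordpress.service wordpress-sidekick.service
# start the main unit before the sidekick so wordpress.service reaches the
# target machine before systemd tries to resolve the sidekick's BindsTo=
fleetctl start wordpress.service
fleetctl start wordpress-sidekick.service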

yaronr commented 9 years ago

Just an update: I still have this issue, even on CoreOS 607.0.0. Was this supposed to be addressed in 607? If not, is there a scheduled release? This is very annoying.

Thanks

ericson-cepeda commented 9 years ago

Same here, on CoreOS stable (607.0.0):

registrator.service  c62e1ed3...  failed  failed
skydns.service       c62e1ed3...  failed  failed

Not even doing sudo locksmithctl reboot helps.

bcwaldon commented 9 years ago

This bug should be fixed in all channels. If you are still experiencing it, please share fleet logs that demonstrate the issue (complete logs, not just snippets; it all matters). The exact contents of your unit files would be useful, too. Please read through https://github.com/coreos/fleet/issues/1158 as well, as that may be the root cause.
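
For anyone collecting those logs, something along these lines captures the complete fleet journal for the current boot (assuming fleet runs as fleet.service under systemd, as it does on CoreOS):

# dump the full fleet log since boot, without a pager, into a file
journalctl -b -u fleet.service --no-pager > fleet.log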

yaronr commented 9 years ago

@bcwaldon I think this issue should be re-opened. I have the same thing on 633.1.

A stop/destroy/start cycle doesn't solve the problem. Note the 'No such file or directory' below:

● marathon-weave-sidekick@1.service
   Loaded: not-found (Reason: No such file or directory)
   Active: inactive (dead)

Apr 12 07:34:57 localhost systemd[1]: Cannot add dependency job for unit marathon-weave-sidekick@1.service, ignoring: Unit marathon-weave-sidekick@1.service failed to load: No such file or directory.
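
A hedged way to tell whether the unit file never reached the machine, as opposed to a pure start-ordering problem: check the directory fleet writes scheduled units into, then make systemd re-read it.

# fleet places scheduled units under /run/fleet/units on the target machine;
# if the sidekick is missing here, systemd has nothing to load
ls -l /run/fleet/units/
# ask systemd to re-scan unit files after fleet (re)writes them
sudo systemctl daemon-reload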

bcwaldon commented 9 years ago

fleet v0.9.2 (available in Alpha) addresses the problem you describe above.