coreos / fleet

fleet ties together systemd and etcd into a distributed init system
Apache License 2.0

Unit spontaneously fails and doesn't restart #1130

Closed. timfpark closed this issue 9 years ago.

timfpark commented 9 years ago

I'm using CoreOS to launch a set of containers that back an application. I launch them via:

$ fleetctl start ../common/web@{1,2,3,4,5}.service

This works great initially, but somewhere down the road the units are inexplicably stopped and then, rather than restarting, are left in the failed state. Both the stop and the failure to restart are mysteries to me, because per the docs I expect the unit to be restarted:

● web@1.service
   Loaded: not-found (Reason: No such file or directory)
   Active: failed (Result: exit-code) since Thu 2015-02-19 15:11:58 UTC; 1h 35min ago
 Main PID: 10747 (code=exited, status=255)

Feb 18 23:47:04 web-prod-coreos-2 docker[3199]: 168.63.129.16 - - [Wed, 18 Feb 2015 23:47:04 GMT] "GET /api/v1/ops/health HTTP/1.1" 200 125 "-" "Load Balancer Agent"
Feb 18 23:47:19 web-prod-coreos-2 docker[3199]: 168.63.129.16 - - [Wed, 18 Feb 2015 23:47:19 GMT] "GET /api/v1/ops/health HTTP/1.1" 200 125 "-" "Load Balancer Agent"
Feb 18 23:47:28 web-prod-coreos-2 systemd[1]: Stopping Web Service...
Feb 18 23:47:30 web-prod-coreos-2 docker[4706]: web
Feb 18 23:47:30 web-prod-coreos-2 systemd[1]: web@1.service: main process exited, code=exited, status=255/n/a
Feb 18 23:47:30 web-prod-coreos-2 systemd[1]: Stopped Web Service.
Feb 18 23:47:30 web-prod-coreos-2 systemd[1]: Unit web@1.service entered failed state.
Feb 18 23:47:30 web-prod-coreos-2 systemd[1]: web@1.service failed.
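(For anyone reproducing this, comparing fleet's view of a unit with systemd's view on the affected host looks roughly like this; web@1.service is the instance shown above:)

$ fleetctl list-unit-files            # desired state fleet has recorded
$ fleetctl list-units                 # fleet's view of current unit state
$ systemctl status web@1.service      # systemd's view on the host itself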

I'm perplexed that status returns Loaded: not-found on the failed unit, because the status of a healthy unit reports the following, which indicates that it does have the unit loaded:

● web@1.service - Web Service
   Loaded: loaded (/run/fleet/units/web@1.service; linked-runtime; vendor preset: disabled)
   Active: active (running) since Thu 2015-02-19 16:42:09 UTC; 10min ago
  Process: 17719 ExecStartPre=/usr/bin/docker pull timfpark/web:latest (code=exited, status=0/SUCCESS)
  Process: 17702 ExecStartPre=/usr/bin/docker rm web (code=exited, status=1/FAILURE)
  Process: 17695 ExecStartPre=/usr/bin/docker kill web (code=exited, status=1/FAILURE)
 Main PID: 17727 (docker)
   CGroup: /system.slice/system-web.slice/web@1.service
           └─17727 /usr/bin/docker run --rm --name web -p 80:8080 timfpark/web
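(When a unit flips to Loaded: not-found like this, these checks on the affected host seem relevant; the path is the one from the healthy status above, and daemon-reload is the usual remedy if systemd's in-memory state has gone stale:)

$ ls -l /run/fleet/units/web@1.service                            # is the unit file still on disk?
$ systemctl show web@1.service -p LoadState -p NeedDaemonReload   # what systemd currently believes
$ sudo systemctl daemon-reload                                    # re-read unit files if state looks stale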

I'm using the following unit file for this service:

[Unit]
Description=Web Service
After=docker.service
Requires=docker.service

[Service]
TimeoutStartSec=0
ExecStartPre=-/usr/bin/docker kill web
ExecStartPre=-/usr/bin/docker rm web
ExecStartPre=/usr/bin/docker pull timfpark/web:latest
ExecStart=/usr/bin/docker run --rm --name web -p 80:8080 timfpark/web
ExecStop=/usr/bin/docker stop web

[X-Fleet]
X-Conflicts=web@*.service
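(One observation, offered tentatively: this [Service] section has no Restart= policy, so systemd on its own will not restart the container after an unexpected exit; only fleet rescheduling the unit would bring it back. If a systemd-level restart is also wanted, directives along these lines could be added, though that is separate from the not-found failure discussed above:)

# illustrative additions to the existing [Service] section
Restart=always
RestartSec=10s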

guruvan commented 9 years ago

I'm seeing this and/or similar issues constantly on 557.2. One suggestion was to add NeedsDaemonReload=yes to the unit. This is pretty crucial IMO, as it is preventing services from launching automatically on new hosts (and basically causing rolling outages of services).

bcwaldon commented 9 years ago

NeedsDaemonReload is not an option you can set in your unit files. This is a property provided directly by systemd over the DBus API.
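(For reference, the property in question is exposed by systemd itself, where it appears to be spelled NeedDaemonReload, and it can be read from the shell as well; web@1.service here is just the example unit from this thread:)

$ systemctl show web@1.service -p NeedDaemonReload
NeedDaemonReload=no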

@timfpark This could be related to #900, but I'm not ready to assume that without a little more info. Do you happen to have any logs from fleet.service around the time that your unit entered this unexpected state?
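(For anyone collecting the same data, something like the following should capture the relevant window from the journal; the timestamps are placeholders:)

$ journalctl -u fleet.service --no-pager --since "2015-02-18 23:00" --until "2015-02-19 01:00"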

timfpark commented 9 years ago

@bcwaldon: Here you go...

Mar 02 16:00:23 frontdoor-prod-coreos-3 fleetd[1657]: INFO server.go:153: Establishing etcd connectivity
Mar 02 16:00:23 frontdoor-prod-coreos-3 fleetd[1657]: INFO server.go:164: Starting server components
Mar 02 16:00:23 frontdoor-prod-coreos-3 fleetd[1657]: INFO engine.go:82: Engine leadership changed from 4b148412e4374434a4b2b92655f6ef32 to 64b646305c6d46e9953486371e93cb3d
Mar 02 16:03:39 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:39 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:39 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:39 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:39 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:40 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:40 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:40 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:43 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:44 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:54 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:54 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:54 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:54 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:54 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:55 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:55 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:55 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:05:58 frontdoor-prod-coreos-3 fleetd[1657]: INFO manager.go:145: Triggered systemd unit frontdoor@1.service stop: job=1360760
Mar 02 16:05:58 frontdoor-prod-coreos-3 fleetd[1657]: INFO manager.go:275: Removing systemd unit frontdoor@1.service
Mar 02 16:05:58 frontdoor-prod-coreos-3 fleetd[1657]: INFO reconcile.go:321: AgentReconciler completed task: type=UnloadUnit job=frontdoor@1.service reason="unit loaded but not scheduled here"
Mar 02 16:06:24 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:06:24 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:06:24 frontdoor-prod-coreos-3 fleetd[1657]: INFO manager.go:145: Triggered systemd unit newrelic-sysmond@3.service stop: job=1360841
Mar 02 16:06:24 frontdoor-prod-coreos-3 fleetd[1657]: INFO manager.go:275: Removing systemd unit newrelic-sysmond@3.service
Mar 02 16:06:24 frontdoor-prod-coreos-3 fleetd[1657]: INFO reconcile.go:321: AgentReconciler completed task: type=UnloadUnit job=newrelic-sysmond@3.service reason="unit loaded but not scheduled here"
Mar 02 16:06:51 frontdoor-prod-coreos-3 fleetd[1657]: INFO manager.go:262: Writing systemd unit frontdoor@2.service (465b)
Mar 02 16:06:51 frontdoor-prod-coreos-3 fleetd[1657]: INFO reconcile.go:321: AgentReconciler completed task: type=LoadUnit job=frontdoor@2.service reason="unit scheduled here but not loaded"
Mar 02 16:06:52 frontdoor-prod-coreos-3 fleetd[1657]: ERROR manager.go:136: Failed to trigger systemd unit frontdoor@2.service start: Unit frontdoor@2.service failed to load: No such file or directory.
Mar 02 16:06:52 frontdoor-prod-coreos-3 fleetd[1657]: INFO reconcile.go:321: AgentReconciler completed task: type=StartUnit job=frontdoor@2.service reason="unit currently loaded but desired state is launched"
Mar 02 16:07:25 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:08:09 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:08:09 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:08:10 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:08:10 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:08:39 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:08:39 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:08:39 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:09:18 frontdoor-prod-coreos-3 fleetd[1657]: INFO manager.go:262: Writing systemd unit newrelic-sysmond@3.service (698b)
Mar 02 16:09:18 frontdoor-prod-coreos-3 fleetd[1657]: INFO reconcile.go:321: AgentReconciler completed task: type=LoadUnit job=newrelic-sysmond@3.service reason="unit scheduled here but not loaded"
Mar 02 16:09:18 frontdoor-prod-coreos-3 fleetd[1657]: INFO manager.go:134: Triggered systemd unit newrelic-sysmond@3.service start: job=1361002
Mar 02 16:09:18 frontdoor-prod-coreos-3 fleetd[1657]: INFO reconcile.go:321: AgentReconciler completed task: type=StartUnit job=newrelic-sysmond@3.service reason="unit currently loaded but desired state is launched"
Mar 02 16:12:09 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:12:09 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:12:09 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:12:09 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:12:09 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:12:25 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:12:25 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled

imjacobclark commented 9 years ago

I'm seeing similar issues regularly on CoreOS 557.2.0, where units started with fleet fail without much useful output; I just see main process exited, code=exited, status=255/n/a in the service logs.

bcwaldon commented 9 years ago

@imjacobclark the first step in debugging that is to take fleet out of the equation. Can you use systemctl commands to start the unit directly?
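(A minimal way to try that, assuming the template unit file is available on the host; the paths and unit names here are illustrative:)

$ sudo cp web@.service /etc/systemd/system/
$ sudo systemctl daemon-reload
$ sudo systemctl start web@1.service
$ systemctl status web@1.service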

guruvan commented 9 years ago

@bcwaldon Sorry if I was short; it's been a rough couple of weeks. I haven't got the logs yet (I still need to sanitize the units), but I will push the whole lot up to GitHub ASAP. However, I made some significant changes to my network this past weekend to accommodate 607.0. I've updated all the workers, but the etcd cluster is still entirely on 557.2 or 584 (which seems OK so far!).

1) The rolling outages I described above were primarily caused by etcd/flannel/fleet lagging. I've moved load around and cut down the overall amount of DNS traffic (which operates via skydns), and I no longer see my services dying off. This behavior, however, greatly amplified the effects of:

2) The bug described above, which seems to be caused either by

I had enough host churn as I rebuilt workers this weekend to identify that I only hit this issue with a small handful of unit instances that don't load via fleet.

The "no such file or directory" units seem very consistent across the network. Trying to identify the log snippets you need exactly. (i.e. "egrep" is better than "egrep -v everything_else" - but not sure what I'm egrepping for) - In every case, going to the host, doing a systemctl daemon-reload && systemctl restart unit@2.app.service gets the unit running again.

While I'm sure it's not recommended, it's not clear how one would correctly edit a production unit file and submit it to fleet without destroying all the instances of the unit together. It's not really practical to replace all instances simultaneously in production, and my best guess is that this is the primary cause in my situation. (I know I've had to do this with most, if not all, of the unit instances that are seeing this issue.) My service discovery depends on the unit instance name, which names the container and provides the structure for the DNS names, so it's not easy for me to simply use a new unit name to achieve the result.
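(For what it's worth, a sketch of how a rolling edit might be attempted, one instance at a time; the unit names are placeholders and I have not verified fleet's behavior when a template changes underneath already-running instances:)

$ fleetctl destroy web@1.service      # stop and unschedule a single instance
$ fleetctl destroy web@.service       # remove the old template from the registry
$ fleetctl submit web@.service        # submit the edited template
$ fleetctl start web@1.service        # relaunch that instance from the new template
$ fleetctl list-units                 # verify before touching the next instance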

I don't know how to duplicate the scenario that caused this, but as soon as I have an opportunity later this week I'm going to move some more services around so that I can destroy all the instances, and the templates, for units with offending instances. I think this will cure the issue, and I'll report back my findings.

Apologies, I meant this to go into #1134, along with your previous comment this weekend.

bcwaldon commented 9 years ago

@guruvan Have your issues been addressed via fleet v0.9.2?

bcwaldon commented 9 years ago

@imjacobclark any more information you can share?

alxeg commented 9 years ago

Any news on this issue?

bcwaldon commented 9 years ago

closing due to inactivity