coreos / fleet

fleet ties together systemd and etcd into a distributed init system
Apache License 2.0
Dead container not restarted with Restart=always in [Service] #940

Open fmpwizard opened 9 years ago

fmpwizard commented 9 years ago

Hi,

Fleet is not restarting my containers when the app/service crashes and exits with error -1.

Details:

I deployed a CoreOS server to DigitalOcean using this cloud-config: https://gist.github.com/fmpwizard/e764f4d30811cda50c35 I then ssh'ed into the server and created the two .service files included in the previous link.

Then I ran:

fleetctl submit app@.service
fleetctl submit app-discovery@.service
fleetctl list-unit-files //files are there
fleetctl load app@8080.service
fleetctl load app-discovery@8080.service
fleetctl start app@8080.service

At this point the service is running (it is a simple Go REST API).

In my browser I get normal responses from http://:8080/read?key=cnt

Then I forced a crash by calling

http://:8080/crash

At this point fleetctl list-units shows both app@8080 and app-discovery@8080 as dead, when I was hoping that fleet would detect the failure, respect the Restart=always line, and restart my container.

Let me know if you need any more information. I'm submitting the ticket as suggested on this thread: https://groups.google.com/d/topic/coreos-user/l5hFFqHGPTI/discussion Thanks

In case people search for the contents of the unit files, I'll paste them here:

[Unit]
Description=Go Application that talks to etcd
After=etcd.service
After=docker.service
Requires=app-discovery@%i.service

[Service]
TimeoutStartSec=0
Restart=always
EnvironmentFile=/etc/environment
ExecStartPre=-/usr/bin/docker kill app%i
ExecStartPre=-/usr/bin/docker rm app%i
ExecStartPre=/usr/bin/docker pull fmpwizard/coreosdemo
#ExecStart=/usr/bin/docker run --name app%i -p ${COREOS_PUBLIC_IPV4}:%i:8080 fmpwizard/coreosdemo --link etcd:etcd coreosdemo
ExecStart=/usr/bin/docker run --restart='always' --name app%i -p ${COREOS_PUBLIC_IPV4}:%i:8080 fmpwizard/coreosdemo
ExecStop=/usr/bin/docker stop app%i

[X-Fleet]
X-Conflicts=app@*.service

[Unit]
Description=Announce app@%i service
BindsTo=app@%i.service

[Service]
EnvironmentFile=/etc/environment
ExecStart=/bin/sh -c "while true; do etcdctl set /announce/services/app%i ${COREOS_PUBLIC_IPV4}:%i --ttl 60; sleep 45; done"
ExecStop=/usr/bin/etcdctl rm /announce/services/app%i

[X-Fleet]
X-ConditionMachineOf=app@%i.service

fmpwizard commented 9 years ago

This may or may not be useful, but fleetctl list-unit-files shows the STATE as launched while the service is actually dead, as fleetctl list-units shows:

core@one ~ $ fleetctl list-unit-files
UNIT                        HASH     DSTATE    STATE     TARGET
app-discovery@8080.service  480eae8  loaded    loaded    841c47a2.../104.131.120.176
app@8080.service            fcb3383  launched  launched  841c47a2.../104.131.120.176
core@one ~ $ fleetctl list-units
UNIT                        MACHINE                      ACTIVE    SUB
app-discovery@8080.service  841c47a2.../104.131.120.176  inactive  dead
app@8080.service            841c47a2.../104.131.120.176  inactive  dead
core@one ~ $

jonboulle commented 9 years ago

@fmpwizard this is certainly strange. Two initial comments:

fmpwizard commented 9 years ago

@jonboulle Thanks for the reply. 1. That link was very clear to read, thanks. 2. Using plain systemd, the container does not come back and stays in failed mode. I also stopped using templates to reduce the number of "features" I was using.

Just in case I missed anything: after I created the app.service and app-discovery.service files, I simply ran sudo systemctl start app and the container started. I was able to send URL requests, then I crashed it, and the app did not come back online.

Did you happen to spot anything wrong with my unit file? I would imagine that restarting a service is a very common feature that most people use; I just don't know what else to try.

Thanks

fmpwizard commented 9 years ago

Any help on this issue would be great. Thanks.

fmpwizard commented 9 years ago

As I kept searching around, I found this post https://groups.google.com/d/topic/coreos-user/bGr-rrYeCj8/discussion about systemd-docker, which I think explains why my container is not being restarted. I was hoping it was some misconfiguration on my side, but if that link applies to my problem, then this is worse than I had hoped.

Thanks

dbason commented 9 years ago

I think the OpenShift guys have managed to work around this with geard (https://github.com/openshift/geard). I haven't really looked into it much, but it might be worth talking to them.

efuquen commented 9 years ago

I'm running into this issue as well: none of my containers ever restart automatically on unit failure, regardless of what Restart= is set to. Is the current suggested workaround to use systemd-docker? Is it confirmed that it will resolve this issue? Looking at the last post in the linked Google Groups discussion, it's not clear whether systemd-docker will be added to CoreOS.

bcwaldon commented 9 years ago

Possibly related: https://bugs.freedesktop.org/show_bug.cgi?id=89087

fmpwizard commented 9 years ago

systemd-docker is the current workaround, but it still doesn't seem that it will be part of CoreOS.

mclarkson commented 9 years ago

Can anyone see any issues with using the following workaround?

https://github.com/docker/docker/issues/6791#issuecomment-72338100

"A simpler solution than using @ibuildthecloud's systemd-docker is to start a docker container in the background in ExecStartPre via run -d container or start container, and then use ExecStart=/usr/bin/docker logs -f container. This way systemd, before starting any dependent units, waits until docker run -d or docker start returns, and that happens only when the container is started. Then the logs command sends the initial startup logs to systemd and the journal, and continues to do so as new logs arrive until the container stops.

With this approach one also needs to put -/usr/bin/docker stop container in both ExecStop and ExecStopPost. The latter ensures that if /usr/bin/docker logs dies before the container terminates, systemd still stops the container. Note that by using only ExecStopPost without ExecStop one will not get the termination logs into the journal, as systemctl stop will kill the logs command before ExecStopPost stops the container."
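For anyone skimming, the quoted approach might look roughly like the unit below. This is only a sketch under assumptions: the container name "app" and the image fmpwizard/coreosdemo are borrowed from the unit files earlier in this thread, the Description is made up, and nothing here has been confirmed as a fix for this issue.

```ini
# Sketch of the "docker logs -f" workaround quoted above.
# Assumed names: container "app", image fmpwizard/coreosdemo.
[Unit]
Description=Container supervised through docker logs -f (sketch)
After=docker.service
Requires=docker.service

[Service]
TimeoutStartSec=0
Restart=always
# Remove any leftover container; the leading "-" ignores failures.
ExecStartPre=-/usr/bin/docker rm -f app
# Start the container detached; this returns once it has started.
ExecStartPre=/usr/bin/docker run -d --name app fmpwizard/coreosdemo
# The process systemd tracks; it exits when the container stops.
ExecStart=/usr/bin/docker logs -f app
# Stop in both places, as the quoted comment recommends.
ExecStop=-/usr/bin/docker stop app
ExecStopPost=-/usr/bin/docker stop app
```

With Restart=always, each restart runs the ExecStartPre lines again, so the old container is removed and a fresh one is created before the logs command is reattached.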

chrisfarms commented 9 years ago

I'm also hitting this issue, and I could swear this wasn't a problem several months ago. It's pretty crippling not being able to restart docker services.

I'm having to run my own mini init process to handle restarts, but I'm sure the init process will fail one day too :(

mhamrah commented 9 years ago

I'm curious if anyone has tried docker run --restart=always, pushing this responsibility to the docker layer instead of the fleet layer?

vincentheet commented 9 years ago

@mhamrah I just tried it since we have the same issue but the docker container is not restarted by docker. This probably is because of the docker stop command in my unit file under ExecStop. It tells docker to stop it and I guess this is fired when the unit enters the dead state.

ExecStop=/usr/bin/docker stop app

mhamrah commented 9 years ago

@vincentheet thanks for the tip. I'm pretty sure my issue had to do with the underlying systemd interaction and restarts. By default, the restart timeout is 100ms, and systemd throttling will prevent a service from restarting if it restarts too quickly (I think the default is 5 times in 10 seconds?).

I modified our fleet unit files to include the following, which slows down restarts and disables throttling:

[Service]
#Exec stuff
RestartSec=5s
StartLimitInterval=0
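
To make the throttling arithmetic concrete, here is a toy calculation using the figures mentioned above (100ms restart delay, 5 starts per 10 seconds). It is a sketch, not a model of systemd: the function names are made up, and it assumes the service fails immediately on every start, so consecutive starts are spaced exactly one restart delay apart.

```shell
#!/bin/sh
# Toy arithmetic only; this does not query or model systemd itself.
# Assumption: the service fails immediately on every start, so starts
# happen at t = 0, DELAY, 2*DELAY, ...

# starts_in_window DELAY_MS WINDOW_MS
# Prints how many of those start times satisfy t < WINDOW_MS.
starts_in_window() {
    echo $(( ($2 - 1) / $1 + 1 ))
}

# hits_start_limit DELAY_MS WINDOW_MS BURST
# Prints "yes" if more than BURST starts fit in one window, else "no".
hits_start_limit() {
    if [ "$(starts_in_window "$1" "$2")" -gt "$3" ]; then
        echo yes
    else
        echo no
    fi
}

hits_start_limit 100 10000 5     # 100ms delay: prints "yes"
hits_start_limit 5000 10000 5    # RestartSec=5s: prints "no"
```

Under these assumptions, RestartSec=5s on its own already keeps the unit below 5 starts in 10 seconds; StartLimitInterval=0 then removes the limit entirely.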
praveennous commented 9 years ago

Hi all, could someone let me know how to make a container restart automatically when the system boots? If we change /etc/systemd/system/sshd.service as below, adding "Restart=always", will it apply to all containers on the host?

[Service]
EnvironmentFile=-/etc/default/ssh
ExecStart=/usr/sbin/sshd -D $SSHD_OPTS
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=always

mischief commented 9 years ago

@praveennous editing sshd.service only affects sshd.

praveennous commented 9 years ago

@mischief: Thanks for the info. Actually, my requirement is to make two Docker web containers start automatically when the system boots. How can I achieve this? Please help me with this.

skozin commented 9 years ago

@praveennous, put the following section in your services' unit files:

[Install]
WantedBy=multi-user.target

For example:

[Unit]
Description=my-service
BindsTo=docker.service
After=docker.service

[Service]
Type=simple
ExecStartPre=-/usr/bin/docker stop -t 15 "my-service"
ExecStartPre=-/usr/bin/docker rm -f "my-service"
ExecStart=/usr/bin/docker run --name="my-service" \
  --hostname "%H" \
  registry:5000/my-service:1.0.5
ExecStop=/usr/bin/docker stop "my-service"
TimeoutStartSec=0
Restart=on-failure

[Install]
WantedBy=multi-user.target

And then run systemctl enable my-service.service.

praveennous commented 9 years ago

Hi Nick,

Thanks for the info. I am using Ubuntu with kernel 3.13.0-32-generic, and I don't have systemctl. I put the section below in /etc/init.d/hello, ran service hello start, and got the errors below. Please find the details below and advise.

"memcache" is my container here.

root@ubuntu:/etc/init.d# cat hello
[Unit]
Description=MyApp
After=docker.service
Requires=docker.service

[Service]
TimeoutStartSec=0
ExecStartPre=-/usr/bin/docker kill memcache
ExecStartPre=-/usr/bin/docker rm memcache
ExecStartPre=/usr/bin/docker pull memcache
ExecStart=/usr/bin/docker run --name memcache euw1-docker-registry.motortrak.com:8080/techops/memcache:latest /bin/sh -c "while true; do echo Hello World; sleep 1; done"

[Install]
WantedBy=multi-user.target

Error:

root@ubuntu:/etc/init.d# service hello start
/etc/init.d/hello: 1: /etc/init.d/hello: [Unit]: not found
/etc/init.d/hello: 6: /etc/init.d/hello: [Service]: not found
/etc/init.d/hello: 8: kill: Illegal number: memcache
rm: cannot remove ‘memcache’: No such file or directory
/etc/init.d/hello: 10: /etc/init.d/hello: pull: not found
/etc/init.d/hello: 11: /etc/init.d/hello: run: not found
/etc/init.d/hello: 13: /etc/init.d/hello: [Install]: not found

Do I need to change anything here?

Please advise.

Thanks.

mischief commented 9 years ago

@praveennous it looks like you're not actually using fleet or systemd. If you have an issue with fleet, please file another bug.

praveennous commented 9 years ago

Hi Nick,

Can you please let me know whether autostarting docker containers at system boot can be done on the OS version below?

root@qa2:/etc/init.d# uname -a
Linux qa2.motortrak.com 3.13.0-32-generic #57-Ubuntu SMP Tue Jul 15 03:51:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
root@qa2:/etc/init.d# uname -r
3.13.0-32-generic

Also, I found the following in /etc/init.d/docker. Is it related to a dependency on the OS version?


# see also init_is_upstart in /lib/lsb/init-functions (which isn't
# available in Ubuntu 12.04, or we'd use it)
if [ -x /sbin/initctl ] && /sbin/initctl version 2>/dev/null | grep -q upstart; then
    log_failure_msg "$DOCKER_DESC is managed via upstart, try using service $BASE $1"
    exit 1
fi

Thanks again for your prompt response.

Regards, nouspraveen

wuqixuan commented 8 years ago

@bcwaldon @mischief I think this has nothing to do with fleet; it is about docker and systemd. This ticket can be closed.