eBayClassifiedsGroup / PanteraS

PanteraS - PaaS - Platform as a Service in a box

Checkpointing not working #247

Closed cookandy closed 7 years ago

cookandy commented 7 years ago

I'm still investigating, but I was wondering if you've run into something similar. This page says:

If the mesos-agent process on a host exits (perhaps due to a Mesos bug or because the operator kills the process while upgrading Mesos), any executors/tasks that were being managed by the mesos-agent process will continue to run. When mesos-agent is restarted, the operator can control how those old executors/tasks are handled:

  1. By default, all the executors/tasks that were being managed by the old mesos-agent process are killed.
  2. If a framework enabled checkpointing when it registered with the master, any executors belonging to that framework can reconnect to the new mesos-agent process and continue running uninterrupted.

I can see from the master log that Marathon subscribes with checkpointing enabled:

I0125 23:51:37.156141 43580 master.cpp:2500] Subscribing framework marathon with checkpointing enabled and capabilities [  ]

However, when I restart the PanteraS container on a slave host, the other (spawned) docker containers continue to run until the PanteraS container has been up for 30 seconds, and then they are restarted.

I've tried setting --strict=false on the mesos agent, but it didn't help. I thought I remembered containers staying up when the mesos agent was restarted, but that doesn't seem to be the case. Any ideas?
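For reference, these are the agent-side recovery knobs involved (a sketch; the values shown are the Mesos defaults plus the --strict change, not necessarily what PanteraS sets):

mesos-slave --containerizers=docker,mesos \
            --recover=reconnect \
            --strict=false \
            --recovery_timeout=15mins

recover=reconnect tells a restarted agent to reattach to checkpointed executors, strict=false tolerates errors in the recovered state, and recovery_timeout is how long waiting executors hang around before self-terminating.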

sielaq commented 7 years ago

If you have stateless applications, why bother? Try playing with the stop signal that is sent to the mesos slave. By default we have chosen USR1 https://github.com/eBayClassifiedsGroup/PanteraS/blob/master/infrastructure/supervisord.conf#L92 which fits our use case: we always do a fresh restart. This is roughly what our systemd unit /etc/systemd/system/paas.service looks like:

[Unit]
Description=PaaS
RequiresMountsFor=/var/spool/marathon/artifacts
After=var-spool-marathon-artifacts.mount
After=docker.service
BindsTo=docker.service
Conflicts=shutdown.target reboot.target halt.target

[Service]
EnvironmentFile=-/etc/default/panteras
WorkingDirectory=/opt/PanteraS
ExecStartPre=/bin/bash /opt/PanteraS/generate_yml.sh
ExecStartPre=-/usr/local/bin/docker-compose stop
ExecStartPre=-/usr/local/bin/docker-compose rm -f
ExecStartPre=-/bin/rm -rf /tmp/mesos/*
ExecStartPre=-/bin/rm -rf /tmp/consul/*
ExecStart=/usr/local/bin/docker-compose up --force-recreate --no-deps
ExecStop=/usr/local/bin/docker-compose stop
Restart=on-failure

[Install]
WantedBy=default.target

So when we restart, we clean up everything we have. We treat slaves as if they can die or panic at any time; whenever they come up, they are brand new slaves.
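The relevant part of that supervisord config is roughly this (a sketch; see the linked file for the real program definition):

[program:mesos-slave]
; SIGUSR1 makes the agent unregister from the master right away,
; so its tasks get rescheduled onto other slaves
command=...
stopsignal=USR1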

cookandy commented 7 years ago

If you have stateless applications, why bother?

I use stateful services too (like databases)...

Thanks, I'll play around with the stop signal. Question: why did you use USR1? Are you using it to trigger some other action on the OS?

cookandy commented 7 years ago

When I change the signal to TERM or KILL, here's what I notice:

Stopping and Starting (waiting 75 sec)

  1. Deploy mongoDB to 3 slaves
  2. Stop one slave (systemctl stop panteras)
    • Mesos shows agent as deactivated
    • Marathon still shows the application as Running (3 of 3 instances)
  3. After 75 sec (agent_ping_timeout * max_agent_ping_timeouts; see the note at the end of this comment)...
    • Marathon shows waiting (2 of 3 instances)
    • Mesos shows LOST
    • The mongo container continues to run
  4. Start Panteras (systemctl start panteras)
    • Mesos registers the agent as a new agent
    • Marathon deploys a new instance of mongo
    • The old mongo container is killed and restarted

Restarting (< 75 sec)

  1. Deploy mongoDB to 3 slaves
  2. Restart one slave (systemctl restart panteras)
    • Mesos shows agent as deactivated
    • Marathon shows waiting (2 of 3 instances)
    • The mongo container continues to run
  3. PanteraS container starts and re-registers to Mesos master
    • Mesos shows agent as 're-registered'
    • Marathon shows Waiting/Delayed as it tries to start a new instance of mongo
    • The old instance of mongo continues to run; however, new instances also keep trying to start on the same slave (and they FAIL because only one is allowed to run). This continues forever until the Marathon job is destroyed.

In the second scenario I would expect Marathon to detect the existing mongo container and connect to it (since recover=reconnect is set in Mesos). I have also tried setting --strict=false, but I see the same behavior.

The goal is to keep the spawned containers running, so that in the event of a PanteraS upgrade, we don't need to restart the child containers...
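For reference, the 75 sec above is simply the product of the master's agent-ping settings (shown here with the Mesos defaults):

mesos-master --agent_ping_timeout=15secs --max_agent_ping_timeouts=5
# 15 sec per ping x 5 missed pings = 75 sec before the agent is marked LOST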

sielaq commented 7 years ago

Hmm, that might be a little bit complex. Mesos loses information about the slaves.

  1. You can't use the USR1 signal (since it tries to move services to another host); you have to use TERM or the default.
  2. You definitely can't remove the mesos dir, so skip this:
    ExecStartPre=-/bin/rm -rf /tmp/mesos/*
  3. And I'm not sure about the zookeeper dir; that might also need to be stored outside the container, like we do for consul and mesos.

cookandy commented 7 years ago

I'm not removing anything (other than the stopped container) on startup:

[Unit]
Description=PaaS
After=network.target docker.service
Requires=docker.service

[Service]
TimeoutStartSec=5min
ExecStartPre=-/usr/bin/docker stop paas
ExecStartPre=-/usr/bin/docker rm paas
ExecStart=/usr/bin/docker run --privileged --pid=host --name paas \
  --env-file /etc/paas/paas.conf --network host \
  -v /etc/resolv.conf:/etc/resolv.conf.orig \
  -v /var/spool/marathon/artifacts/store:/var/spool/store \
  -v /var/run/docker.sock:/tmp/docker.sock \
  -v /var/lib/docker:/var/lib/docker \
  -v /sys:/sys \
  -v /tmp/mesos:/tmp/mesos:shared \
  -v /tmp/supervisord:/tmp/supervisord \
  -v /tmp/consul/data:/opt/consul/data \
  -v /proc:/host/proc:ro \
  -v /sys:/host/sys:ro \
  -v /data/docker-auth:/opt/docker-auth \
  -v /home/acook/supervisord.conf:/etc/supervisord.conf \
  panteras:latest

[Install]
WantedBy=multi-user.target

I only have -v /home/acook/supervisord.conf:/etc/supervisord.conf mapped so I can play with SIGTERM stuff.

According to this article, agent recovery should be possible as long as the framework is checkpointing (which it is). I don't think keeping the zookeeper data is needed, since this is a slave and I'm not restarting any masters in my scenario.
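Since checkpointing writes the recovery state under the agent work_dir, one way to sanity-check that it survives a restart (assuming the work_dir is /tmp/mesos, as in the volume mount above):

ls -l /tmp/mesos/meta/slaves/latest
# 'latest' is a symlink to the current agent ID; the checkpointed framework,
# executor and task state underneath it must survive the container restart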

sielaq commented 7 years ago

So: you are not removing any dirs (mesos), you use a different stop signal, you use the --strict=false flag, and you recover in less than 75 seconds.

If all of that together is not working, I have only a few ideas left:

  1. Try other signals like TERM, HUP, INT, QUIT, KILL, or USR2.
  2. Try a different mesos version: an older one from before 1.0 (0.28.2) or the newest one (1.1.0).

sielaq commented 7 years ago

Another suggestion just came to my mind: before stopping panteras with systemctl stop panteras (which will do docker stop paas), try doing this:

docker exec -ti <panteras> bash
supervisorctl stop mesos-slave

or just:

pkill -USR1 mesos-slave

and play with

pkill -TERM mesos-slave
pkill -QUIT mesos-slave
pkill -KILL mesos-slave

So don't try to stop it with docker/systemd; manually send the signal instead. It might be that mesos gets the wrong signal from its parent process.

cookandy commented 7 years ago

manually send the signal instead. It might be that mesos gets the wrong signal from its parent process

It's definitely a problem with the signal not being passed along correctly. I still haven't found the real fix, but I did find some interesting results.

The following commands work, meaning the originally spawned containers keep running correctly after mesos-slave rejoins the master:

pkill -KILL mesos-slave
pkill -TERM mesos-slave
docker exec panteras supervisorctl signal KILL mesos-slave

Unfortunately, none of these commands works as an ExecStartPre step, because that requires setting autorestart=false for mesos-slave in supervisord.conf (see the sketch below).
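What I mean is a drain step in the unit roughly like this (hypothetical sketch; without autorestart=false supervisord restarts the killed mesos-slave before the container is actually stopped):

ExecStartPre=-/usr/bin/docker exec paas supervisorctl signal KILL mesos-slave
ExecStartPre=-/usr/bin/docker stop paas
ExecStartPre=-/usr/bin/docker rm paas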

The following commands (surprisingly) DO NOT work:

docker kill -s KILL panteras
systemctl kill panteras.service

Also, running a normal systemctl restart panteras didn't work, even when adding stopsignal=KILL in supervisord.conf and KillSignal=SIGKILL in the systemd service. The systemd side might behave oddly because the unit's main process is the docker client binary.

Instead of recovering the containers, the mesos slave launches another instance of the container on the host and leaves the old one running. If I then restart panteras again, once the mesos slave rejoins, the original (first) instance of the spawned container is killed, the second instance keeps running, and yet another new instance is started, leaving two instances running again.

I'm still hunting; just wanted to update you. I'm really not sure why docker exec panteras supervisorctl signal KILL mesos-slave works but stopsignal=KILL doesn't.

cookandy commented 7 years ago

Another interesting finding: using stopsignal=KILL works in this scenario (and doesn't require changing autorestart):

docker exec panteras supervisorctl stop mesos-slave
docker exec panteras supervisorctl start mesos-slave

However, this doesn't seem to work 🤔

docker exec panteras supervisorctl stop mesos-slave
docker kill panteras
docker start panteras
<mesos-slave gets autostarted>

It seems like docker is somehow messing things up, even though in this case the panteras container ID never changed...

also, this seems to work fine:

docker exec panteras supervisorctl stop mesos-slave
docker exec panteras supervisorctl reload
<supervisord restarts>
<mesos-slave starts>

Meaning the docker daemon seems to be the culprit...

cookandy commented 7 years ago

I think it's a problem with supervisord's pid changing; in the reload case the pid stays the same.

However, anything that changes the pid causes the issue:

docker exec panteras supervisorctl stop mesos-slave
docker exec panteras supervisorctl shutdown
<panteras stops>
docker start panteras
<mesos slave starts>

as does this:

pkill supervisord
<panteras stops>
docker start panteras
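One way to check that (a sketch; supervisorctl's pid subcommand prints supervisord's own pid):

docker exec panteras supervisorctl pid
# or from the host (the container runs with --pid=host):
pgrep -x supervisord
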
sielaq commented 7 years ago

good findings!

I think it's a problem with supervisord's pid changing

If this is the problem, try removing this: https://github.com/eBayClassifiedsGroup/PanteraS/blob/master/docker-compose.yml.tpl#L5 (pid: host), so the docker container will have its own PID namespace and supervisord will always have pid 1. Not sure if this is going to help, but it's worth trying.
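Roughly, the change in docker-compose.yml.tpl would look like this (a sketch; the service definition here is illustrative, not the exact template):

panteras:
  image: panteras:latest
  privileged: true
  net: "host"
  # pid: "host"   <- drop this line so the container gets its own PID namespace and supervisord is always PID 1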

cookandy commented 7 years ago

good suggestion, unfortunately it didn't work... :(

sielaq commented 7 years ago

It might be that supervisord + python is the problem. There is a supervisord 3.3.1 release that might be worth trying, though it would probably require adapting a lot of configuration. I did not find any changes regarding signal handling, but together with switching to python 3.x it might make a difference, since the signal module from the newer python would be used and 3.3.1 is python3 compatible, so that combination may change a lot.

cookandy commented 7 years ago

I finally got this working, but I'm not sure it will be easy to incorporate into the project. I started reading the Docker containerizer documentation, which says:

The Docker containerizer supports recovering Docker containers when the agent restarts, which supports both when the agent is running in a Docker container or not.

With the --docker_mesos_image flag enabled, the Docker containerizer assumes the containerizer is running in a container itself and modifies the mechanism it recovers and launches docker containers accordingly.
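On the agent side, enabling that flag looks roughly like this (a sketch; the image name is illustrative):

mesos-slave --containerizers=docker,mesos \
            --docker_mesos_image=panteras:latest

With the flag set, the Docker containerizer knows the agent itself runs inside a container and adjusts how it launches and recovers the task containers.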

In order to make this work, I had to make the following changes:

I will close this issue, but wanted to provide you the details first.

sielaq commented 7 years ago

Now I remember that option. BTW, I have opened an issue for the new mesos 1.1.0: your stuff will not work with the MESOS_HTTP checks that were introduced. https://issues.apache.org/jira/browse/MESOS-7210

I can adapt PanteraS so you will be able to use that stuff too.

cookandy commented 7 years ago

Thanks for testing against 1.1.0, I was just headed down that road this morning!

I'll wait on upgrading until that issue is resolved.

cookandy commented 7 years ago

Question: why do you use MESOS_HTTP health checks instead of just HTTP?

sielaq commented 7 years ago

Very good question.

For the same reason that the consul agents do the health checks rather than the masters: Marathon health checks do not scale well. If you have thousands of services, Marathon cannot handle that.

please read those two articles:

https://mesosphere.com/blog/2017/01/05/introducing-mesos-native-health-checks-apache-mesos-part-1/
https://mesosphere.com/blog/2017/01/17/introducing-mesos-native-health-checks-apache-mesos-part-2/
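Concretely, the only difference in the app definition is the health-check protocol; with MESOS_HTTP the check is executed by the agent running the task instead of by Marathon itself (a sketch):

"healthChecks": [
  {
    "protocol": "MESOS_HTTP",
    "path": "/health",
    "portIndex": 0,
    "intervalSeconds": 20
  }
]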

cookandy commented 7 years ago

Thanks for the info! I will keep this in mind as our services continue to grow.

Do you ever have problems with consul health checks flooding your services? I can imagine if you had 100 slaves and were getting 100 health checks on your application that it might cause issues.

sielaq commented 7 years ago

No, we have no problems with consul health checks. As I said, the checks are done by the agents, not the masters, so it scales much better. But we are using one check per application instance (a ping).

cookandy commented 7 years ago

But we are using one check per application instance (a ping).

Yes, but each consul agent (panteras slave) would ping the app - correct? For example:

   "SERVICE_3000_CHECK_HTTP" : "/health",
   "SERVICE_3000_CHECK_INTERVAL" : "20s",

This causes each slave to ping the app every 20 sec. I have consul running on 15 slaves, so every 20 seconds I see 15 GET /health requests. Is that correct? Or have I misconfigured something?

sielaq commented 7 years ago

Hmm, why should each slave ping an application that is not running on its host? Let's say you have 15 slaves, but 3 instances of one app, running on different slaves.

So only 3 of the 15 consuls are going to ping an instance, namely the one running on their own host; each of those 3 consul agents checks only one instance. In other words, each app instance should get only one GET /health request every 20 seconds (not 3 requests, nor 15).

Yes, it is not a perfect solution, but it is still better than having the masters do it (like Marathon did). Moreover, you can always use TTL health checks, which is the best option:

SERVICE_CHECK_TTL=30s

but it requires the application to notify its consul agent that it is alive.
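A minimal sketch of that notification against the local consul agent (the check id is a placeholder):

curl -s -X PUT http://localhost:8500/v1/agent/check/pass/<check-id>

If the agent does not hear from the application within the TTL, the check turns critical and the service drops out of DNS/catalog results.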