eBayClassifiedsGroup / PanteraS

PanteraS - PaaS - Platform as a Service in a box
GNU General Public License v2.0

Graceful upgrade/shutdown of application #234

Closed cookandy closed 7 years ago

cookandy commented 7 years ago

Hello,

I have a simple node application deployed via marathon. I am trying to find a way to gracefully shut down the application so that no open connections are disconnected. I see @sielaq has posted some info here: https://github.com/mesosphere/marathon/issues/712

Is there any way you could describe in a bit more detail how you are handling this situation? It appears you are using a wrapper script and some extra mesos slave arguments. Is this correct?

sielaq commented 7 years ago

Right. Basically, we use framework images like these, which provide that functionality: https://github.com/eBayClassifiedsGroup/PanteraS/tree/master/frameworks

I think the best way to see how it works is to go through this example step by step: https://github.com/eBayClassifiedsGroup/PanteraS/tree/master/examples/SmoothWebappPython

  1. Build the image
  2. Deploy it
  3. Go inside the container, e.g. docker exec -ti 9670e6203c2c bash
  4. You should see something like this:
    bash-4.3# ps axuf
    PID   USER     TIME   COMMAND
    1 root       0:00 {start.sh} /bin/bash /usr/local/bin/start.sh cd /opt/web/ && python3 -m http.server --cgi
    6 root       0:00 {start.sh} /bin/bash /usr/local/bin/start.sh cd /opt/web/ && python3 -m http.server --cgi
    7 root       0:00 python3 -m http.server --cgi
    8 root       0:00 bash
    13 root       0:00 ps axuf
  5. Now look at the script with cat /usr/local/bin/start.sh. This is the part you are interested in:

    maintenance(){
      # Container name will be provided by mesos:
      MESOS_CONTAINER_NAME=${MESOS_CONTAINER_NAME:-$CONTAINER_NAME}
      # Mesos provides external ports in comma-separated $PORTS
      for port in $(sed 's/,/ /g'<<<${PORTS})
      do
        # For each external port you can look up the internal one from PORT_${int}
        port_int=$(env|sed -n "s/PORT_\([0-9]*\)=$port/\1/p")
        # registrator uses a ServiceID built from variables that are all available here
        consul_service_id="${HOST%%.*}:${MESOS_CONTAINER_NAME}:${port_int}"
        curl -X PUT "http://${HOST}:8500/v1/agent/service/maintenance/${consul_service_id}?enable=true"
        # if you use udp, also uncomment the udp line to switch it into maintenance mode
        #curl -X PUT "http://${HOST}:8500/v1/agent/service/maintenance/${consul_service_id}:udp?enable=true"
      done
    }

    trap 'maintenance && sleep 2 && kill -TERM $PID $PID_CUSTOM' TERM INT

6. BUT do not run it. Instead, run this small part inside the container (it only echoes the curl command):

    for port in $(sed 's/,/ /g'<<<${PORTS})
    do
      port_int=$(env|sed -n "s/PORT_\([0-9]*\)=$port/\1/p")
      consul_service_id="${HOST%%.*}:${MESOS_CONTAINER_NAME}:${port_int}"
      echo curl -X PUT "http://${HOST}:8500/v1/agent/service/maintenance/${consul_service_id}?enable=true"
    done

You should see something like this:

curl -X PUT http://your_host:8500/v1/agent/service/maintenance/your_consul_service_id:mesos-c8e53c87-2b5e-46b3-a706-89c42cbdb5bf-S0.d665ae48-15bd-4226-b4fa-b86e5b4bd1ac:8000?enable=true
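
To see how the pieces of the ServiceID fit together without touching Consul, here is a minimal offline sketch of the same loop. All values (host name, container name, ports) are made-up samples rather than values from a real deployment, and the sed pattern is anchored with ^ and $ here, which is a small deviation from start.sh.

```shell
#!/usr/bin/env bash
# Sample values only; in a real container Mesos/Marathon provide these.
export HOST="paasslave001.example.com"                    # hypothetical agent hostname
export MESOS_CONTAINER_NAME="mesos-sample-S0.sample-task" # hypothetical container name
export PORTS="31208,31209"                                # external (host) ports
export PORT_8000=31208 PORT_8001=31209                    # internal port -> external port

service_ids() {
  local port port_int
  # Split the comma-separated list of external ports.
  for port in ${PORTS//,/ }; do
    # Find the PORT_<internal> variable whose value is this external port.
    port_int=$(env | sed -n "s/^PORT_\([0-9]*\)=${port}\$/\1/p")
    # <short host>:<container name>:<internal port>
    echo "${HOST%%.*}:${MESOS_CONTAINER_NAME}:${port_int}"
  done
}

service_ids
```

With the sample values this prints one ID per port, starting with paasslave001:mesos-sample-S0.sample-task:8000. Comparing such output with the ServiceID shown in the Consul UI is a quick way to check that the script and the registration agree.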

7. Run this curl a few times with `enable=true` or `enable=false` and verify whether the service goes into maintenance mode and back: it should be orange in consul, and green when it is back.
 - If not, verify your DNS: `$HOST` (`your_host`) from the curl above must be resolvable inside the container!
That's very important!

8. If all is fine, the trap is ready: it will catch the TERM (or INT) signal and take the service out of rotation (put it into maintenance mode) before the application is killed, so it is removed from load balancing first.
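
The order of events in that trap can be tried out in isolation. The following is only a sketch under stated assumptions: maintenance() is a hypothetical stub that writes to a temp file instead of calling Consul, sleep stands in for the application, the 2-second pause from start.sh is left out, and bash 4 or newer is assumed for BASHPID.

```shell
#!/usr/bin/env bash
# Hypothetical stub: the real maintenance() calls the Consul maintenance endpoint.
LOG=$(mktemp)
maintenance() { echo "maintenance enabled" >> "$LOG"; }

# Stand-in for the real application process.
sleep 30 &
PID=$!

# On TERM or INT: drain first, then forward the signal to the child.
trap 'maintenance && kill -TERM "$PID"' TERM INT

# Simulate the orchestrator stopping the container by signalling this shell.
kill -TERM "$BASHPID"

# wait reports 128 + signal number when the child was terminated by a signal.
STATUS=0
wait "$PID" || STATUS=$?
echo "log: $(cat "$LOG"), child exit status: $STATUS"
```

The point of the design is that the drain step runs before the child ever sees the signal, so the load balancer has a chance to stop sending traffic first.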

9. Now you can play with scaling up and down, and test with:

while true; do curl -H 'Host: python-smooth.service.consul' http:///cgi-bin/index; done


This should always give you back something, without a 503.
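
Watching an endless loop for the occasional 503 is easy to get wrong, so a bounded variant that counts failures may help. This is a sketch, not part of PanteraS: fetch_status and count_failures are hypothetical helper names, and the URL passed in is whatever address your load balancer listens on.

```shell
#!/usr/bin/env bash
# Return only the HTTP status code for one request with a given Host header.
fetch_status() {
  curl -s -o /dev/null -w '%{http_code}' -H "Host: $2" "$1"
}

# Issue N requests and print how many did not return 200.
count_failures() {
  local url=$1 vhost=$2 n=$3 i code failures=0
  for ((i = 0; i < n; i++)); do
    code=$(fetch_status "$url" "$vhost")
    [ "$code" = "200" ] || failures=$((failures + 1))
  done
  echo "$failures"
}

# Usage (the address is a placeholder, fill in your own):
#   count_failures "http://<lb_address>/cgi-bin/index" python-smooth.service.consul 100
```

Running it while scaling the app up and down should report 0 failures if the maintenance trap is doing its job.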

Please try it, and if you have any questions, just ask!
cookandy commented 7 years ago

Thanks for the reply @sielaq! Very helpful as always.

I am running into a problem with the curl command. When I run the for loop, I get back this command:

curl -X PUT http://10.134.26.172:8500/v1/agent/service/maintenance/10:mesos-9832c5fa-0031-4189-aa82-d9381128bb01-S7.d0b28555-7902-4077-8215-cb5e737bf2b7:8000?enable=true

However, when I run that curl command, I get back this error:

No service registered with ID "10:mesos-9832c5fa-0031-4189-aa82-d9381128bb01-S7.d0b28555-7902-4077-8215-cb5e737bf2b7:8000"

I can definitely resolve my ${HOST} from inside the container. Any ideas why consul would report back no registered service? Is it to do with the consul_service_id and ${HOST%%.*} only returning 10:?

cookandy commented 7 years ago

I think the problem is that when I look at consul UI, I see the service listed as:

service:my-host-name-01:mesos-9832c5fa-0031-4189-aa82-d9381128bb01-S7.d0b28555-7902-4077-8215-cb5e737bf2b7:8000?enable=true

Yet, the curl command is only returning the first octet of the IP, 10::mesos-983...., instead of the hostname.
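
The single octet is what the ${HOST%%.*} expansion produces for a dotted IP: it removes the longest suffix starting at a dot, which means everything after the first dot. A small sketch (short_name is a hypothetical helper and example.com a placeholder domain):

```shell
#!/usr/bin/env bash
# Strip everything from the first dot onwards, as start.sh does with $HOST.
short_name() { printf '%s\n' "${1%%.*}"; }

short_name "paas-slave-01.example.com"   # prints: paas-slave-01
short_name "10.134.26.172"               # prints: 10
```

So the expansion works as intended for an FQDN, but reduces an IP address to its first octet.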

Unfortunately using the whole IP as the service name doesn't work either, I get the No service registered with ID error.

When I run the command with my-host-name, for example:

curl -X PUT http://10.134.26.172:8500/v1/agent/service/maintenance/my-host-name:mesos-9832c5fa-0031-4189-aa82-d9381128bb01-S7.d0b28555-7902-4077-8215-cb5e737bf2b7:8000?enable=true

I get back an empty response (which I think is good).

However, running enable=true and enable=false makes no difference.

Any ideas why the curl command is returning the IP address instead of hostname? I use the following command to start consul:

agent -client=10.134.26.172 -advertise=10.134.26.172 -bind=10.134.26.172 -data-dir=/opt/consul/data -ui -node=10.134.26.172 -dc=DC1 -domain consul -server -join=10.134.22.239 -join=10.134.23.87 -join=10.134.26.121

sielaq commented 7 years ago

curl -X PUT http://10.134.26.172:8500/v1/agent/service/maintenance/my-host-name:mesos-9832c5fa-0031-4189-aa82-d9381128bb01-S7.d0b28555-7902-4077-8215-cb5e737bf2b7:8000?enable=true

I get back an empty response (which I think is good).

However, running enable=true and enable=false makes no difference

We are very close. Yes, you are right that the problem is $HOST: we use a hostname there, while you use an IP. How do you know that it makes no difference? You should get an empty response, but in the UI you should see the service marked as Service Maintenance Mode with status failing (orange in the UI).

cookandy commented 7 years ago

Oh, my mistake- it does work. I thought enable=false would put it into Service Maintenance Mode, but it is actually enable=true.

sielaq commented 7 years ago

So now you either need your own script that uses a different variable instead of $HOST, or you need to start using an FQDN or hostname instead of an IP. You can always verify the result inside the container and compare it with the consul UI:

consul_service_id="${HOST%%.*}:${MESOS_CONTAINER_NAME}:${port_int}"
echo $consul_service_id
sielaq commented 7 years ago

I wonder why you have an IP in $HOST?

cookandy commented 7 years ago

So I need to figure out how to register my services in consul with the IP address, instead of the hostname, or somehow get the hostname of the machine from the IP. Where does $HOST come from?

cookandy commented 7 years ago

In my PanteraS env file I have:

CONSUL_IP=10.134.26.172
HOST_IP=10.134.26.172
sielaq commented 7 years ago

Where does $HOST come from?

It is injected into the container by marathon.

sielaq commented 7 years ago

Or mesos, let me check...

cookandy commented 7 years ago

I think maybe it comes from the Mesos slave. BTW, I don't have any name resolution between my masters and slaves, which is why I've used IP addresses throughout my environment file. Even my MESOS_SLAVE_APP_PARAMS is configured to use --hostname=10.134.26.172.

Is there a way to register the service in Consul with the IP address, instead of hostname? Then I could just modify the ${HOST%%.*} part of the variable...

sielaq commented 7 years ago

Exactly, this comes from mesos-slave --hostname; I have just checked that. If you use a hostname (and add it to /etc/hosts so that this IP matches the hostname), then all should be fine.

Unfortunately, registrator uses the hostname. You can check its options; it might be possible to change that.

cookandy commented 7 years ago

if you use a hostname (add into /etc/hosts so this IP will match hostname) then all should be fine.

But then I'd need to map /etc/hosts to my container via marathon, correct?

If it is registrator, I would think using the -ip flag would work since the documentation says Force IP address used for registering services.

By default, when registering a service, Registrator will assign the service address by attempting to resolve the current hostname. If you would like to force the service address to be a specific address, you can specify the -ip argument.

http://gliderlabs.com/registrator/latest/user/run/#registrator-options

sielaq commented 7 years ago

nope - we already use ip flag :(

cookandy commented 7 years ago

Can you confirm that you're seeing the hostname in the ServiceID, instead of IP address?

sielaq commented 7 years ago

Yes, confirmed. As I said: with mesos-slave --hostname=paasslave001 I see

"ServiceID": "paasslave001:mesos-361ed1a9-6409-41f0-8e39-f846582ec1a4-S8.4d287337-1ff0-4573-9ded-dc7eedefc0b8:8080",

With mesos-slave --hostname=10.0.0.100 I see the IP instead, like:

"ServiceID": "10.0.0.100:mesos-361ed1a9-6409-41f0-8e39-f846582ec1a4-S0.02eb068a-f66a-42b7-a950-e98baa848452:8080",
cookandy commented 7 years ago

Really? That's strange... when I use:

MESOS_SLAVE_APP_PARAMS=--master=zk://10.134.22.239:2181,10.134.23.87:2181,10.134.26.121:2181/mesos --containerizers=docker,mesos --executor_registration_timeout=5mins --hostname=10.134.26.172 --ip=10.134.26.172 --docker_stop_timeout=5secs --gc_delay=1days --docker_socket=/tmp/docker.sock --no-systemd_enable_support --work_dir=/tmp/mesos

I still see this in consul:

"ServiceID": "paas-slave-01:mesos-9832c5fa-0031-4189-aa82-d9381128bb01-S3.146058e1-f928-49a2-86c9-43df476530f5:8000"

Is it because you are using --hostname 10.0.0.100 instead of --hostname=10.0.0.100 (missing =)?

cookandy commented 7 years ago

Actually, here is my entire env file used by the PanteraS container:

CONSUL_IP=10.134.26.172
HOST_IP=10.134.26.172
LISTEN_IP=10.134.26.172
FQDN=paas-slave-01
GOMAXPROCS=4

TYPE=slave
MASTER_COUNT=3

START_CONSUL=true
START_CONSUL_TEMPLATE=true
START_DNSMASQ=true
START_MESOS_MASTER=false
START_MARATHON=false
START_MESOS_AGENT=true
START_REGISTRATOR=true
START_ZOOKEEPER=false
START_CHRONOS=false
START_FABIO=false
START_NETDATA=false
HAPROXY_SSL=false

CONSUL_APP_PARAMS=agent -client=10.134.26.172 -advertise=10.134.26.172 -bind=10.134.26.172 -data-dir=/opt/consul/data  -ui -node=10.134.26.172 -dc=DC1 -domain consul -server  -join=10.134.22.239 -join=10.134.23.87 -join=10.134.26.121
CONSUL_DOMAIN=consul
CONSUL_TEMPLATE_APP_PARAMS=-consul=10.134.26.172:8500  -template haproxy.cfg.ctmpl:/etc/haproxy/haproxy.cfg:/opt/consul-template/haproxy_reload.sh -max-stale=0
DNSMASQ_APP_PARAMS=-d  -u dnsmasq -r /etc/resolv.conf.orig -7 /etc/dnsmasq.d --server=/consul/10.134.26.172#8600 --host-record=sf1-paas-slave-01,10.134.26.172 --address=/consul/10.134.26.172
HAPROXY_ADD_DOMAIN=
HAPROXY_CERT_OPTS=
MARATHON_APP_PARAMS=--master zk://10.134.22.239:2181,10.134.23.87:2181,10.134.26.121:2181/mesos --zk zk://10.134.22.239:2181,10.134.23.87:2181,10.134.26.121:2181/marathon --hostname 10.134.26.172 --no-logger --http_address 10.134.26.172 --https_address 10.134.26.172
MESOS_MASTER_APP_PARAMS=--zk=zk://10.134.22.239:2181,10.134.23.87:2181,10.134.26.121:2181/mesos --work_dir=/var/lib/mesos --quorum=2 --ip=10.134.26.172 --hostname=10.134.26.172 --cluster=mesoscluster
MESOS_SLAVE_APP_PARAMS=--master=zk://10.134.22.239:2181,10.134.23.87:2181,10.134.26.121:2181/mesos  --containerizers=docker,mesos --executor_registration_timeout=5mins --hostname=10.134.26.172 --ip=10.134.26.172 --docker_stop_timeout=5secs --gc_delay=1days --docker_socket=/tmp/docker.sock --no-systemd_enable_support --work_dir=/tmp/mesos
REGISTRATOR_APP_PARAMS=-cleanup -ip=10.134.26.172 consul://10.134.26.172:8500
ZOOKEEPER_APP_PARAMS=start-foreground
ZOOKEEPER_HOSTS=10.134.22.239:2181,10.134.23.87:2181,10.134.26.121:2181
ZOOKEEPER_ID=1
KEEPALIVED_VIP=
CHRONOS_APP_PARAMS=--master zk://10.134.22.239:2181,10.134.23.87:2181,10.134.26.121:2181/mesos --zk_hosts 10.134.22.239:2181,10.134.23.87:2181,10.134.26.121:2181 --hostname 10.134.26.172 --http_address 10.134.26.172 --http_port 4400
FABIO_APP_PARAMS=-cfg ./fabio.properties -registry.consul.addr 10.134.26.172:8500
NETDATA_APP_PARAMS=-nd -ch /host
HOSTNAME=paas-slave-01

The only place I reference the hostname is on HOSTNAME and FQDN. Do you have an example of your env file?

sielaq commented 7 years ago

Is it because you are using --hostname 10.0.0.100 instead of --hostname=10.0.0.100 (missing =)?

Aah, I get it now. Is it really that? Let me check. EDIT: nah, without '=' it doesn't work at all... My previous message was not copied and pasted directly; you got me wrong :)

cookandy commented 7 years ago

It definitely seems to be registrator creating that entry incorrectly:

registrator stderr | 2016/11/30 20:05:50 added: 274744eaa299 paas-slave-05:mesos-9832c5fa-0031-4189-aa82-d9381128bb01-S21.c7187f91-5b2b-4ec1-b239-58065cf43616:8000
sielaq commented 7 years ago

To me it looks good. I still do not understand why you can't use a real name like:

MESOS_SLAVE_APP_PARAMS=--master=zk://10.134.22.239:2181,10.134.23.87:2181,10.134.26.121:2181/mesos  --containerizers=docker,mesos --executor_registration_timeout=5mins --hostname=paas-slave-01 --ip=10.134.26.172 --docker_stop_timeout=5secs --gc_delay=1days --docker_socket=/tmp/docker.sock --no-systemd_enable_support --work_dir=/tmp/mesos
cookandy commented 7 years ago

Really? When you use --hostname=10.0.0.100 you see the Consul ServiceID with the IP address?

For example:

10.0.0.100:mesos-9832c5fa-0031-4189-aa82-d9381128bb01-S21.c7187f91-5b2b-4ec1-b239-58065cf43616:8000

??

I still do not understand why you can't use real name

I have no DNS between my slaves, which means I would need to add an entry in /etc/hosts to resolve the $HOST value. I would then need to map /etc/hosts to each of my containers, which often causes problems with my apps.

When I use --hostname=10.134.26.172, my $HOST value is 10.134.26.172, but the ServiceID in Consul still somehow gets the hostname, for example:

paas-slave-01:mesos-9832c5fa-0031-4189-aa82-d9381128bb01-S21.c7187f91-5b2b-4ec1-b239-58065cf43616:8000

When I use --hostname=paas-slave-01, my $HOST value is paas-slave-01, but the curl will fail because it can't resolve the name.

I'm really struggling to figure out why you see the ServiceID with IP address, and not the hostname...

cookandy commented 7 years ago

^^ when you look at the consul web UI (or call the API with curl http://<ip>:8500/v1/catalog/service/<service>?pretty=true) you see the IP listed in the ServiceID?

Even when I use --hostname=10.134.26.172 I get a serviceID containing the hostname:

[
    {
        "Node": "10.134.26.172",
        "Address": "10.134.26.172",
        "TaggedAddresses": {
            "lan": "10.134.26.172",
            "wan": "10.134.26.172"
        },
        "ServiceID": "paas-slave-01:mesos-9832c5fa-0031-4189-aa82-d9381128bb01-S35.1387bdd3-1cae-4493-b56a-97f28209d941:8000",
        "ServiceName": "content-providers",
        "ServiceTags": [
            "content-providers",
            "haproxy",
            "haproxy_weight=100",
            "haproxy_httpchk=GET /providers/universal"
        ],
        "ServiceAddress": "10.134.26.172",
        "ServicePort": 31208,
        "ServiceEnableTagOverride": false,
        "CreateIndex": 740298,
        "ModifyIndex": 740325
    }
]
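
To compare what registrator actually registered with what the wrapper script computes, the host part of the ServiceID can be pulled out of such a response. This is a rough sketch: the JSON is an abbreviated copy of the response above, inlined so it runs offline, and plain sed is used only for illustration (a JSON-aware tool would be more robust).

```shell
#!/usr/bin/env bash
# Abbreviated copy of the catalog response, inlined for an offline run.
response='[
    {
        "Node": "10.134.26.172",
        "ServiceID": "paas-slave-01:mesos-9832c5fa-0031-4189-aa82-d9381128bb01-S35.1387bdd3-1cae-4493-b56a-97f28209d941:8000",
        "ServicePort": 31208
    }
]'

# Value that the agent injects as $HOST in this setup.
HOST="10.134.26.172"

# Extract the ServiceID value from the JSON text.
registered_id=$(sed -n 's/.*"ServiceID": "\([^"]*\)".*/\1/p' <<<"$response")

# The first colon-separated field is the host part chosen by registrator.
registered_host=${registered_id%%:*}

# What the start.sh expansion would put in that position.
computed_host=${HOST%%.*}

if [ "$registered_host" = "$computed_host" ]; then
  echo "match: $registered_host"
else
  echo "mismatch: registered '$registered_host' vs computed '$computed_host'"
fi
```

With these values the two host parts differ, which is exactly why the maintenance call reports that no service is registered under the computed ID.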

Can you confirm your registrator settings look similar to mine?

REGISTRATOR_APP_PARAMS=-cleanup -ip=10.134.26.172 consul://10.134.26.172:8500

It seems like somehow registrator is still using the hostname when it registers the service:

registrator stderr | 2016/11/30 20:05:50 added: 274744eaa299 paas-slave-05:mesos-9832c5fa-0031-4189-aa82-d9381128bb01-S21.c7187f91-5b2b-4ec1-b239-58065cf43616:8000

For what it's worth, I also tried with -ip 10.134.26.172 (no =) for REGISTRATOR_APP_PARAMS, but the service still got registered with the hostname.

sielaq commented 7 years ago

When you use --hostname=10.0.0.100 you see the Consul ServiceID with the IP address

That part I understand, and we have discussed it already. I just don't understand why you can't set up DNS to resolve your slaves!

cookandy commented 7 years ago

that part I understand and we have discussed that already

The part I don't understand is why you're seeing ServiceID with the IP address, and I am seeing it with the hostname. This documentation makes it sound like the ID can only contain the hostname, so I'm confused.

I just don't understand why you can't set up DNS to resolve your slaves!

I didn't want to have another service to manage. If this can be done via the included DNSMasq, maybe that is a better option. But I think DNSMasq is only forwarding to Consul, so that wouldn't work - right?

sielaq commented 7 years ago

The part I don't understand

This is by registrator design: either you deal with it, or it requires a fix. Just to make sure: I see and use the hostname in the ServiceID -> mesos-slave --hostname=paasslave001. I only saw the IP when I tested mesos-slave --hostname=10.0.0.100.

I didn't want to have another service to manage.

Just try the IP and hostname in /etc/hosts.

sielaq commented 7 years ago

Moreover, we specifically point that out here: https://github.com/eBayClassifiedsGroup/PanteraS/blob/master/generate_yml.sh#L20

cookandy commented 7 years ago

This is by registrator design: either you deal with it, or it requires a fix.

If you are seeing IP in ServiceID when using --hostname=10.0.0.100, then I shouldn't need a fix - it sounds like it works fine for you. But I have never seen IP in the ServiceID, even when using mesos-agent --hostname=<IP>.

Are you using an older version of mesos? I am using a newer version where mesos-slave has been renamed to mesos-agent

root@paas-slave-02:~# ps ax | grep mesos-agent
16262 ?        Sl     4:08 mesos-agent --master=zk://10.134.22.239:2181,10.134.23.87:2181,10.134.26.121:2181/mesos --containerizers=docker,mesos --executor_registration_timeout=5mins --hostname=10.134.26.178 --ip=10.134.26.178 --docker_stop_timeout=5secs --gc_delay=1days --docker_socket=/tmp/docker.sock --no-systemd_enable_support --work_dir=/tmp/mesos

Just try the IP and hostname in /etc/hosts.

This doesn't work. I can resolve the hostname just fine outside of the container, but once I get in the container, I cannot. Not unless I map /etc/hosts as an external volume. But that sometimes causes problems as I mentioned, because it overwrites the internal /etc/hosts file in the container, which contains the docker-specific host entry - for example:

172.17.0.2  503bbde45761
cookandy commented 7 years ago

I think one solution might be to configure my slaves to use themselves for DNS, instead of the masters. Because DNSMasq is using --host-record=paas-slave-02,10.134.26.178, it will create a record in the local DNSMasq instance. Therefore, if I add my own slave to /etc/resolv.conf, I should be able to resolve my own hostname from inside the container...

when mesos-slave --hostname=10.0.0.100 I see IP then like: "ServiceID": "10.0.0.100:mesos-361ed1a9-6409-41f0-8e39-f846582ec1a4-S0.02eb068a-f66a-42b7-a950-e98baa848452:8080",

I'd really like to figure out why you're seeing the IP address in ServiceID, and I am not.

sielaq commented 7 years ago

Can you log in to the running container like at the beginning

 docker exec -ti 9670e6203c2c bash

and check with env whether there is an environment variable that matches the hostname of the native host (or contains the FQDN)? Maybe the new mesos creates different variables now?

cookandy commented 7 years ago

Unfortunately no such variable exists.

I kinda came up with a dirty hack by mapping /etc/hostname to /etc/hostname.orig in my marathon deploy, and then I modified your script to use a new ${HOSTNAME} variable:

maintenance(){
  # Container name will be provided by mesos:
  MESOS_CONTAINER_NAME=${MESOS_CONTAINER_NAME:-$CONTAINER_NAME}
  HOSTNAME=$(cat /etc/hostname.orig)
  # Mesos provides external ports in comma-separated $PORTS
  for port in $(sed 's/,/ /g'<<<${PORTS})
  do
    # For each external port you can look up the internal one from PORT_${int}
    port_int=$(env|sed -n "s/PORT_\([0-9]*\)=$port/\1/p")
    # registrator uses a ServiceID built from variables that are all available here
    consul_service_id="${HOSTNAME}:${MESOS_CONTAINER_NAME}:${port_int}"
    curl -X PUT "http://${HOST}:8500/v1/agent/service/maintenance/${consul_service_id}?enable=true"
    # if you use udp uncomment also the udp to switch into maintenance mode
    #curl -X PUT "http://${HOST}:8500/v1/agent/service/maintenance/${consul_service_id}:udp?enable=true"
  done
}
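
One fragile spot in this workaround is that the script breaks silently if the mapped file is missing. A slightly more defensive lookup could fall back to the $HOST-based name. This is only a sketch: node_name is a hypothetical helper, and /etc/hostname.orig is the path chosen in the hack above.

```shell
#!/usr/bin/env bash
# Print the agent's hostname from the mapped file, or fall back to the
# short form of $HOST when the file is not readable.
node_name() {
  local file=${1:-/etc/hostname.orig}
  local host=${HOST:-}
  if [ -r "$file" ]; then
    # Drop the trailing newline and any stray whitespace.
    tr -d '[:space:]' < "$file"
  else
    printf '%s' "${host%%.*}"
  fi
}

# Usage inside maintenance():
#   consul_service_id="$(node_name):${MESOS_CONTAINER_NAME}:${port_int}"
```

Using a dedicated function also avoids overwriting the container's own HOSTNAME variable.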

Thanks for the help! Things would be much easier with DNS!