QubitProducts / bamboo

HAProxy auto configuration and auto service discovery for Mesos Marathon
Apache License 2.0

Stale Haproxy processes #200

Open malterb opened 8 years ago

malterb commented 8 years ago

Hi,

I am running into an issue where I constantly get stale haproxy processes. I have tried "everything", but can't get it to work. This is my bamboo.log for an occasion where it happened:

2016/02/17 16:42:48 Starting update loop
2016/02/17 16:42:48 Environment variable not set: MARATHON_USE_EVENT_STREAM
2016/02/17 16:42:48 Environment variable not set: STATSD_ENABLED
2016/02/17 16:42:48 bamboo_startup => 2016-02-17T16:42:48Z
2016/02/17 16:42:48 Queuing an haproxy update.
2016/02/17 16:42:48 Skipped HAProxy configuration reload due to lack of changes
2016/02/17 16:42:48 subscribe_event => 2016-02-17T16:42:49.973Z
2016/02/17 16:42:48 Queuing an haproxy update.
2016/02/17 16:42:48 Skipped HAProxy configuration reload due to lack of changes
2016/02/17 16:43:25 status_update_event => 2016-02-17T16:43:25.568Z
2016/02/17 16:43:25 Queuing an haproxy update.
2016/02/17 16:43:25 Generating validation command
2016/02/17 16:43:25 Validating config
2016/02/17 16:43:25 Exec cmd: haproxy -c -f /tmp/bamboo601755456
2016/02/17 16:43:25 Exec cmd: haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -D -sf $(cat /var/run/haproxy.pid)
2016/02/17 16:43:25 Cleaning up config
2016/02/17 16:43:25 Exec cmd:
2016/02/17 16:43:25 Reloaded HAProxy configuration
2016/02/17 16:43:27 status_update_event => 2016-02-17T16:43:28.492Z
2016/02/17 16:43:27 Queuing an haproxy update.
2016/02/17 16:43:27 Generating validation command
2016/02/17 16:43:27 Validating config
2016/02/17 16:43:27 Exec cmd: haproxy -c -f /tmp/bamboo935768479
2016/02/17 16:43:27 Exec cmd: haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -D -sf $(cat /var/run/haproxy.pid)
2016/02/17 16:43:27 Cleaning up config
2016/02/17 16:43:27 Exec cmd:
2016/02/17 16:43:27 Reloaded HAProxy configuration
2016/02/17 16:54:56 Domain mapping: Stated changed

As you can see from my process list, there are two sets of haproxy processes:

root@haproxy:/# ps aux | grep haproxy
haproxy  22450  0.0  0.0  25796  5212 ?        Ss   16:43   0:00 haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -D -sf 22327 22328 22329 22330
haproxy  22451  0.0  0.0  25816  4720 ?        Ss   16:43   0:00 haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -D -sf 22327 22328 22329 22330
haproxy  22452  0.0  0.0  25816  4516 ?        Ss   16:43   0:00 haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -D -sf 22327 22328 22329 22330
haproxy  22453  0.0  0.0  26044  5404 ?        Ss   16:43   0:00 haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -D -sf 22327 22328 22329 22330
haproxy  22460  0.0  0.0  25712  5020 ?        Ss   16:43   0:00 haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -D -sf 22450 22451 22452 22453
haproxy  22461  0.0  0.0  25852  4960 ?        Ss   16:43   0:00 haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -D -sf 22450 22451 22452 22453
haproxy  22462  0.0  0.0  25824  5308 ?        Ss   16:43   0:00 haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -D -sf 22450 22451 22452 22453
haproxy  22463  0.0  0.0  26072  5460 ?        Ss   16:43   0:00 haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -D -sf 22450 22451 22452 22453
root     22568  0.0  0.0  10472  2224 pts/0    S+   16:52   0:00 grep --color=auto haproxy
root@haproxy:/# cat /var/run/haproxy.pid
22460
22461
22462
22463

and I use the following config for bamboo:

  "HAProxy": {
    "TemplatePath": "/var/bamboo/haproxy_template.cfg",
    "OutputPath": "/etc/haproxy/haproxy.cfg",
    "ReloadCommand": "haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -D -sf $(cat /var/run/haproxy.pid)",
    "ReloadValidationCommand": "haproxy -c -f {{.}}"
  },

and the relevant parts of my haproxy.cfg:

global
        log /dev/log    local0
        log /dev/log    local1 notice
        chroot /var/lib/haproxy
        stats socket /run/haproxy/admin.sock mode 660 level admin
        stats timeout 30s
        user haproxy
        group haproxy
        daemon
    tune.ssl.default-dh-param 2048
    nbproc 4

defaults
        log     global
        mode    http
        option  httplog
        option forwardfor
        option dontlognull
    option forceclose
        timeout connect 5000
        timeout client  50000
        timeout server  50000

Sorry for the long post. Would've gone to SO or SF, but thought this might be an issue with bamboo.

Can anyone point me in the right direction?

hammi85 commented 8 years ago

I can report exactly the same issue. When I run my bamboo Docker container from an old build, everything works fine, but since I updated my container two days ago this has been happening to me too.

A little help would be awesome :)

rasputnik commented 8 years ago

I've seen this before - constant haproxy reloads will often cause a logjam if they happen too frequently, although Bamboo attempts to debounce reloads.

Are you constantly redeploying apps? A haproxy reload should only happen when Marathon tasks move around in Mesos, causing the config to change and requiring a reload.

I fixed #177 - which caused unnecessary restarts - for the 0.2.14 release; are you using an older version?

mohamedhaleem commented 8 years ago

We typically deploy multiple times in the course of a day. This is part of a CI/CD environment, and it happens for me with 0.2.14. Sorry about the long post.

Here is the snip in bamboo.json

"HAProxy": { "TemplatePath": "/var/bamboo/haproxy_template.cfg", "OutputPath": "/etc/haproxy/haproxy.cfg", "ReloadCommand": "/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf $(cat /var/run/haproxy.pid)", "ReloadValidationCommand": "/sbin/haproxy -c -f {{.}}" }

Before starting bamboo, here is what the ps output looks like:

> ps aux | grep haproxy |grep -v grep

root 30508 0.0 0.0 46332 1724 ? Ss 17:55 0:00 /usr/sbin/haproxy-systemd-wrapper -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid
haproxy 30509 0.0 0.0 52108 3608 ? S 17:55 0:00 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -Ds
haproxy 30510 7.5 0.0 52380 2028 ? Ss 17:55 13:03 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -Ds

> cat /var/run/haproxy.pid

30510

At the next refresh or app deploy, we notice the following:

Bamboo logs:

2016/02/20 20:51:08 Starting update loop
2016/02/20 20:51:08 bamboo_startup => 2016-02-20T20:51:08Z
2016/02/20 20:51:08 Queuing an haproxy update.
2016/02/20 20:51:08 Exec cmd: /sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf $(cat /var/run/haproxy.pid)
[martini] listening on :8000 (development)
2016/02/20 20:51:08 HAProxy: Configuration updated

> ps aux | grep haproxy |grep -v grep

haproxy 30820 7.2 0.0 52072 1784 ? Ss 20:51 0:02 /sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 30510

After the next refresh...

> ps aux | grep haproxy |grep -v grep

haproxy 30820 7.2 0.0 52072 1784 ? Ss 20:51 0:02 /sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 30510
haproxy 30770 7.2 0.0 52072 1784 ? Ss 20:51 0:02 /sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 30820

Each subsequent refresh just keeps adding to the list of haproxy processes.

timoreimann commented 8 years ago

HAProxy processes are designed to live as long as there are still connections being served. Could it possibly be that you have some long-running connections still pending when the reload is initiated?

We're operating in a high-frequency deployment environment as well. For us, it's not uncommon to see 15-20 HAProxy processes alive at the same time due to long-running WebSocket connections. They do rotate out after a few hours and get replaced by newer processes, however, which is an indication of progress. You might want to check on that behavior as well.
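
For reference, a quick way to see whether the newest set of processes is still holding connections is the stats socket. This is only a sketch using the admin.sock path from the global section above; processes started before the last reload detach from that socket, so they have to be inspected via ps instead:

# Ask the haproxy instance bound to the admin socket for its PID and
# current connection count. Assumes the "stats socket /run/haproxy/admin.sock"
# line from the config posted above.
echo "show info" | socat stdio /run/haproxy/admin.sock | grep -E 'Pid|CurrConns'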

malterb commented 8 years ago

There shouldn't be any long-running connections, to be honest (5s max). Our problem is that the stale haproxy processes still accept connections and return 503s because they point at now-defunct instances.

timoreimann commented 8 years ago

It seems strange that HAProxy ends up with so many PIDs. For us, it's only ever one PID that's passed to -sf, and the PID file never contains more than one entry either.

I'd try to figure out whether the PID file is populated and cleaned up properly. Are you using HAProxy natively or inside Docker?

malterb commented 8 years ago

I always get 4 PIDs because of nbproc 4. The issue remains even when I use nbproc 1 and hence only one PID.
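
In case it helps with debugging, here is a rough way to tell the expected workers apart from the stale ones. It assumes the /var/run/haproxy.pid path from the config above; with nbproc 4 the pid file should list exactly four PIDs, so anything else is left over from an earlier reload:

# List haproxy processes whose PID is not recorded in the pid file, i.e. stale ones.
for pid in $(pgrep -x haproxy); do
  grep -qw "$pid" /var/run/haproxy.pid || echo "stale haproxy process: $pid"
done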

mohamedhaleem commented 8 years ago

I found a similar problem that others have reported with consul-template / haproxy: https://github.com/hashicorp/consul-template/issues/442

Could this be a similar Go-related issue?

Today we updated to 0.2.15 and changed the reload command as follows:

"ReloadCommand": "/bin/systemctl reload haproxy"

So far, it seems to be working a world better.
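
For anyone wondering why the systemctl route behaves better: the distro unit reloads through haproxy-systemd-wrapper (the process visible in the ps output above), which, as far as I understand, reads the pid file and re-execs haproxy with -sf for you. A typical unit looks roughly like this; it is only a sketch based on the stock haproxy 1.5/1.6 unit, so paths and options may differ on your distro:

[Unit]
Description=HAProxy Load Balancer
After=network.target

[Service]
# Validate the config before (re)starting.
ExecStartPre=/usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -c -q
ExecStart=/usr/sbin/haproxy-systemd-wrapper -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid
# On "systemctl reload haproxy" the wrapper receives SIGUSR2 and re-execs haproxy
# with -sf <old pids>, so the old workers are always told to finish and exit.
ExecReload=/bin/kill -USR2 $MAINPID
KillMode=mixed
Restart=always

[Install]
WantedBy=multi-user.target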

jmprusi commented 8 years ago

EDIT: Even with grace 0s I'm still getting stale haproxy processes. My original comment follows.

I was having this issue (haproxy 1.6 inside Docker); using "grace 0s" in the "defaults" section of the haproxy template config solves the issue.

defaults
        log     global
        mode    http
        option  httplog
        option  dontlognull
        timeout connect 5000
        timeout client  50000
        timeout server  50000
        grace  0s

# Template Customization
frontend http-in
        bind *:80
        {{ $services := .Services }}

Documentation: https://cbonte.github.io/haproxy-dconv/configuration-1.5.html#4.2-grace

This works for short-lived connections. If you have long-running connections, those will get killed, so perhaps you can increase the grace period. But the weird thing is: why does haproxy keep accepting new connections?

malterb commented 8 years ago

Has anyone tried marathon-lb's reload command?

https://github.com/mesosphere/marathon-lb/blob/master/service/haproxy/run

#!/bin/bash
exec 2>&1
export PIDFILE="/tmp/haproxy.pid"
exec 200<$0

reload() {
  echo "Reloading haproxy"
  if ! haproxy -c -f /marathon-lb/haproxy.cfg; then
    echo "Invalid config"
    return 1
  fi
  if ! flock 200; then
    echo "Can't aquire lock, reload already in progress?"
    return
  fi

  # Begin to drop SYN packets with firewall rules
  IFS=',' read -ra ADDR <<< "$PORTS"
  for i in "${ADDR[@]}"; do
    iptables -w -I INPUT -p tcp --dport $i --syn -j DROP
  done

  # Wait to settle
  sleep 0.1

  # Save the current HAProxy state
  socat /var/run/haproxy/socket - <<< "show servers state" > /var/state/haproxy/global

  # Trigger reload
  haproxy -p $PIDFILE -f /marathon-lb/haproxy.cfg -D -sf $(cat $PIDFILE)

  # Remove the firewall rules
  IFS=',' read -ra ADDR <<< "$PORTS"
  for i in "${ADDR[@]}"; do
    iptables -w -D INPUT -p tcp --dport $i --syn -j DROP
  done

  # Need to wait 1s to prevent TCP SYN exponential backoff
  sleep 1
  flock -u 200
}

mkdir -p /var/state/haproxy
mkdir -p /var/run/haproxy

reload

trap reload SIGHUP
while true; do sleep 0.5; done
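
One note in case someone wants to wire that script up to bamboo: it validates the config and handles the -sf handoff itself, and it reloads whenever it receives SIGHUP, so bamboo's ReloadCommand only needs to deliver that signal. A hypothetical hookup, assuming the script is supervised by runit as a service named "haproxy" (which is how marathon-lb runs it; adjust to whatever supervisor you use):

# Hypothetical wiring: runit supervises the run script above as "haproxy",
# so bamboo's ReloadCommand can simply be   "ReloadCommand": "sv hup haproxy"
# sv delivers SIGHUP to the supervised script, which triggers its reload().
sv hup haproxy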

malterb commented 8 years ago

Found https://github.com/hashicorp/consul-template/issues/442 and https://github.com/golang/go/issues/13164

This could actually be related to Go. I just compiled bamboo with Go 1.6 and will update this issue accordingly.

BTW: another reload script that I might try if that doesn't work: https://github.com/eBayClassifiedsGroup/PanteraS/blob/master/infrastructure/haproxy_reload.sh

imrangit commented 8 years ago

@elmalto: were you able to resolve the issue with the latest Go 1.6 or did you employ a reload script?

-Imran

malterb commented 8 years ago

I have not seen this issue since upgrading to Go 1.6.

mvallerie commented 8 years ago

Hey,

As stated in #206, we had this issue before on our Mesos cluster. After migrating our Docker images to Go 1.6 about two days ago, it looks like that fixed it.

I can't confirm this yet, since we also have a lot of long-running connections, but the number of haproxy processes after two days seems much more reasonable than before. I'll have another look during the next week and post again if something changes.

Thanks for figuring that out anyway :).

j1n6 commented 8 years ago

The upgrade might have helped, but I have a hunch that it's likely to be HAProxy itself.

Do you have any information/data about how often your deployment triggers reload?

mvallerie commented 8 years ago

Hey, sure!

Usually on this cluster we get around 0 to 5 updates a day. The day it failed, we had many more (probably around 10), which resulted in something like 15+ haproxy processes on some Mesos slaves.

We have one bamboo running (as a Docker container) on each Mesos slave. Right now they have been up for one week; we had some updates last Friday, but the number of haproxy processes only grew to 6. More importantly, that number sometimes goes down, which wasn't the case before the upgrade.

> I have a hunch that it's likely to be HAProxy itself.

My guess is you are right. We used marathon-lb before bamboo, and we also had this issue with it.

j1n6 commented 8 years ago

I suggest moving to Nginx to replace HAProxy; there's a branch that @bluepeppers has been working on that would support multiple reload destinations, but it's still WIP.

mvallerie commented 8 years ago

Does nginx support TCP balancing (as HAProxy does) outside of its "Plus" version? It looks unclear to me.

I know it may work after building nginx with some extra modules. I'm just unsure about what those "extra modules" may or may not support compared to haproxy.

j1n6 commented 8 years ago

Yup, it does. If you are using the Nginx Plus version, both TCP and UDP are supported out of the box.

If you are using the open source version, try this Nginx-compatible fork: https://github.com/alibaba/tengine

mvallerie commented 7 years ago

@activars Just to let you know (we're still on haproxy): it happened again today on one of our Mesos slaves. This issue now seems to happen only in very rare/specific situations (this is the only time it has happened since), and is probably more related to haproxy or Go, so it's probably not necessary to reopen.

According to the refs above, upgrading haproxy to the latest 1.5.x might be the way to fix this for good. Since minor version upgrades shouldn't hurt, I prepared a Docker image including haproxy 1.5.19 (vs 1.5.8) and based on Go 1.8 (vs 1.6; well, that's not a minor upgrade, but let's trust the promise of compatibility, and let me know if that sounds like a terrible mistake :).

I'm going to test this during the next few days.

nagsharma32 commented 5 years ago

I still have the problem. Running haproxy:1.7.5

tcolgate commented 5 years ago

I'm very sorry for the lack of comms on this thread. We no longer run bamboo (we are no longer on Mesos) and won't be able to provide ongoing maintenance. If anyone is interested in maintaining it going forward, please raise another issue and we'll look at redirecting people to a fork.