I also had some trouble keeping the correct configuration with haproxy on all 3 master nodes.
Sometimes it is because I have to reboot the server (hard disk full).
Sometimes it is because a port is still allocated by docker-proxy after a service crashed and the new service tries to reuse that port.
I have sometimes found a different haproxy_a.cfg/haproxy_b.cfg from one server to another, but I hadn't noticed these multiple haproxy_reload processes.
I have to hard reset all 3 master nodes and relaunch all the applications. Let me know if you know how to patch this :)
Hi,
Is this a known issue?
Actually it was a known issue, but it was related to consul-template - we had to downgrade it to ver. 0.10 https://github.com/hashicorp/consul-template/issues/442 (long story short: it was a bug with signal handling in go1.5 and go1.6 https://github.com/golang/go/issues/13164). But this should not happen with 0.2.3 - it ships a stable version from before this bug appeared (or the bug was never exposed there). Maybe we could try the newest version.
To avoid SOME of the issues mentioned by @kopax, it is very important to detect flapping services in the PaaS that are constantly being re-spawned by Marathon - they can fill the disk but also trigger a lot of parallel HAProxy reloads, which can cause race conditions. But we have not had that issue in prod anymore, that's why you made me worried.
Anyway, I can recommend using fabio (which is also included in this image) instead - it has far fewer moving parts that depend on external projects, and fabio is supported by us.
Fabio looks nice, but it doesn't support TCP proxying. It's still useful for most of the services you can run.
Is there any update on this?
Also, is this project still being maintained?
@tsyardley It is maintained, but as you can see this project is rather an integration of a few other projects. As I suggested, we recommend using fabio instead of haproxy due to the many SPoFs that can occur. Your situation shows exactly that if one HAProxy process hangs, it can cause an avalanche of unwanted issues.
We are also using HAProxy in prod. We did not have the issue you described, but we have restricted ourselves to avoid flapping services - which are the main cause of any race condition in HAProxy parallel reloads. Due to the limits of HAProxy (the config cannot be reloaded without stopping), we cannot eliminate the race conditions.
I can only suggest that you use a different load balancer instead, one that we fully support - eBay's fabio.
@tsyardley I was just thinking, can you provide me the consul TAGs you use for your applications? Maybe we use the PaaS differently, consul-template generates bad config for haproxy, and that's why haproxy is not re-spawnable.
@tsyardley can you provide something?
@sielaq apologies for not replying - I have been on holiday :sun_with_face:
I will be able to provide you with the information tomorrow
Hi @sielaq, I am working with @tsyardley and have the following update.
In the ps listing given above, the strange thing is haproxy_reload.sh. In a normal system, haproxy goes into the background, releasing haproxy_reload.sh. I've noticed that on our system, where there are remaining haproxy_reload scripts, the child haproxy processes have stopped (state T). This can happen as a race condition when two haproxy processes launch at the same time. During its early processing, haproxy sends a SIGTTOU to the previous PIDs to pause their listeners while it attempts to start its own listeners. Only if that is successful does it then set up its signal handlers (i.e. quite late in the processing) and detach into the background, releasing haproxy_reload.sh. However, when two haproxies are starting at the same time, a race condition can occur: while the first has not yet set up handlers for SIGTTOU and SIGTTIN, the default behaviour for these signals is to stop the process, so the process slightly ahead sends a SIGTTOU to the process slightly behind, which then stops. Once stopped, it is stopped forever unless sent a SIGCONT. See this ps listing with more flags showing the stopped processes marked by state T:
F S USER PID PPID SIZE WCHAN STIME TIME COMMAND
0 S root 758 31905 408 wait 06:38 00:00:00 /bin/bash /opt/consul-template/haproxy_reload.sh
1 S root 1425 4724 480 wait 09:21 00:00:00 /bin/bash /opt/consul-template/haproxy_reload.sh
4 T root 1450 1425 4984 signal 09:21 00:00:00 /usr/sbin/haproxy_a -p /tmp/haproxy_a.pid -f /etc/haproxy/haproxy_a.cfg -sf 31876 5036 4391 3852 3497 3250 2847 2530 2061 1338
0 S root 2619 31905 408 wait 08:24 00:00:00 /bin/bash /opt/consul-template/haproxy_reload.sh
1 S root 4472 7080 480 wait 08:36 00:00:00 /bin/bash /opt/consul-template/haproxy_reload.sh
4 T root 4495 4472 4984 signal 08:36 00:00:00 /usr/sbin/haproxy_a -p /tmp/haproxy_a.pid -f /etc/haproxy/haproxy_a.cfg -sf 31876 3139 2847 2530 2061 1338
0 S root 4724 31905 408 wait 09:20 00:00:00 /bin/bash /opt/consul-template/haproxy_reload.sh
1 S root 6294 8895 480 wait 08:58 00:00:00 /bin/bash /opt/consul-template/haproxy_reload.sh
4 T root 6312 6294 4984 signal 08:58 00:00:00 /usr/sbin/haproxy_a -p /tmp/haproxy_a.pid -f /etc/haproxy/haproxy_a.cfg -sf 31876 4277 3852 3497 3250 2847 2530 2061 1338
0 S root 7080 31905 408 wait 08:35 00:00:00 /bin/bash /opt/consul-template/haproxy_reload.sh
0 S root 8895 31905 408 wait 08:58 00:00:00 /bin/bash /opt/consul-template/haproxy_reload.sh
1 S root 15485 31905 5248 ep_pol 14:13 00:00:08 /usr/sbin/haproxy_b -p /tmp/haproxy_b.pid -f /etc/haproxy/haproxy_b.cfg -sf 30529 28979 17102
1 S root 15499 31905 5276 ep_pol 14:13 00:00:08 /usr/sbin/haproxy_a -p /tmp/haproxy_a.pid -f /etc/haproxy/haproxy_a.cfg -sf 31876 17116 5147 4391 3497 3250 2847 2530 2061 1338
1 S root 18892 31905 4976 ep_pol 07:07 00:00:14 /usr/sbin/haproxy_a -p /tmp/haproxy_a.pid -f /etc/haproxy/haproxy_a.cfg -sf 31665
1 S root 20175 22752 480 wait 08:30 00:00:00 /bin/bash /opt/consul-template/haproxy_reload.sh
4 T root 20192 20175 4984 signal 08:30 00:00:00 /usr/sbin/haproxy_a -p /tmp/haproxy_a.pid -f /etc/haproxy/haproxy_a.cfg -sf 31876 2736 2530 2061 1338
4 T root 20773 31905 4836 signal Oct03 00:00:00 /usr/sbin/haproxy_b -p /tmp/haproxy_b.pid -f /etc/haproxy/haproxy_b.cfg -sf 27152
4 S root 21267 17925 7696 futex_ 09:33 00:00:55 consul-template -consul=172.31.193.38:8500 -template haproxy.cfg.ctmpl:/etc/haproxy/haproxy.cfg:/opt/consul-template/haproxy_reload.sh -max-stale=0
0 S root 22752 31905 408 wait 08:30 00:00:00 /bin/bash /opt/consul-template/haproxy_reload.sh
1 S root 23230 26579 480 wait 08:41 00:00:00 /bin/bash /opt/consul-template/haproxy_reload.sh
4 T root 23247 23230 4984 signal 08:41 00:00:00 /usr/sbin/haproxy_a -p /tmp/haproxy_a.pid -f /etc/haproxy/haproxy_a.cfg -sf 31876 3385 3250 2847 2530 2061 1338
1 S root 25883 28852 480 wait 08:13 00:00:00 /bin/bash /opt/consul-template/haproxy_reload.sh
4 T root 25900 25883 4984 signal 08:13 00:00:00 /usr/sbin/haproxy_a -p /tmp/haproxy_a.pid -f /etc/haproxy/haproxy_a.cfg -sf 31876 1950 1338
0 S root 26579 31905 408 wait 08:41 00:00:00 /bin/bash /opt/consul-template/haproxy_reload.sh
1 S root 27848 30761 480 wait 07:56 00:00:00 /bin/bash /opt/consul-template/haproxy_reload.sh
4 T root 27865 27848 4984 signal 07:56 00:00:00 /usr/sbin/haproxy_a -p /tmp/haproxy_a.pid -f /etc/haproxy/haproxy_a.cfg -sf 31876 1227
0 S root 28852 31905 408 wait 08:13 00:00:00 /bin/bash /opt/consul-template/haproxy_reload.sh
1 S root 30669 758 476 wait 06:39 00:00:00 /bin/bash /opt/consul-template/haproxy_reload.sh
4 T root 30681 30669 4984 signal 06:39 00:00:00 /usr/sbin/haproxy_b -p /tmp/haproxy_b.pid -f /etc/haproxy/haproxy_b.cfg -sf 30417 28979
0 S root 30761 31905 408 wait 07:56 00:00:00 /bin/bash /opt/consul-template/haproxy_reload.sh
1 S root 31943 2619 480 wait 08:25 00:00:00 /bin/bash /opt/consul-template/haproxy_reload.sh
4 T root 31960 31943 4984 signal 08:25 00:00:00 /usr/sbin/haproxy_a -p /tmp/haproxy_a.pid -f /etc/haproxy/haproxy_a.cfg -sf 31876 2419 2061 1338
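To illustrate the point (this is only a sketch, not the haproxy_reload.sh that ships with PanteraS, and the lock file path is arbitrary), serializing the launches with an exclusive lock would close the window in which one haproxy can SIGTTOU a sibling that has not yet installed its signal handlers:

# hypothetical serialized reload for the "a" instance - illustration only
(
  flock -w 60 9 || exit 1                                # wait for any in-flight reload to finish
  /usr/sbin/haproxy_a -p /tmp/haproxy_a.pid \
      -f /etc/haproxy/haproxy_a.cfg \
      -sf $(cat /tmp/haproxy_a.pid 2>/dev/null) 9>&-     # close the lock fd in the daemonized children
) 9>/var/lock/haproxy_a.reload
# the subshell holds the lock until the new haproxy has bound its listeners, installed its
# handlers and detached, i.e. until it is past the window where a sibling's SIGTTOU would stop it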
We are exploring how to slow down consul-template so that it does not reload too quickly to hopefully avoid this race. Any advice on optimum values of the wait parameters to use?
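For example, consul-template accepts a -wait flag with a minimum:maximum quiescence window; something like the following is what we are experimenting with (the values are illustrative, not a recommendation):

consul-template -consul=172.31.193.38:8500 \
  -wait=5s:30s \
  -template haproxy.cfg.ctmpl:/etc/haproxy/haproxy.cfg:/opt/consul-template/haproxy_reload.sh \
  -max-stale=0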
Which version of PanteraS do you run? It seems like the issue we already had with consul-template a few months ago. Nevertheless, I will provide a more robust reload script soon with a new release, so that it does not reload when the configuration is broken.
Hi Sielaq,
We are currently using v0.2.3 - your suggested change sounds good, thanks!
0.3.0 has been released
We are using panteras (0.2.3) and deploying services to marathon. After some time our services stop talking to one another via the haproxy ports - but they still can via the marathon ports.
After some digging around we've found that on the nodes where this occurs there are also multiple haproxy reload processes - to fix this, these have to be killed and the consul-template haproxy process needs to be restarted in supervisor.
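Roughly what we do on an affected node is the following; the supervisor program name here is from our setup and is only a guess for anyone else's (it is not something defined by PanteraS):

# kill only the haproxy processes stuck in the stopped (T) state
for pid in $(ps -eo pid=,stat=,comm= | awk '$2 ~ /^T/ && $3 ~ /^haproxy_/ {print $1}'); do
  kill -KILL "$pid"                                      # SIGKILL is delivered even to stopped processes
done
# kill the haproxy_reload.sh wrappers still waiting on them
pkill -f '^/bin/bash /opt/consul-template/haproxy_reload.sh'
# restart the consul-template/haproxy pair under supervisor (program name may differ)
supervisorctl restart consul-template-haproxy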
Some evidence follows:
Is this a known issue?