fabiolb / fabio

Consul Load-Balancing made simple
https://fabiolb.net
MIT License

Fabio Has Stale Routes #481

Open codyja opened 6 years ago

codyja commented 6 years ago

Hi, we've experienced an issue twice where a single fabio container in a group of fabios (running under nomad) ends up with old, stale routes. Fabio determines the routes to add to its route table by connecting to the local Consul agent on the host and looking for the default urlprefix- tag. The behavior I saw was that some of our traffic was getting 502 errors: when traffic was sent to the problematic fabio container, it would try to deliver it to a container:port that no longer existed in the nomad cluster. If I looked at /routes on this fabio, it didn't have the same routes as the other 9 or so fabios in this node class. Restarting the container resolves the immediate issue. We're running fabio 1.5.2.

Any ideas what may cause this? Any idea under what conditions fabio might not refresh its routes from Consul? Thanks for any information.
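
For context, fabio builds its routing table from Consul services that carry a urlprefix- tag. A Nomad service stanza that produces such a route looks roughly like the sketch below (the service name, port label, path and check are illustrative placeholders):

  service {
    name = "myapp"
    port = "http"
    # fabio picks this tag up and routes /myapp to this allocation's address:port
    tags = ["urlprefix-/myapp"]

    check {
      type     = "http"
      path     = "/health"
      interval = "10s"
      timeout  = "2s"
    }
  }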

magiconair commented 6 years ago

My guess is that your consul cluster has issues since fabio just waits for updates from Consul and then rebuilds the routing table. You can test that by having the fabios talk to the Consul servers instead of the local agents and see if the problem goes away. I would have a look at the Consul logs of these servers to see whether there were any strange things happening.
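
Which Consul endpoint fabio talks to is controlled by the registry.consul.addr option, which defaults to the local agent, so the test is a one-line change (the server address below is a placeholder):

fabio -registry.consul.addr "localhost:8500"          # default: local agent
fabio -registry.consul.addr "consul-server-1:8500"    # test: talk to a Consul server directly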

codyja commented 6 years ago

I should also note that when I queried services against the consul agent on the host where the problematic fabio was, I was getting an accurate list of services and destinations. We did check the Consul cluster and, from what we saw, couldn't find any issues occurring during this time. Also, each time it was just 1 out of about 10 fabios in a node class (and out of about 40 or so fabios in the entire cluster) that had issues. Does this info change anything?

Also, what do you recommend for how fabio should talk to Consul: the local Consul agent on each host, or the Consul servers directly?

Thank you

magiconair commented 6 years ago

fabio relies on Consul to sync state across the cluster, so the fabio instances are all independent. They just route to whatever Consul reports as up.

Fabio always replaces the entire routing table whenever there is a change. You should see the services which were added and removed in the fabio logs. The normal setup is for fabio to talk to the local Consul agent and let it handle the cluster state.

If your service has more than one health check then you could be affected by #427. There is another issue with websocket connections behind a load balancer but that shows different behavior.

codyja commented 6 years ago

We ran a script across the environment to pull the services from consul, then query every fabio and compare the routes in fabio to consul services destinations. We still have 3 (now 2 after I restarted one) fabios that have stale data. Is there anything I can gather or do before I restart them? Anything you'd like to see on these 2 remaining fabios that have old data in the route table? I'd be happy to do a zoom if desired. Thanks

magiconair commented 6 years ago

Did you pull the health data?

codyja commented 6 years ago

I hit the :9998/health endpoint and it returns OK. Is there anything else I can look at? Here's output from the docker logs on one host. Around 02:49:54 the log pattern seemed to change a little: no more consul messages. After that point, the logs look the same as they do right now.

2018/04/02 02:48:24 [INFO] consul: Manual config changed to #13223716
2018/04/02 02:48:27 [INFO] consul: Health changed to #13223724
2018/04/02 02:48:28 [INFO] consul: Manual config changed to #13223725
2018/04/02 02:48:28 [INFO] consul: Manual config changed to #13223726
2018/04/02 02:48:28 [INFO] consul: Manual config changed to #13223727
2018/04/02 02:48:28 [INFO] consul: Manual config changed to #13223728
2018/04/02 02:48:29 [INFO] consul: Manual config changed to #13223729
2018/04/02 02:48:30 [INFO] consul: Health changed to #13223731
2018/04/02 02:48:30 [INFO] consul: Health changed to #13223732
2018/04/02 02:48:31 [INFO] consul: Health changed to #13223736
2018/04/02 02:48:31 [INFO] consul: Health changed to #13223739
2018/04/02 02:48:33 [INFO] consul: Manual config changed to #13223740
2018/04/02 02:48:33 [INFO] consul: Manual config changed to #13223741
2018/04/02 02:48:33 [INFO] consul: Manual config changed to #13223742
2018/04/02 02:48:33 [INFO] consul: Manual config changed to #13223743
2018/04/02 02:48:34 [INFO] consul: Manual config changed to #13223744
2018/04/02 02:48:34 [INFO] consul: Health changed to #13223745
2018/04/02 02:48:34 [INFO] consul: Health changed to #13223768
2018/04/02 02:48:35 [INFO] consul: Health changed to #13223770
2018/04/02 02:48:36 [INFO] consul: Health changed to #13223771
2018/04/02 02:48:36 [INFO] consul: Health changed to #13223772
2018/04/02 02:48:37 [INFO] consul: Health changed to #13223784
2018/04/02 02:48:38 [INFO] consul: Health changed to #13223785
2018/04/02 02:48:38 [INFO] consul: Health changed to #13223786
2018/04/02 02:48:38 [INFO] consul: Manual config changed to #13223787
2018/04/02 02:48:38 [INFO] consul: Manual config changed to #13223788
2018/04/02 02:48:38 [INFO] consul: Manual config changed to #13223789
2018/04/02 02:48:38 [INFO] consul: Manual config changed to #13223790
2018/04/02 02:48:39 [INFO] consul: Manual config changed to #13223791
2018/04/02 02:48:40 [INFO] consul: Health changed to #13223793
2018/04/02 02:48:41 [INFO] consul: Health changed to #13223794
2018/04/02 02:48:41 [INFO] consul: Health changed to #13223796
2018/04/02 02:48:41 [INFO] consul: Health changed to #13223797
2018/04/02 02:48:42 [INFO] consul: Health changed to #13223798
2018/04/02 02:48:42 [INFO] consul: Health changed to #13223802
2018/04/02 02:48:43 [INFO] consul: Health changed to #13223803
2018/04/02 02:48:43 [INFO] consul: Manual config changed to #13223804
2018/04/02 02:48:43 [INFO] consul: Manual config changed to #13223805
2018/04/02 02:48:43 [INFO] consul: Manual config changed to #13223806
2018/04/02 02:48:43 [INFO] consul: Manual config changed to #13223807
2018/04/02 02:48:44 [INFO] consul: Manual config changed to #13223808
2018/04/02 02:48:44 [INFO] consul: Health changed to #13223809
2018/04/02 02:48:46 [INFO] consul: Health changed to #13223811
2018/04/02 02:48:46 [INFO] consul: Health changed to #13223818
2018/04/02 02:48:47 [INFO] consul: Health changed to #13223819
2018/04/02 02:48:47 [INFO] consul: Health changed to #13223820
2018/04/02 02:48:47 [INFO] consul: Health changed to #13223821
2018/04/02 02:48:48 [INFO] consul: Health changed to #13223827
2018/04/02 02:48:48 [INFO] consul: Manual config changed to #13223828
2018/04/02 02:48:48 [INFO] consul: Manual config changed to #13223829
2018/04/02 02:48:48 [INFO] consul: Manual config changed to #13223830
2018/04/02 02:48:48 [INFO] consul: Manual config changed to #13223831
2018/04/02 02:48:49 [INFO] consul: Manual config changed to #13223832
2018/04/02 02:48:51 [INFO] consul: Health changed to #13223834
2018/04/02 02:48:51 [WARN] No route for 192.168.15.244:443/
2018/04/02 02:48:53 [INFO] consul: Manual config changed to #13223835
2018/04/02 02:48:53 [INFO] consul: Manual config changed to #13223836
2018/04/02 02:48:53 [INFO] consul: Manual config changed to #13223837
2018/04/02 02:48:53 [INFO] consul: Manual config changed to #13223838
2018/04/02 02:48:54 [INFO] consul: Manual config changed to #13223839
2018/04/02 02:48:55 [INFO] consul: Health changed to #13223841
2018/04/02 02:48:55 [INFO] consul: Health changed to #13223846
2018/04/02 02:48:58 [INFO] consul: Health changed to #13223854
2018/04/02 02:48:58 [INFO] consul: Manual config changed to #13223862
2018/04/02 02:48:58 [INFO] consul: Manual config changed to #13223863
2018/04/02 02:48:58 [INFO] consul: Manual config changed to #13223864
2018/04/02 02:48:58 [INFO] consul: Manual config changed to #13223865
2018/04/02 02:48:59 [INFO] consul: Manual config changed to #13223866
2018/04/02 02:48:59 [INFO] consul: Health changed to #13223861
2018/04/02 02:49:02 [INFO] consul: Health changed to #13223868
2018/04/02 02:49:02 [INFO] consul: Skipping service "_nomad-task-kqo3mpf33r6zu5fr453x5sxltmsjoeg3" since agent on node "server01.domain.local" is down: Agent not live or unreachable
2018/04/02 02:49:02 [INFO] consul: Skipping service "_nomad-task-i2v4zgkkhrark6on3yxgt4utmlevy2h6" since agent on node "server01.domain.local" is down: Agent not live or unreachable
2018/04/02 02:49:02 [INFO] consul: Skipping service "_nomad-client-ti5iwc442lgxgolvuvtv3e65xx6jkpxh" since agent on node "server01.domain.local" is down: Agent not live or unreachable
2018/04/02 02:49:02 [INFO] consul: Skipping service "_nomad-task-g4gcd4axdsybzyn4phavcij4efe4dvgq" since agent on node "server01.domain.local" is down: Agent not live or unreachable
2018/04/02 02:49:03 [INFO] consul: Health changed to #13223869
2018/04/02 02:49:03 [INFO] consul: Skipping service "_nomad-task-kqo3mpf33r6zu5fr453x5sxltmsjoeg3" since agent on node "server01.domain.local" is down: Agent not live or unreachable
2018/04/02 02:49:03 [INFO] consul: Skipping service "_nomad-task-i2v4zgkkhrark6on3yxgt4utmlevy2h6" since agent on node "server01.domain.local" is down: Agent not live or unreachable
2018/04/02 02:49:03 [INFO] consul: Skipping service "_nomad-client-ti5iwc442lgxgolvuvtv3e65xx6jkpxh" since agent on node "server01.domain.local" is down: Agent not live or unreachable
2018/04/02 02:49:03 [INFO] consul: Skipping service "_nomad-task-g4gcd4axdsybzyn4phavcij4efe4dvgq" since agent on node "server01.domain.local" is down: Agent not live or unreachable
2018/04/02 02:49:03 [INFO] consul: Manual config changed to #13223870
2018/04/02 02:49:03 [INFO] consul: Manual config changed to #13223871
2018/04/02 02:49:03 [INFO] consul: Manual config changed to #13223872
2018/04/02 02:49:04 [INFO] consul: Manual config changed to #13223873
2018/04/02 02:49:04 [INFO] consul: Manual config changed to #13223874
2018/04/02 02:49:04 [INFO] consul: Health changed to #13223875
2018/04/02 02:49:05 [INFO] consul: Health changed to #13223877
2018/04/02 02:49:05 [INFO] consul: Health changed to #13223881
2018/04/02 02:49:06 [INFO] consul: Health changed to #13223882
2018/04/02 02:49:06 [INFO] consul: Health changed to #13223883
2018/04/02 02:49:07 [INFO] consul: Health changed to #13223887
2018/04/02 02:49:08 [INFO] consul: Manual config changed to #13223888
2018/04/02 02:49:08 [INFO] consul: Manual config changed to #13223889
2018/04/02 02:49:08 [INFO] consul: Manual config changed to #13223890
2018/04/02 02:49:08 [INFO] consul: Manual config changed to #13223891
2018/04/02 02:49:09 [INFO] consul: Manual config changed to #13223892
2018/04/02 02:49:09 [INFO] consul: Health changed to #13223893
2018/04/02 02:49:09 [INFO] consul: Health changed to #13223902
2018/04/02 02:49:09 [INFO] consul: Health changed to #13223905
2018/04/02 02:49:10 [INFO] consul: Health changed to #13223907
2018/04/02 02:49:11 [INFO] consul: Health changed to #13223908
2018/04/02 02:49:11 [INFO] consul: Health changed to #13223912
2018/04/02 02:49:13 [WARN] No route for 192.168.15.244:80/
2018/04/02 02:49:13 [INFO] consul: Health changed to #13223913
2018/04/02 02:49:13 [INFO] consul: Health changed to #13223917
2018/04/02 02:49:13 [INFO] consul: Manual config changed to #13223922
2018/04/02 02:49:13 [INFO] consul: Manual config changed to #13223923
2018/04/02 02:49:13 [INFO] consul: Manual config changed to #13223924
2018/04/02 02:49:13 [INFO] consul: Manual config changed to #13223925
2018/04/02 02:49:14 [INFO] consul: Health changed to #13223921
2018/04/02 02:49:14 [INFO] consul: Manual config changed to #13223926
2018/04/02 02:49:15 [INFO] consul: Health changed to #13223928
2018/04/02 02:49:15 [INFO] consul: Health changed to #13223929
2018/04/02 02:49:16 [INFO] consul: Health changed to #13223930
2018/04/02 02:49:16 [INFO] consul: Health changed to #13223931
2018/04/02 02:49:17 [INFO] consul: Health changed to #13223932
2018/04/02 02:49:18 [INFO] consul: Health changed to #13223933
2018/04/02 02:49:18 [INFO] consul: Health changed to #13223942
2018/04/02 02:49:18 [INFO] consul: Manual config changed to #13223948
2018/04/02 02:49:18 [INFO] consul: Manual config changed to #13223949
2018/04/02 02:49:18 [INFO] consul: Manual config changed to #13223951
2018/04/02 02:49:18 [INFO] consul: Manual config changed to #13223952
2018/04/02 02:49:19 [INFO] consul: Health changed to #13223950
2018/04/02 02:49:19 [INFO] consul: Manual config changed to #13223953
2018/04/02 02:49:19 [INFO] consul: Health changed to #13223954
2018/04/02 02:49:19 [INFO] consul: Health changed to #13223955
2018/04/02 02:49:20 [INFO] consul: Health changed to #13223959
2018/04/02 02:49:20 [INFO] consul: Health changed to #13223963
2018/04/02 02:49:21 [INFO] consul: Health changed to #13223964
2018/04/02 02:49:23 [INFO] consul: Manual config changed to #13223965
2018/04/02 02:49:23 [INFO] consul: Manual config changed to #13223966
2018/04/02 02:49:23 [INFO] consul: Manual config changed to #13223967
2018/04/02 02:49:23 [INFO] consul: Manual config changed to #13223968
2018/04/02 02:49:24 [INFO] consul: Manual config changed to #13223969
2018/04/02 02:49:24 [INFO] consul: Health changed to #13223970
2018/04/02 02:49:25 [INFO] consul: Health changed to #13223972
2018/04/02 02:49:26 [INFO] consul: Health changed to #13223974
2018/04/02 02:49:27 [INFO] consul: Health changed to #13223999
2018/04/02 02:49:27 [INFO] consul: Health changed to #13224000
2018/04/02 02:49:28 [INFO] consul: Manual config changed to #13224011
2018/04/02 02:49:28 [INFO] consul: Manual config changed to #13224012
2018/04/02 02:49:28 [INFO] consul: Manual config changed to #13224013
2018/04/02 02:49:28 [INFO] consul: Manual config changed to #13224014
2018/04/02 02:49:29 [INFO] consul: Manual config changed to #13224019
2018/04/02 02:49:29 [INFO] consul: Health changed to #13224018
2018/04/02 02:49:29 [INFO] consul: Health changed to #13224024
2018/04/02 02:49:30 [INFO] consul: Health changed to #13224025
2018/04/02 02:49:31 [INFO] consul: Health changed to #13224027
2018/04/02 02:49:33 [INFO] consul: Manual config changed to #13224028
2018/04/02 02:49:33 [INFO] consul: Manual config changed to #13224029
2018/04/02 02:49:33 [INFO] consul: Manual config changed to #13224030
2018/04/02 02:49:33 [INFO] consul: Manual config changed to #13224031
2018/04/02 02:49:34 [INFO] consul: Manual config changed to #13224032
2018/04/02 02:49:34 [INFO] consul: Health changed to #13224033
2018/04/02 02:49:35 [INFO] consul: Health changed to #13224038
2018/04/02 02:49:39 [INFO] consul: Manual config changed to #13224053
2018/04/02 02:49:39 [INFO] consul: Manual config changed to #13224058
2018/04/02 02:49:39 [INFO] consul: Health changed to #13224057
2018/04/02 02:49:39 [INFO] consul: Health changed to #13224059
2018/04/02 02:49:40 [INFO] consul: Health changed to #13224082
2018/04/02 02:49:40 [INFO] consul: Health changed to #13224086
2018/04/02 02:49:41 [INFO] consul: Health changed to #13224087
2018/04/02 02:49:41 [INFO] consul: Skipping service "_nomad-client-tkfk2zv5z6gnkyqsdyaeg2yl4rvn6ev3" since agent on node "server03.domain.local" is down: Agent not live or unreachable
2018/04/02 02:49:41 [INFO] consul: Skipping service "_nomad-task-v6gtjcqonomkoyp6leey2oqc4wazwjfx" since agent on node "server03.domain.local" is down: Agent not live or unreachable
2018/04/02 02:49:41 [INFO] consul: Skipping service "_nomad-task-lmknmojjyl5jgufcy3nlece4sz7nkh7z" since agent on node "server03.domain.local" is down: Agent not live or unreachable
2018/04/02 02:49:41 [INFO] consul: Skipping service "_nomad-task-m75il25sxrkxn7f4rrxtmfmtehnfynsg" since agent on node "server03.domain.local" is down: Agent not live or unreachable
2018/04/02 02:49:41 [INFO] consul: Skipping service "_nomad-task-oy7dkh3hfhnhoc4y7i5sf2khd6bg7pb7" since agent on node "server03.domain.local" is down: Agent not live or unreachable
2018/04/02 02:49:41 [INFO] consul: Skipping service "_nomad-task-4mu5ghi5jyoslzorvx3vbt5mgt5f2ivm" since agent on node "server03.domain.local" is down: Agent not live or unreachable
2018/04/02 02:49:51 [WARN] No route for 192.168.15.244:443/
2018/04/02 02:49:54 [INFO] consul: Manual config changed to #13224109
2018/04/02 02:50:13 [WARN] No route for 192.168.15.244:80/
2018/04/02 02:50:37 [WARN] No route for 192.168.15.244:443/
2018/04/02 02:50:37 http: TLS handshake error from 192.168.124.77:51545: remote error: tls: unknown certificate
2018/04/02 02:50:51 [WARN] No route for 192.168.15.244:443/
2018/04/02 02:51:13 [WARN] No route for 192.168.15.244:80/
2018/04/02 02:51:51 [WARN] No route for 192.168.15.244:443/
2018/04/02 02:52:13 [WARN] No route for 192.168.15.244:80/
2018/04/02 02:52:52 [WARN] No route for 192.168.15.244:443/
2018/04/02 02:53:13 [WARN] No route for 192.168.15.244:80/
2018/04/02 02:53:51 [WARN] No route for 192.168.15.244:443/
2018/04/02 02:54:13 [WARN] No route for 192.168.15.244:80/
2018/04/02 02:54:51 [WARN] No route for 192.168.15.244:443/
2018/04/02 02:55:13 [WARN] No route for 192.168.15.244:80/
2018/04/02 02:55:37 [WARN] No route for 192.168.15.244:443/
2018/04/02 02:55:37 http: TLS handshake error from 192.168.124.77:58153: remote error: tls: unknown certificate
2018/04/02 02:55:51 [WARN] No route for 192.168.15.244:443/
2018/04/02 02:56:13 [WARN] No route for 192.168.15.244:80/
2018/04/02 02:56:51 [WARN] No route for 192.168.15.244:443/
2018/04/02 02:57:13 [WARN] No route for 192.168.15.244:80/
2018/04/02 02:57:51 [WARN] No route for 192.168.15.244:443/
2018/04/02 02:58:13 [WARN] No route for 192.168.15.244:80/
2018/04/02 02:58:51 [WARN] No route for 192.168.15.244:443/
2018/04/02 02:59:13 [WARN] No route for 192.168.15.244:80/
2018/04/02 02:59:51 [WARN] No route for 192.168.15.244:443/
2018/04/02 03:00:13 [WARN] No route for 192.168.15.244:80/
2018/04/02 03:00:37 [WARN] No route for 192.168.15.244:443/
2018/04/02 03:00:37 http: TLS handshake error from 192.168.124.77:64395: remote error: tls: unknown certificate
2018/04/02 03:00:51 [WARN] No route for 192.168.15.244:443/
2018/04/02 03:01:13 [WARN] No route for 192.168.15.244:80/
2018/04/02 03:01:51 [WARN] No route for 192.168.15.244:443/
2018/04/02 03:02:13 [WARN] No route for 192.168.15.244:80/
2018/04/02 03:02:51 [WARN] No route for 192.168.15.244:443/
2018/04/02 03:03:13 [WARN] No route for 192.168.15.244:80/
2018/04/02 03:03:51 [WARN] No route for 192.168.15.244:443/
2018/04/02 03:04:13 [WARN] No route for 192.168.15.244:80/
2018/04/02 03:04:51 [WARN] No route for 192.168.15.244:443/
2018/04/02 03:05:13 [WARN] No route for 192.168.15.244:80/

magiconair commented 6 years ago

I meant the health data from Consul, https://www.consul.io/api/health.html#list-checks-in-state

Fabio runs this query to get the current health data for all services from Consul and then keeps only the healthy ones.

curl http://localhost:8500/v1/health/state/any
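
The same endpoint can be filtered client-side to approximate what fabio would consider routable, e.g. keeping only checks whose Status is passing (assuming jq is available):

curl -s http://localhost:8500/v1/health/state/any | jq '[.[] | select(.Status == "passing")]'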

Two observations:

  1. Why is your health data changing all the time? This looks like either a flapping service or volatile health check output, e.g. a timestamp or process id in the output of the check script.
  2. I'm not sure why the Manual config is changing as well. Do you have a script that writes to the fabio/config KV key?

Again, fabio pulls its data from consul, and all fabios do the same thing since they rely on Consul to provide consistency. Your problem is much more likely in your Consul setup than in fabio.

codyja commented 6 years ago

Ok very interesting. I'll look into your observations.

I hit that consul health/state/any endpoint on the host where the fabio container with stale routes is running. I see my specific service with accurate dst IP:port info, and it shows passing. Also, if I hit localhost:8500/v1/catalog/service/ on this host, I get the accurate dst IP:port info. It seems like consul (both the cluster and the local agent) has the accurate service details, but this fabio (out of about 40) has old routes. Is there a way I can confirm whether this running fabio is actively polling the local consul agent or not? Just curious: does fabio keep the route table in memory, or persist it to a file on disk? Thanks

magiconair commented 6 years ago

Fabio keeps routes in memory. You should look for changes to the routing table, which fabio logs as a delta. They should be the same on every fabio instance.

Do you have multiple health checks for a service? Do you have unique names for the service instances?

magiconair commented 6 years ago

You can control the output format of the routing table with http://fabiolb.net/ref/log.routes.format/
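
For example, the default delta format only logs changes; per the linked docs, switching the format makes fabio log the full routing table on every update, which makes it easier to diff instances:

fabio -log.routes.format all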

codyja commented 6 years ago

So I looked back through the logs on multiple fabios. I can see the '[INFO] Config updates' messages along with the route adds. What's interesting is that all the fabios, including the problem fabio container, have the same log entries up until 02:38:48 today. The next Config updates message was at 02:54:42, and it was logged on every fabio except the problem container. The problem fabio container hasn't logged any more Config update messages today, or in fact any additional consul messages.

codyja commented 6 years ago

Also, regarding the service names, we should have unique names for all services. Occasionally, when we launch a nomad job, we'll register a few different service names for the same service (we do this for fabio too).

  service {
    name = "fabio-oln-lb-a"
    tags = ["fabio"]
    port = "lb"

    check {
      type     = "tcp"
      port     = "lb"
      path     = "/"
      interval = "10s"
      timeout  = "2s"
    }

  }
  service {
    name = "fabio-oln-lb"
    tags = ["fabio"]
    port = "lb"

    check {
      type     = "tcp"
      port     = "lb"
      path     = "/"
      interval = "10s"
      timeout  = "2s"
    }

  }
  service {
    name = "fabio-oln-ui-a"
    tags = ["fabio"]
    port = "ui"

    check {
      type     = "tcp"
      port     = "ui"
      path     = "/"
      interval = "10s"
      timeout  = "2s"
    }

  }

codyja commented 6 years ago

Before I restart the last two fabios in prod that aren't receiving updates from consul anymore, is there anything else I can capture or look at?

magiconair commented 6 years ago

Does the consul agent they are connected to still receive updates? Ideally, you want to check whether the raft log index moves forward. I'm trying to find a good API call to test that.
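
One way to watch the raft index from the agent a fabio talks to: Consul returns the current index in the X-Consul-Index response header on catalog and health reads, so you can check that it keeps advancing, e.g.:

curl -s -o /dev/null -D - http://localhost:8500/v1/catalog/services | grep -i x-consul-index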

codyja commented 6 years ago

Yes, the consul agent on the host where this fabio was running was definitely seeing updates. If I queried the consul agent on this host, I could see the service in question, and the destinations were the correct IP:port locations. I didn't look at the raft index though. If I queried host:9998/api/routes and found the service in question, it had old IP:port locations though.

codyja commented 6 years ago

Hi @magiconair, We did a production nomad deployment this morning to change a nomad job's app container version. Out of about 24 fabios running, 3 of them are showing old routes for this updated service. I got on a zoom with Sabin from Consul support and he looked over the environment. He doesn't believe the cause is consul directly, but thinks this is being caused by fabio. He wanted to see if it would be possible to get everyone on a zoom to look at this more. Is this something we could possibly do? Thanks

codyja commented 6 years ago

Another thing I just noticed: yesterday, when we hit this issue on 3 fabios, we actually did 2 nomad job deployments at almost exactly the same time. The first job (called primary) is the one whose service was not updated on 3 of the 24 fabios. The second job (called secondary) was deployed later and DID have its service updated on all fabios, including the 3 problematic ones.

codyja commented 6 years ago

Hi @magiconair, I figured out what was happening here. The local consul agent either hangs or loses connectivity to the consul cluster. Nomad logs:

Apr 9 04:53:08 hostname nomad: 2018/04/09 04:53:08.099597 [WARN] consul.checks: check "healthcheck" for task "mytaskname" alloc "d65871db-x" timed out (2s)
Apr 9 04:53:13 hostname nomad: 2018/04/09 04:53:13.463818 [WARN] consul.checks: update for task "mytaskname" alloc "d65871db-x" check "healthcheck" failed: Put http://127.0.0.1:8500/v1/agent/check/update/7cd03a152b3xxxxxxxx: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Apr 9 04:53:22 hostname nomad: 2018/04/09 04:53:22.696985 [INFO] fingerprint.consul: consul agent is unavailable

It seems like when the consul agent becomes healthy again, the fabio on this host never recovers. Would it be possible to handle this situation a little differently? This also seems related to https://github.com/fabiolb/fabio/issues/162

Thanks Frank.

magiconair commented 6 years ago

Fabio just does a long poll on the Consul API and waits for a notification of a raft change - any change. Once the Consul raft log changes fabio will pull all the data and rebuild the routing table. This is when you see all the Health changed to #nnnn messages. The number is the raft index.

When there is no notification there is little I can do. Fabio could time out the Consul connection and retry after some time but I still think that the issue is somewhere in the Consul/Nomad setup.
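
A minimal sketch of that long-poll pattern using the stock github.com/hashicorp/consul/api client (illustrative only, not fabio's actual code; client comes from api.NewClient, and the routing-table rebuild is represented by a placeholder):

var lastIndex uint64
for {
    // blocks until the raft index moves past lastIndex (or the server-side wait time expires)
    checks, meta, err := client.Health().State("any", &api.QueryOptions{WaitIndex: lastIndex})
    if err != nil {
        log.Printf("[WARN] consul: %v", err)
        time.Sleep(time.Second)
        continue
    }
    log.Printf("[INFO] consul: Health changed to #%d", meta.LastIndex)
    lastIndex = meta.LastIndex
    rebuildRoutingTable(checks) // placeholder for fabio's routing-table rebuild
}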

magiconair commented 6 years ago

Did you try different nomad and/or consul versions?

codyja commented 6 years ago

We're testing the latest fabio in our staging env and should be putting it into production soon. We're running the latest nomad, and consul is 1.0.1. I will also mention that the consul agent on the host doesn't seem to fully go down; I think it's just hung up. We're thinking it's something at the host level, possibly a backup process. The problem, I think, is that when the consul agent becomes healthy again on these hosts, fabio has gotten into an odd state and doesn't seem to do any further lookups until I restart it.

magiconair commented 6 years ago

If you're comfortable building your own fabio you could try adding a WaitTime: 5 * time.Second to the client config in

https://github.com/fabiolb/fabio/blob/master/registry/consul/backend.go#L23

Otherwise, the wait calls will block until there is a change. Don't set this value too low since fabio will pull all data every time. To prevent this add this as well:

https://github.com/fabiolb/fabio/blob/master/registry/consul/service.go#L31

if meta.LastIndex == lastIndex {
    continue
}

I can't see anything major related in the consul changelog but you could try the latest consul to see whether that makes a difference. The issue seems reproducible enough in your environment.
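
Putting the two suggested changes together, the idea looks roughly like this (a sketch against the consul api package, not fabio's exact code):

cfg := api.DefaultConfig()
cfg.WaitTime = 5 * time.Second // blocking queries return after ~5s even without a change

// in the watch loop:
checks, meta, err := client.Health().State("any", &api.QueryOptions{WaitIndex: lastIndex})
if err != nil {
    continue // back off and retry as before
}
if meta.LastIndex == lastIndex {
    continue // index unchanged: skip rebuilding the routing table
}
lastIndex = meta.LastIndex
// ... rebuild the routing table from checks ...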

samm-git commented 6 years ago

I see exactly the same issue in our env. We have about 12 fabios, and sometimes (not often) one of them stalls on updates. To detect this I wrote a flapping service and check for the [INFO] Config updates message every few minutes, which lets me catch it easily. It happens ~once a week, so it's not very easy to reproduce.

In my env fabio 1.5.8 is connected to the local consul 1.0.2. At the time of checking, the consul agent looks healthy and contains the correct routing. Maybe it would be possible to add some kind of watchdog feature that triggers a reconnect to consul if updates are not received for more than n minutes?
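
Sketching that watchdog idea in Go (purely illustrative, not fabio code): signal a channel whenever an update is applied and force a reconnect when the timer fires.

func watchdog(updates <-chan struct{}, stale time.Duration, reconnect func()) {
    t := time.NewTimer(stale)
    defer t.Stop()
    for {
        select {
        case <-updates: // an update arrived; push the deadline out again
            if !t.Stop() {
                <-t.C
            }
            t.Reset(stale)
        case <-t.C: // no updates for `stale`; assume the watch is wedged
            log.Printf("[WARN] no consul updates for %v, reconnecting", stale)
            reconnect()
            t.Reset(stale)
        }
    }
}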

samm-git commented 6 years ago

@magiconair I think we can try to build a patched version to see if it helps. The only problem is that we will have to wait a few weeks to see if it solves the issue.

samm-git commented 6 years ago

@magiconair I'm not sure the proposed workaround will change anything. From the consul docs:

In addition to index, endpoints that support blocking will also honor a wait parameter specifying a maximum duration for the blocking request. This is limited to 10 minutes. If not set, the wait time defaults to 5 minutes.

So it should already be set to 5 minutes. And I saw fabios hang in a "no-updates" state for at least a few hours. This makes me think that the consul update thread is simply hung in that case.

murphymj25 commented 6 years ago

We're seeing similar behavior with delays in Fabio reflecting route updates. We are using fabio 1.5.9 and consul 1.2.2 (upgraded from 1.0.2 to see if that would help). The interesting thing is that when Fabio does update a few minutes after the local consul agent updates, we do not see the route change in the logs. We are also seeing memory continue to grow. With the routes stored in memory, I'm not sure if this is related. It seems like this starts to happen when the devices have been up for a while and memory usage on the linux server is above 60%.

I have tried restarting the consul agent, but that doesn't seem to make a difference; after restarting fabio, however, things appear to go back to normal.

murphymj25 commented 6 years ago

It looks like the memory growth may be related to the metrics. We have opened a separate issue #530 and provided the details in that thread.

aaronhurt commented 6 years ago

@murphymj25 Thank you for the analysis.

samm-git commented 6 years ago

Ok, I got this problem again. Fabio had stopped updating its configuration. At the same time there was another instance on the same VM which was working just fine, so it's unlikely to be a consul problem.

I terminated it with SIGTRAP to get a trace. After a restart it works fine again. The trace is available at https://gist.github.com/samm-git/23963653d57045218a00564184f0647f

tpwow commented 6 years ago

I got the same problem with successive health changes in fabio. I looked through both the consul server and the local consul agent logs, but found nothing weird.

jorgemarey commented 5 years ago

Hi, we're experiencing this as well. Any updates on this? We're running fabio 1.5.9 with consul 1.5.1

It happened in production and we just restarted fabio to get it working again. If it happens again I'll get some logs.

codyja commented 5 years ago

We found that what triggered this issue was VMware-level snapshots happening on the VM. When disk quiescing (disk pausing) occurred during the snapshot, something in fabio was hanging and not recovering. We ended up disabling VM-level backups since these are docker nodes and didn't really need backups.

jorgemarey commented 5 years ago

Hi @codyja, thank you for your answer. We are seeing this on AWS and don't have anything snapshot-related configured on our instances, but we do sometimes see high IO wait values on them, so maybe that could be related. I'll keep looking into this to see if I find anything more specific. Thanks!

vincenthuynh commented 4 years ago

Hi, I found this issue because I was encountering the same symptoms described. It turned out that one of our Consul servers was under high load and halting.

We discovered it when checking the raft log index for each server:

curl -H 'X-Consul-Token: XXXXX' http://127.0.0.1:8500/v1/operator/autopilot/health | jq '.'
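
The per-server fields in that response are what to watch: a server whose LastIndex stops advancing, or whose Healthy flag flips to false, is the one under strain, e.g.:

curl -s -H 'X-Consul-Token: XXXXX' http://127.0.0.1:8500/v1/operator/autopilot/health | jq '.Servers[] | {Name, Healthy, LastIndex, LastContact}'
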
aaronhurt commented 4 years ago

@vincenthuynh Would you have still found this issue if it was closed? I'm debating closing old questions but I'm not sure how that would affect people looking for information.

samm-git commented 4 years ago

@leprechau we are still facing this issue regularly, something like a few times per month.

aaronhurt commented 4 years ago

@samm-git how is fabio connecting to consul? Do you have a consul client agent on the same machine that fabio connects to via localhost?

Could you kill -USR1 <pid-of-consul> and attach a dump of the consul telemetry next time this happens?

aaronhurt commented 4 years ago

You'll need to be logging consul output or have consul monitor up at the same time.

https://learn.hashicorp.com/consul/day-2-operations/monitoring?utm_source=consul.io&utm_medium=docs#sigusr1-for-local-use
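
For example, stream the agent's log in one terminal and send the signal from another; the telemetry dump then shows up in the monitored output (the pid lookup is illustrative):

consul monitor -log-level=info
kill -USR1 "$(pidof consul)"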

vincenthuynh commented 4 years ago

@vincenthuynh Would you have still found this issue if it was closed? I'm debating closing old questions but I'm not sure how that would affect people looking for information.

@leprechau I usually search keywords across all issues (opened and closed). I look to the status to tell me whether the issue has a solution or workaround before I scroll to the bottom.

aaronhurt commented 4 years ago

@vincenthuynh Okay, thank you for the response. That helps.

samm-git commented 4 years ago

@leprechau

  1. yes, running on the same machine
  2. Ok, will do

aaronhurt commented 4 years ago

@samm-git Thank you. The telemetry data will help in determining the root cause.