fabiolb / fabio

Consul Load-Balancing made simple
https://fabiolb.net
MIT License

Fabio adds routes for TCP proxy even when consul health checks are failing #338

Open akissa opened 7 years ago

akissa commented 7 years ago

I have 3 PostgreSQL servers in a cluster; only one is in write mode at a time. The service and checks for each node are registered in Consul. The active node passes all checks, shows green in the Consul UI, and logs as passing. The other two nodes fail checks, show orange in the Consul UI, and log the check failures.

Even so, fabio adds routes to all 3 servers when it should only have a route to the master, which passes the checks.

magiconair commented 7 years ago

That shouldn't be the case. I'll have a look. Do you have unique service ids?

akissa commented 7 years ago

Yes, the service IDs and names are unique; they are based on service-hostname.

magiconair commented 7 years ago

This code filters out services which are not passing before doing anything else and this is AFAICT protocol agnostic:

https://github.com/fabiolb/fabio/blob/master/registry/consul/passing.go#L13-L49

Can you provide the output of curl 'localhost:8500/v1/health/state/any?consistent&pretty'?

akissa commented 7 years ago

The output is below. I think it happens when a service has more than one check. I tried it with just the one check and it was working okay. With the two checks per service as below it adds the route even when 1 of the checks is critical.

Consul

curl '192.168.1.34:8500/v1/health/state/any?consistent&pretty'
[
    {
        "Node": "db.home.topdog-software.com",
        "CheckID": "service:pgsql-db.home.topdog-software.com:1",
        "Name": "pgsql replica on db.home.topdog-software.com",
        "Status": "critical",
        "Notes": "",
        "Output": "HTTP GET http://192.168.1.15:8008: 503 Service Unavailable Output: {\"database_system_identifier\": \"6428526973303804376\", \"postmaster_start_time\": \"2017-08-25 06:54:28.913 UTC\", \"xlog\": {\"received_location\": 2229388368, \"replayed_timestamp\": \"2017-08-25 07:08:22.186 UTC\", \"paused\": false, \"replayed_location\": 2229388368}, \"patroni\": {\"scope\": \"baruwa\", \"version\": \"1.3.3\"}, \"state\": \"running\", \"role\": \"replica\", \"server_version\": 90604}",
        "ServiceID": "pgsql-db.home.topdog-software.com",
        "ServiceName": "pgsql-db.home.topdog-software.com",
        "ServiceTags": [
            "urlprefix-:5432 proto=tcp"
        ],
        "CreateIndex": 45331,
        "ModifyIndex": 45471
    },
    {
        "Node": "db3.home.topdog-software.com",
        "CheckID": "service:pgsql-db3.home.topdog-software.com:1",
        "Name": "pgsql replica on db3.home.topdog-software.com",
        "Status": "critical",
        "Notes": "",
        "Output": "HTTP GET http://192.168.1.34:8008: 503 Service Unavailable Output: {\"database_system_identifier\": \"6428526973303804376\", \"postmaster_start_time\": \"2017-08-25 06:54:29.518 UTC\", \"xlog\": {\"received_location\": 2229388368, \"replayed_timestamp\": \"2017-08-25 07:08:22.186 UTC\", \"paused\": false, \"replayed_location\": 2229388368}, \"patroni\": {\"scope\": \"baruwa\", \"version\": \"1.3.3\"}, \"state\": \"running\", \"role\": \"replica\", \"server_version\": 90604}",
        "ServiceID": "pgsql-db3.home.topdog-software.com",
        "ServiceName": "pgsql-db3.home.topdog-software.com",
        "ServiceTags": [
            "urlprefix-:5432 proto=tcp"
        ],
        "CreateIndex": 45424,
        "ModifyIndex": 45474
    },
    {
        "Node": "db.home.topdog-software.com",
        "CheckID": "serfHealth",
        "Name": "Serf Health Status",
        "Status": "passing",
        "Notes": "",
        "Output": "Agent alive and reachable",
        "ServiceID": "",
        "ServiceName": "",
        "ServiceTags": [],
        "CreateIndex": 40965,
        "ModifyIndex": 45328
    },
    {
        "Node": "db.home.topdog-software.com",
        "CheckID": "service:pgsql-db.home.topdog-software.com:2",
        "Name": "pgbouncer on db.home.topdog-software.com",
        "Status": "passing",
        "Notes": "",
        "Output": "TCP connect 192.168.1.15:5432: Success",
        "ServiceID": "pgsql-db.home.topdog-software.com",
        "ServiceName": "pgsql-db.home.topdog-software.com",
        "ServiceTags": [
            "urlprefix-:5432 proto=tcp"
        ],
        "CreateIndex": 45333,
        "ModifyIndex": 45470
    },
    {
        "Node": "db2.home.topdog-software.com",
        "CheckID": "serfHealth",
        "Name": "Serf Health Status",
        "Status": "passing",
        "Notes": "",
        "Output": "Agent alive and reachable",
        "ServiceID": "",
        "ServiceName": "",
        "ServiceTags": [],
        "CreateIndex": 40965,
        "ModifyIndex": 40965
    },
    {
        "Node": "db2.home.topdog-software.com",
        "CheckID": "service:pgsql-db2.home.topdog-software.com:1",
        "Name": "pgsql replica on db2.home.topdog-software.com",
        "Status": "passing",
        "Notes": "",
        "Output": "HTTP GET http://192.168.1.33:8008: 200 OK Output: {\"database_system_identifier\": \"6428526973303804376\", \"postmaster_start_time\": \"2017-08-25 06:46:43.462 UTC\", \"xlog\": {\"location\": 2229387608}, \"patroni\": {\"scope\": \"baruwa\", \"version\": \"1.3.3\"}, \"replication\": [{\"sync_state\": \"sync\", \"sync_priority\": 1, \"client_addr\": \"192.168.1.15\", \"state\": \"streaming\", \"application_name\": \"db.home.topdog-software.com\", \"usename\": \"replicator\"}, {\"sync_state\": \"async\", \"sync_priority\": 0, \"client_addr\": \"192.168.1.34\", \"state\": \"streaming\", \"application_name\": \"db3.home.topdog-software.com\", \"usename\": \"replicator\"}], \"state\": \"running\", \"role\": \"master\", \"server_version\": 90604}",
        "ServiceID": "pgsql-db2.home.topdog-software.com",
        "ServiceName": "pgsql-db2.home.topdog-software.com",
        "ServiceTags": [
            "urlprefix-:5432 proto=tcp"
        ],
        "CreateIndex": 45381,
        "ModifyIndex": 45458
    },
    {
        "Node": "db2.home.topdog-software.com",
        "CheckID": "service:pgsql-db2.home.topdog-software.com:2",
        "Name": "pgbouncer on db2.home.topdog-software.com",
        "Status": "passing",
        "Notes": "",
        "Output": "TCP connect 192.168.1.33:5432: Success",
        "ServiceID": "pgsql-db2.home.topdog-software.com",
        "ServiceName": "pgsql-db2.home.topdog-software.com",
        "ServiceTags": [
            "urlprefix-:5432 proto=tcp"
        ],
        "CreateIndex": 45383,
        "ModifyIndex": 45459
    },
    {
        "Node": "db3.home.topdog-software.com",
        "CheckID": "serfHealth",
        "Name": "Serf Health Status",
        "Status": "passing",
        "Notes": "",
        "Output": "Agent alive and reachable",
        "ServiceID": "",
        "ServiceName": "",
        "ServiceTags": [],
        "CreateIndex": 40965,
        "ModifyIndex": 45004
    },
    {
        "Node": "db3.home.topdog-software.com",
        "CheckID": "service:pgsql-db3.home.topdog-software.com:2",
        "Name": "pgbouncer on db3.home.topdog-software.com",
        "Status": "passing",
        "Notes": "",
        "Output": "TCP connect 192.168.1.34:5432: Success",
        "ServiceID": "pgsql-db3.home.topdog-software.com",
        "ServiceName": "pgsql-db3.home.topdog-software.com",
        "ServiceTags": [
            "urlprefix-:5432 proto=tcp"
        ],
        "CreateIndex": 45425,
        "ModifyIndex": 45475
    }
]

Fabio

./fabio -cfg fabio.properties
2017/08/25 09:13:11 [INFO] Runtime config
{
    "Proxy": {
        "Strategy": "rnd",
        "Matcher": "prefix",
        "NoRouteStatus": 404,
        "MaxConn": 10000,
        "ShutdownWait": 0,
        "DialTimeout": 30000000000,
        "ResponseHeaderTimeout": 0,
        "KeepAliveTimeout": 0,
        "FlushInterval": 1000000000,
        "LocalIP": "192.168.1.20",
        "ClientIPHeader": "",
        "TLSHeader": "",
        "TLSHeaderValue": "",
        "GZIPContentTypes": null,
        "RequestID": ""
    },
    "Registry": {
        "Backend": "consul",
        "Static": {
            "Routes": ""
        },
        "File": {
            "Path": ""
        },
        "Consul": {
            "Addr": "192.168.1.34:8500",
            "Scheme": "http",
            "Token": "",
            "KVPath": "/fabio/config",
            "TagPrefix": "urlprefix-",
            "Register": false,
            "ServiceAddr": ":9998",
            "ServiceName": "fabio",
            "ServiceTags": null,
            "ServiceStatus": [
                "passing"
            ],
            "CheckInterval": 1000000000,
            "CheckTimeout": 3000000000,
            "CheckScheme": "http",
            "CheckTLSSkipVerify": false
        },
        "Timeout": 10000000000,
        "Retry": 500000000
    },
    "Listen": [
        {
            "Addr": ":5432",
            "Proto": "tcp",
            "ReadTimeout": 0,
            "WriteTimeout": 0,
            "CertSource": {
                "Name": "",
                "Type": "",
                "CertPath": "",
                "KeyPath": "",
                "ClientCAPath": "",
                "CAUpgradeCN": "",
                "Refresh": 0,
                "Header": null
            },
            "StrictMatch": false,
            "TLSMinVersion": 0,
            "TLSMaxVersion": 0,
            "TLSCiphers": null
        }
    ],
    "Log": {
        "AccessFormat": "common",
        "AccessTarget": "",
        "RoutesFormat": "delta"
    },
    "Metrics": {
        "Target": "",
        "Prefix": "{{clean .Hostname}}.{{clean .Exec}}",
        "Names": "{{clean .Service}}.{{clean .Host}}.{{clean .Path}}.{{clean .TargetURL.Host}}",
        "Interval": 30000000000,
        "GraphiteAddr": "",
        "StatsDAddr": "",
        "Circonus": {
            "APIKey": "",
            "APIApp": "fabio",
            "APIURL": "",
            "CheckID": "",
            "BrokerID": ""
        }
    },
    "UI": {
        "Listen": {
            "Addr": ":9998",
            "Proto": "http",
            "ReadTimeout": 0,
            "WriteTimeout": 0,
            "CertSource": {
                "Name": "",
                "Type": "",
                "CertPath": "",
                "KeyPath": "",
                "ClientCAPath": "",
                "CAUpgradeCN": "",
                "Refresh": 0,
                "Header": null
            },
            "StrictMatch": false,
            "TLSMinVersion": 0,
            "TLSMaxVersion": 0,
            "TLSCiphers": null
        },
        "Color": "light-green",
        "Title": "",
        "Access": "rw"
    },
    "Runtime": {
        "GOGC": 800,
        "GOMAXPROCS": 2
    },
    "ProfileMode": "",
    "ProfilePath": "/tmp"
}
2017/08/25 09:13:11 [INFO] Version 1.5.2 starting
2017/08/25 09:13:11 [INFO] Go runtime is go1.8.3
2017/08/25 09:13:11 [INFO] Metrics disabled
2017/08/25 09:13:11 [INFO] Setting GOGC=800
2017/08/25 09:13:11 [INFO] Setting GOMAXPROCS=2
2017/08/25 09:13:11 [INFO] consul: Connecting to "192.168.1.34:8500" in datacenter "dc1"
2017/08/25 09:13:11 [INFO] consul: Not registering fabio in consul
2017/08/25 09:13:11 [INFO] Admin server access mode "rw"
2017/08/25 09:13:11 [INFO] Admin server listening on ":9998"
2017/08/25 09:13:11 [INFO] Waiting for first routing table
2017/08/25 09:13:11 [INFO] consul: Using dynamic routes
2017/08/25 09:13:11 [INFO] consul: Using tag prefix "urlprefix-"
2017/08/25 09:13:11 [INFO] consul: Watching KV path "/fabio/config"
2017/08/25 09:13:11 [INFO] consul: Manual config changed to #45610
2017/08/25 09:13:11 [INFO] consul: Health changed to #45609
2017/08/25 09:13:11 [INFO] TCP proxy listening on :5432
2017/08/25 09:13:11 [INFO] Config updates
+ route add pgsql-db3.home.topdog-software.com :5432 tcp://192.168.1.34:5432
+ route add pgsql-db2.home.topdog-software.com :5432 tcp://192.168.1.33:5432
+ route add pgsql-db.home.topdog-software.com :5432 tcp://192.168.1.15:5432
2017/08/25 09:13:15 [INFO] consul: Manual config changed to #45611
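
The behaviour expected in this report can be sketched in Go: group the entries from /v1/health/state/any by ServiceID and treat a service as routable only when every one of its checks is passing. This is a minimal illustration, not fabio's actual filtering code; the `Check` type and the `passingServices` function are hypothetical names, and node-level checks such as serfHealth are simply skipped here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// Check holds the fields of a Consul health entry that this sketch needs.
type Check struct {
	CheckID   string
	Status    string
	ServiceID string
}

// passingServices returns the sorted IDs of services whose checks are ALL passing.
// Entries with an empty ServiceID are node-level checks and are ignored here.
func passingServices(checks []Check) []string {
	allPassing := map[string]bool{}
	for _, c := range checks {
		if c.ServiceID == "" {
			continue
		}
		prev, seen := allPassing[c.ServiceID]
		// A single non-passing check marks the whole service as not routable.
		allPassing[c.ServiceID] = (!seen || prev) && c.Status == "passing"
	}
	ids := []string{}
	for id, ok := range allPassing {
		if ok {
			ids = append(ids, id)
		}
	}
	sort.Strings(ids)
	return ids
}

// sample is a trimmed copy of the health output posted above.
const sample = `[
  {"CheckID": "service:pgsql-db.home.topdog-software.com:1", "Status": "critical", "ServiceID": "pgsql-db.home.topdog-software.com"},
  {"CheckID": "service:pgsql-db.home.topdog-software.com:2", "Status": "passing", "ServiceID": "pgsql-db.home.topdog-software.com"},
  {"CheckID": "service:pgsql-db2.home.topdog-software.com:1", "Status": "passing", "ServiceID": "pgsql-db2.home.topdog-software.com"},
  {"CheckID": "service:pgsql-db2.home.topdog-software.com:2", "Status": "passing", "ServiceID": "pgsql-db2.home.topdog-software.com"},
  {"CheckID": "service:pgsql-db3.home.topdog-software.com:1", "Status": "critical", "ServiceID": "pgsql-db3.home.topdog-software.com"},
  {"CheckID": "service:pgsql-db3.home.topdog-software.com:2", "Status": "passing", "ServiceID": "pgsql-db3.home.topdog-software.com"},
  {"CheckID": "serfHealth", "Status": "passing", "ServiceID": ""}
]`

func main() {
	var checks []Check
	if err := json.Unmarshal([]byte(sample), &checks); err != nil {
		panic(err)
	}
	fmt.Println(passingServices(checks)) // prints [pgsql-db2.home.topdog-software.com]
}
```

With this rule only pgsql-db2 (the master) would get a route, whereas the log above shows routes for all three services.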

magiconair commented 7 years ago

@akissa Thx. This area of the code could indeed use some more tests. I'll have a look next week if that's OK. This is my last day of vacation.

akissa commented 7 years ago

No problem, I have actually refactored my workflow to remove the services of the slaves and only add the master, so it's okay. That way I do not have to deal with the logs filling up with service check errors for the slave systems.

On failover the demoted master gets removed and the promoted one inserted.

The fix would come in handy though for other scenarios where you are actually load balancing the connections.

pvandervelde commented 6 years ago

@magiconair Has this issue ever been fixed? I am seeing the same problem with Fabio 1.5.9 for services with multiple health checks. For me the output of curl 'localhost:8500/v1/health/state/any?consistent&pretty' is (unrelated nodes and services filtered out):

[
    {
        "Node": "NZDINH10-01",
        "CheckID": "notifications-NZDINH10-01 - mode.operating",
        "Name": "mode.operating",
        "Status": "critical",
        "Notes": "",
        "Output": "Maintenance mode:enabled",
        "ServiceID": "notifications-NZDINH10-01",
        "ServiceName": "notifications",
        "ServiceTags": [
            "http",
            "edgeproxyprefix-/services/notifications strip=/services/notifications"
        ],
        "Definition": {},
        "CreateIndex": 19375125,
        "ModifyIndex": 19468443
    },
    {
        "Node": "NZDINH10-01",
        "CheckID": "notifications-NZDINH10-01 - dependency: http.metrics",
        "Name": "dependency: http.metrics",
        "Status": "passing",
        "Notes": "",
        "Output": "http.metrics - Passing",
        "ServiceID": "notifications-NZDINH10-01",
        "ServiceName": "notifications",
        "ServiceTags": [
            "http",
            "edgeproxyprefix-/services/notifications strip=/services/notifications"
        ],
        "Definition": {},
        "CreateIndex": 19375124,
        "ModifyIndex": 19375358
    },
    {
        "Node": "NZDINH10-01",
        "CheckID": "notifications-NZDINH10-01 - dependency: http.queue",
        "Name": "dependency: http.queue",
        "Status": "passing",
        "Notes": "",
        "Output": "http.queue - Passing",
        "ServiceID": "notifications-NZDINH10-01",
        "ServiceName": "notifications",
        "ServiceTags": [
            "http",
            "edgeproxyprefix-/services/notifications strip=/services/notifications"
        ],
        "Definition": {},
        "CreateIndex": 19375123,
        "ModifyIndex": 19375343
    },
    {
        "Node": "NZDINH10-01",
        "CheckID": "notifications-NZDINH10-01 - vault.authentication",
        "Name": "vault.authentication",
        "Status": "passing",
        "Notes": "",
        "Output": "Credentials: authenticated",
        "ServiceID": "notifications-NZDINH10-01",
        "ServiceName": "notifications",
        "ServiceTags": [
            "http",
            "edgeproxyprefix-/services/notifications strip=/services/notifications"
        ],
        "Definition": {},
        "CreateIndex": 19375126,
        "ModifyIndex": 19375975
    },
    {
        "Node": "NZDINH10-01",
        "CheckID": "serfHealth",
        "Name": "Serf Health Status",
        "Status": "passing",
        "Notes": "",
        "Output": "Agent alive and reachable",
        "ServiceID": "",
        "ServiceName": "",
        "ServiceTags": [],
        "Definition": {},
        "CreateIndex": 18772817,
        "ModifyIndex": 18772817
    }
]

pvandervelde commented 6 years ago

For anyone who comes across this later: it looks like there might be a fix for this in PR #428, which adds strict health checking.