@thibaultcha @hishamhm @bungle can someone please check and help us resolve this issue?
@darrenjennings @daviesf1 @deirdre-anderson @DMarby notifying a few more people so someone can help us here.
Could you exec into the Kong node and do a dig (or equivalent) on the Cassandra service name to check that the new IPs are resolving correctly?
@hutchic the connection from the Kong pod/node to the Cassandra cluster DB is resolvable via the service name even after the IPs change.
wget ie-kong-db.******.svc
--2020-09-03 07:14:53-- http://ie-kong-db.*****.svc/
Resolving ie-kong-db.****.svc (ie-kong-db.****.svc)... 10.129.45.172, 10.131.28.98, 10.131.45.180
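For completeness, wget/dig goes through the system resolver; below is a minimal sketch (placeholder nameserver IP and namespace, run from an OpenResty context such as a timer or test location) of checking the same name from the OpenResty/Lua side with lua-resty-dns:

```lua
-- Minimal sketch with assumed placeholder values: query the service name via
-- lua-resty-dns, which is how name resolution works inside OpenResty, rather
-- than through the system resolver used by wget/dig.
local resolver = require "resty.dns.resolver"

local r, err = resolver:new{
  nameservers = { "10.96.0.10" }, -- placeholder: your cluster DNS service IP
  retrans = 3,
  timeout = 2000, -- ms
}
if not r then
  ngx.log(ngx.ERR, "failed to create resolver: ", err)
  return
end

local answers, err = r:query("ie-kong-db.my-namespace.svc", { qtype = r.TYPE_A })
if not answers then
  ngx.log(ngx.ERR, "failed to query: ", err)
  return
end
if answers.errcode then
  ngx.log(ngx.ERR, "DNS error ", answers.errcode, ": ", answers.errstr)
  return
end

for _, ans in ipairs(answers) do
  ngx.log(ngx.NOTICE, "resolved to: ", ans.address or ans.cname)
end
```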
I am suspecting two things here:
1. kong reload
2. [cassandra] failed to refresh cluster topology: failed to acquire refresh lock: timeout (ver_refresh=18830), context: ngx.timer
2020/09/03 07:17:43 [error] 163#0: *31626769 [lua] connector.lua:275: [cassandra] failed to refresh cluster topology: failed to acquire refresh lock: timeout (ver_refresh=18830), context: ngx.timer
2020/09/03 07:17:46 [warn] 163#0: *31629856 [lua] cluster.lua:182: set_peer_down(): [lua-cassandra] setting host at 10.129.45.34 DOWN, context: ngx.timer
2020/09/03 07:17:46 [crit] 163#0: *31629856 [lua] init.lua:298: [cluster_events] no 'at' in shm, polling events from: 1599117456.475, context: ngx.timer
2020/09/03 07:17:49 [warn] 163#0: *31629856 [lua] cluster.lua:182: set_peer_down(): [lua-cassandra] setting host at 10.128.52.203 DOWN, context: ngx.timer
2020/09/03 07:17:49 [error] 163#0: *31629856 [lua] init.lua:400: [cluster_events] failed to poll: failed to retrieve events from DB: all hosts tried for query failed. 10.129.63.86: host still considered down for 10.91s (last error: no route to host). 10.129.45.34: host still considered down for 60s (last error: no route to host). 10.128.52.203: host seems unhealthy, considering it down (no route to host), context: ngx.timer
@hutchic can you check the latest logs and guide us on what went wrong here?
@hutchic the required info has been updated; can someone check this and give us an update?
@Tieske could you please check this or assign it to the relevant team? @hutchic has labelled this issue as pending on author, yet no one is looking into it even after we provided sufficient logs.
@kikito could you check this?
OK, so we're hitting https://github.com/Kong/kong/blob/254deec3cceef78654a1cba6eb32798c417e993a/kong/db/strategies/cassandra/connector.lua#L275, where err is:
[cassandra] failed to refresh cluster topology: failed to acquire refresh lock: timeout (ver_refresh=18830), context: ngx.timer
which comes from https://github.com/thibaultcha/lua-cassandra/blob/master/lib/resty/cassandra/cluster.lua#L563.
Is it possible that acquiring the lock on that Cassandra shared dict times out because the existing contact points are unavailable?
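For context on where a "failed to acquire refresh lock: timeout" can come from in general, here is a simplified sketch of the usual OpenResty pattern of locking on a shared dict. This is only an illustration of the pattern, not lua-cassandra's actual code; the dict name is taken from this thread and the key name is made up:

```lua
-- Simplified illustration only (not lua-cassandra's implementation): a worker
-- takes a lock backed by a lua_shared_dict before refreshing the topology; if
-- another worker holds it longer than `timeout`, lock() fails with "timeout".
local resty_lock = require "resty.lock"

local lock, err = resty_lock:new("kong_cassandra", {
  timeout = 5,  -- seconds to wait for the lock before giving up
  exptime = 30, -- the lock auto-expires after 30s even if never unlocked
})
if not lock then
  ngx.log(ngx.ERR, "failed to create lock: ", err)
  return
end

local elapsed, err = lock:lock("topology_refresh") -- hypothetical key name
if not elapsed then
  -- err == "timeout" is the symptom seen in the logs above
  ngx.log(ngx.ERR, "failed to acquire refresh lock: ", err)
  return
end

-- ... contact the known peers and refresh the cluster topology here ...

local ok, err = lock:unlock()
if not ok then
  ngx.log(ngx.ERR, "failed to release refresh lock: ", err)
end
```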
Hi there,
To me, the issue here seems to be that by restarting all of your Cassandra nodes at once, you are also invalidating the IP address of the only contact point you are providing Kong with, i.e. ie-kong-db. Kong resolves the C* contact points before instantiating the C* cluster object. This means that, currently, your contact points' IP addresses must not change once Kong has started. The addresses of other nodes in the C* cluster can change, but not those of all contact points.
This is originally due to underlying OpenResty limitations with regard to resolving hostnames; see this note in cassandra/connector.lua. Underlying OpenResty libraries such as lua-cassandra must generally accept already-resolved hostnames, since they can never assume that a DNS resolver is available.
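As a rough sketch of what that means in practice (placeholder names and IPs, not Kong's actual bootstrap code): lua-cassandra receives a list of already-resolved addresses and only ever talks to those, or to peers it discovered from them.

```lua
-- Rough sketch with placeholder values: the contact points handed to
-- lua-cassandra are IPs resolved once at startup; the hostname itself is
-- never re-resolved afterwards.
local Cluster = require "resty.cassandra.cluster"

-- Imagine "ie-kong-db.<namespace>.svc" resolved at Kong startup to:
local resolved_contact_points = { "10.129.45.172", "10.131.28.98", "10.131.45.180" }

local cluster, err = Cluster.new {
  shm = "kong_cassandra",                   -- lua_shared_dict storing peer/topology state
  contact_points = resolved_contact_points, -- already-resolved IPs
  keyspace = "kong",
}
if not cluster then
  ngx.log(ngx.ERR, "could not create cluster: ", err)
  return
end

-- A topology refresh only contacts peers it already knows by IP, so if every
-- one of those IPs has been recycled by Kubernetes, the refresh cannot recover.
local ok, err = cluster:refresh()
if not ok then
  ngx.log(ngx.ERR, "refresh failed: ", err)
end
```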
kong reload does not help because it only restarts the Nginx workers, while the DB singleton is initialized in the master process. For the worker processes to fork() with a new DB singleton, Kong needs to be restarted entirely at the moment. kong reload also does not clear the kong_cassandra shared memory zone, so the cluster refresh lock is probably still in there and prevents new workers from acquiring it until the lock expires (30 seconds, currently not configurable; but this would not help here anyway).
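A tiny illustration of that shared-memory point (just a demonstration of lua_shared_dict behaviour with a made-up key, not something to run against a production node):

```lua
-- Data stored in a lua_shared_dict lives in shared memory owned by the master
-- process, so it survives `kong reload`, which only re-forks the workers.
local shm = ngx.shared.kong_cassandra

shm:set("example_key", "still here after a reload", 0) -- 0 = never expires

-- After `kong reload`, a freshly forked worker still sees the value:
ngx.log(ngx.NOTICE, "value: ", shm:get("example_key"))
```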
For now, I'd suggest specifying more than one contact point, and restarting Kong (rather than running kong reload) so that changes to cassandra_contact_points can take effect.
@thibaultcha thanks for responding. We can't pass multiple contact points, as Cassandra is already set up in cluster mode, and restarting that service changes the pod/container IP addresses as part of normal Kubernetes/OpenShift behaviour.
If I understand correctly, this issue can be resolved with some changes in Kong to support new contact points after kong reload.
@thibaultcha are there any tentative dates for when this feature will be available in open source?
@thibaultcha it would be helpful if someone could provide an update on the resolution of this issue.
cc @guanlan @dndx
@guanlan @dndx can someone share an update on this issue?
@thibaultcha @guanlan @dndx can someone share an update on this issue?
Summary
Kong is connected to a Cassandra cluster in a Kubernetes environment using the service name. As long as the nodes are up and running, Kong is able to connect to the database, and when the nodes are down, Kong reports that the database is not reachable, which is expected.
But Kong keeps trying to connect to the old set of IPs when the Cassandra cluster comes back with the same service name but a different set of IPs. Because of this, Kong is under the impression that Cassandra is down.
Tried kong reload after the Cassandra cluster is up and running, but Kong does not connect to the new IPs and keeps failing with this error.
Kong version: 2.0.4
Steps to reproduce:
1. Start Kong backed by a Cassandra DB cluster set up in a Kubernetes environment.
From the Kong logs:
A few key configs/params from the above logs:
2. Take the Cassandra cluster down / stop it.
3. Restart the Cassandra cluster; Kubernetes allocates the pods new IPs.
4. Wait for cassandra_refresh_frequency = 60 for Kong to pick up the new set of IPs, but nothing happens; Kong keeps trying the old Cassandra IPs (see the sketch after these steps).
5. So ran kong reload to clear the cache and connect to the new IPs.
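For reference, here is a hedged, conceptual sketch of what the periodic refresh in step 4 amounts to (placeholder IPs, not Kong's actual timer code): every cassandra_refresh_frequency seconds a timer asks lua-cassandra to refresh its topology, but the refresh can only reach peers whose IPs it already holds.

```lua
-- Conceptual sketch only (placeholder IPs): a timer fires every 60s and asks
-- lua-cassandra to refresh, but it can only reach peers it already knows by IP.
local Cluster = require "resty.cassandra.cluster"

local cluster = assert(Cluster.new {
  shm = "kong_cassandra",
  contact_points = { "10.129.45.34", "10.128.52.203" }, -- stale IPs after the restart
  keyspace = "kong",
})

local ok, err = ngx.timer.every(60, function(premature)
  if premature then
    return
  end
  local ok, err = cluster:refresh()
  if not ok then
    -- With every known IP recycled, this keeps failing, matching the
    -- "failed to refresh cluster topology" errors above.
    ngx.log(ngx.ERR, "refresh failed: ", err)
  end
end)
if not ok then
  ngx.log(ngx.ERR, "failed to create refresh timer: ", err)
end
```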
A similar issue was reported a couple of years back in https://github.com/Kong/kong/issues/2674,
and per the discussion in https://github.com/Kong/kong/pull/3752, it was resolved in https://github.com/Kong/kong/pull/5071/files.
Can someone please check this and let us know if anything is missing on our end.
Thanks!