@thibaultcha @hishamhm @bungle can someone please check and help us resolve this issue?
@darrenjennings @daviesf1 @deirdre-anderson @DMarby notifying a few more people so someone can help us here.
Could you exec into the Kong node and do a dig (or equivalent) on the Cassandra service name to check that the new IPs are resolving correctly?
@hutchic the connection from the Kong pod/node to the Cassandra cluster DB is resolvable via the service name even after the IPs change.
wget ie-kong-db.******.svc
--2020-09-03 07:14:53-- http://ie-kong-db.*****.svc/
Resolving ie-kong-db.****.svc (ie-kong-db.****.svc)... 10.129.45.172, 10.131.28.98, 10.131.45.180
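For completeness, wget/dig goes through the system resolver; below is a minimal sketch (placeholder nameserver IP and namespace, run from an OpenResty context such as a timer or test location) of checking the same name from the OpenResty/Lua side with lua-resty-dns:

```lua
-- Minimal sketch with assumed placeholder values: query the service name via
-- lua-resty-dns, which is how name resolution works inside OpenResty, rather
-- than through the system resolver used by wget/dig.
local resolver = require "resty.dns.resolver"

local r, err = resolver:new{
  nameservers = { "10.96.0.10" }, -- placeholder: your cluster DNS service IP
  retrans = 3,
  timeout = 2000, -- ms
}
if not r then
  ngx.log(ngx.ERR, "failed to create resolver: ", err)
  return
end

local answers, err = r:query("ie-kong-db.my-namespace.svc", { qtype = r.TYPE_A })
if not answers then
  ngx.log(ngx.ERR, "failed to query: ", err)
  return
end
if answers.errcode then
  ngx.log(ngx.ERR, "DNS error ", answers.errcode, ": ", answers.errstr)
  return
end

for _, ans in ipairs(answers) do
  ngx.log(ngx.NOTICE, "resolved to: ", ans.address or ans.cname)
end
```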
I am suspecting two things here:
1. kong reload
2. [cassandra] failed to refresh cluster topology: failed to acquire refresh lock: timeout (ver_refresh=18830), context: ngx.timer
2020/09/03 07:17:43 [error] 163#0: *31626769 [lua] connector.lua:275: [cassandra] failed to refresh cluster topology: failed to acquire refresh lock: timeout (ver_refresh=18830), context: ngx.timer
2020/09/03 07:17:46 [warn] 163#0: *31629856 [lua] cluster.lua:182: set_peer_down(): [lua-cassandra] setting host at 10.129.45.34 DOWN, context: ngx.timer
2020/09/03 07:17:46 [crit] 163#0: *31629856 [lua] init.lua:298: [cluster_events] no 'at' in shm, polling events from: 1599117456.475, context: ngx.timer
2020/09/03 07:17:49 [warn] 163#0: *31629856 [lua] cluster.lua:182: set_peer_down(): [lua-cassandra] setting host at 10.128.52.203 DOWN, context: ngx.timer
2020/09/03 07:17:49 [error] 163#0: *31629856 [lua] init.lua:400: [cluster_events] failed to poll: failed to retrieve events from DB: all hosts tried for query failed. 10.129.63.86: host still considered down for 10.91s (last error: no route to host). 10.129.45.34: host still considered down for 60s (last error: no route to host). 10.128.52.203: host seems unhealthy, considering it down (no route to host), context: ngx.timer
@hutchic can you check the latest logs and guide us on what went wrong here?
@hutchic the required info has been updated; can someone check this and give us an update?
@Tieske could you please check this or assign it to the relevant team? @hutchic has labelled this issue as pending on author, yet no one is looking into it even after we provided sufficient logs.
@kikito could you check this?
OK, so we're hitting https://github.com/Kong/kong/blob/254deec3cceef78654a1cba6eb32798c417e993a/kong/db/strategies/cassandra/connector.lua#L275, where err is:
[cassandra] failed to refresh cluster topology: failed to acquire refresh lock: timeout (ver_refresh=18830), context: ngx.timer
which comes from https://github.com/thibaultcha/lua-cassandra/blob/master/lib/resty/cassandra/cluster.lua#L563.
Is it possible that acquiring the lock on that Cassandra shared dict times out because the existing contact points are unavailable?
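For context on where a "failed to acquire refresh lock: timeout" can come from in general, here is a simplified sketch of the usual OpenResty pattern of locking on a shared dict. This is only an illustration of the pattern, not lua-cassandra's actual code; the dict name is taken from this thread and the key name is made up:

```lua
-- Simplified illustration only (not lua-cassandra's implementation): a worker
-- takes a lock backed by a lua_shared_dict before refreshing the topology; if
-- another worker holds it longer than `timeout`, lock() fails with "timeout".
local resty_lock = require "resty.lock"

local lock, err = resty_lock:new("kong_cassandra", {
  timeout = 5,  -- seconds to wait for the lock before giving up
  exptime = 30, -- the lock auto-expires after 30s even if never unlocked
})
if not lock then
  ngx.log(ngx.ERR, "failed to create lock: ", err)
  return
end

local elapsed, err = lock:lock("topology_refresh") -- hypothetical key name
if not elapsed then
  -- err == "timeout" is the symptom seen in the logs above
  ngx.log(ngx.ERR, "failed to acquire refresh lock: ", err)
  return
end

-- ... contact the known peers and refresh the cluster topology here ...

local ok, err = lock:unlock()
if not ok then
  ngx.log(ngx.ERR, "failed to release refresh lock: ", err)
end
```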
Hi there,
To me, the issue here seems to be that by restarting all of your Cassandra nodes at once, you are also invalidating the IP address of the only contact point you are providing Kong with, i.e. ie-kong-db. Kong resolves the C* contact points before instantiating the C* cluster object. This means that, currently, your contact points' IP addresses must not change once Kong has started. The addresses of other nodes in the C* cluster can change, but not those of all contact points.
This is originally due to underlying OpenResty limitations with regard to resolving hostnames; see this note in cassandra/connector.lua. Underlying OpenResty libraries such as lua-cassandra must generally accept already-resolved hostnames, since they can never assume that a DNS resolver is available.
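As a rough sketch of what that means in practice (placeholder names and IPs, not Kong's actual bootstrap code): lua-cassandra receives a list of already-resolved addresses and only ever talks to those, or to peers it discovered from them.

```lua
-- Rough sketch with placeholder values: the contact points handed to
-- lua-cassandra are IPs resolved once at startup; the hostname itself is
-- never re-resolved afterwards.
local Cluster = require "resty.cassandra.cluster"

-- Imagine "ie-kong-db.<namespace>.svc" resolved at Kong startup to:
local resolved_contact_points = { "10.129.45.172", "10.131.28.98", "10.131.45.180" }

local cluster, err = Cluster.new {
  shm = "kong_cassandra",                   -- lua_shared_dict storing peer/topology state
  contact_points = resolved_contact_points, -- already-resolved IPs
  keyspace = "kong",
}
if not cluster then
  ngx.log(ngx.ERR, "could not create cluster: ", err)
  return
end

-- A topology refresh only contacts peers it already knows by IP, so if every
-- one of those IPs has been recycled by Kubernetes, the refresh cannot recover.
local ok, err = cluster:refresh()
if not ok then
  ngx.log(ngx.ERR, "refresh failed: ", err)
end
```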
kong reload does not help because it only restarts the Nginx workers, while the DB singleton is initialized in the master process. For the worker processes to fork() with a new DB singleton, Kong needs to be restarted entirely at the moment. kong reload also does not clear the kong_cassandra shared memory zone, so the cluster refresh lock is probably still in there and prevents new workers from acquiring it until the lock expires (30 seconds, currently not configurable; but this would not help here anyway).
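A tiny illustration of that shared-memory point (just a demonstration of lua_shared_dict behaviour with a made-up key, not something to run against a production node):

```lua
-- Data stored in a lua_shared_dict lives in shared memory owned by the master
-- process, so it survives `kong reload`, which only re-forks the workers.
local shm = ngx.shared.kong_cassandra

shm:set("example_key", "still here after a reload", 0) -- 0 = never expires

-- After `kong reload`, a freshly forked worker still sees the value:
ngx.log(ngx.NOTICE, "value: ", shm:get("example_key"))
```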
For now, I'd suggest specifying more than one contact point, and restarting Kong (rather than running kong reload) so that changes to cassandra_contact_points can take effect.
@thibaultcha thanks for responding. We can't pass multiple contact points, as Cassandra is already set up in cluster mode, and restarting that service changes the pod/container IP addresses as part of normal Kubernetes/OpenShift behaviour.
If I understand correctly, this issue can be resolved with some changes in Kong to support new contact points after kong reload.
@thibaultcha are there any tentative dates for when this feature will be available in open source?
@thibaultcha it would be helpful if someone could provide an update on the resolution of this issue.
cc @guanlan @dndx
@guanlan @dndx can someone share an update on this issue?
@thibaultcha @guanlan @dndx can someone share an update on this issue?
Summary
Kong is connected to a Cassandra cluster in a Kubernetes environment using the service name. As long as the nodes are up and running, Kong is able to connect to the database, and when the nodes are down, Kong reports that the database is not reachable, which is expected.
But Kong keeps trying to connect to the old set of IPs when the Cassandra cluster comes back with the same service name but a different set of IPs. Because of this, Kong is under the impression that Cassandra is down.
Tried kong reload after the Cassandra cluster is up and running, but Kong does not connect to the new IPs and keeps failing with this error.
Kong version: 2.0.4
Steps to reproduce:
1. Start Kong backed by a Cassandra DB cluster set up in a Kubernetes environment.
From the Kong logs:
A few key configs/params from the above logs:
2. Take the Cassandra cluster down / stop it.
3. Restart the Cassandra cluster; Kubernetes allocates the pods new IPs.
4. Wait for cassandra_refresh_frequency = 60 for Kong to pick up the new set of IPs, but nothing happens; Kong keeps trying the old Cassandra IPs (see the sketch after these steps).
5. So ran kong reload to clear the cache and connect to the new IPs.
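For reference, here is a hedged, conceptual sketch of what the periodic refresh in step 4 amounts to (placeholder IPs, not Kong's actual timer code): every cassandra_refresh_frequency seconds a timer asks lua-cassandra to refresh its topology, but the refresh can only reach peers whose IPs it already holds.

```lua
-- Conceptual sketch only (placeholder IPs): a timer fires every 60s and asks
-- lua-cassandra to refresh, but it can only reach peers it already knows by IP.
local Cluster = require "resty.cassandra.cluster"

local cluster = assert(Cluster.new {
  shm = "kong_cassandra",
  contact_points = { "10.129.45.34", "10.128.52.203" }, -- stale IPs after the restart
  keyspace = "kong",
})

local ok, err = ngx.timer.every(60, function(premature)
  if premature then
    return
  end
  local ok, err = cluster:refresh()
  if not ok then
    -- With every known IP recycled, this keeps failing, matching the
    -- "failed to refresh cluster topology" errors above.
    ngx.log(ngx.ERR, "refresh failed: ", err)
  end
end)
if not ok then
  ngx.log(ngx.ERR, "failed to create refresh timer: ", err)
end
```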
A similar issue was reported a couple of years back in https://github.com/Kong/kong/issues/2674,
and per the discussion in https://github.com/Kong/kong/pull/3752, it was resolved in https://github.com/Kong/kong/pull/5071/files.
Can someone please check this and let us know if anything is missing on our end.
Thanks!