Thank you for the detailed report as always! We'll be looking into it.
Yeah, I think it would have worked fine if the finish had not died reaching schema consensus and had continued processing. But it looks like that can't always be the case for us, so it can't be assumed in code. I tried the same thing in another of our prod environments and the same errors happened there too. I also checked the C* nodes for any schema mismatches or cluster health discrepancies, but nothing shows up there either. So this may point to C* issues as well, alongside Kong not being able to handle all situations (such as re-running a migrations finish because the first execution didn't work out).
root@server05502:/root
# nodetool describecluster
Cluster Information:
Name: Stargate
Snitch: org.apache.cassandra.locator.GossipingPropertyFileSnitch
DynamicEndPointSnitch: enabled
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
12b6ca53-8e84-3a66-8793-7ce091b5dcca: [10.204.90.234, 10.86.173.32, 10.204.90.235, 10.106.184.179, 10.106.184.193, 10.87.52.252]
root@server05502:/root
# nodetool status
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.204.90.234 113.89 MiB 256 100.0% 5be9bdf7-1337-42ac-8177-341044a4758b RACK2
UN 10.86.173.32 117.4 MiB 256 100.0% 0f8d693c-4d99-453b-826e-6e5c68538037 RACK6
UN 10.204.90.235 335.54 MiB 256 100.0% 8e3192a7-e60c-41fb-8ca6-ce809440d982 RACK4
Datacenter: DC2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.106.184.179 117.48 MiB 256 100.0% 33747998-70fd-42c4-87cb-c7c49a182c3e RACK1
UN 10.87.52.252 104.64 MiB 256 100.0% 74a60b30-c83c-4aa8-b3ae-7ccb16fb817f RACK5
UN 10.106.184.193 107.06 MiB 256 100.0% 61320d76-aa12-43f0-9684-b1413e57c90d RACK3
So that wasn't any help. I wonder if there are any analytic health checks I can run on the cloned keyspace itself to check for weirdness or mismatches too.
Edit - Ran the same commands above from another node in the other DC and they checked out fine too. 😢 So no smoking gun there as far as why C* could be the culprit for some of this. Regardless, I hope Kong makes the finish safe to execute multiple times while still ending up with a single ws_id that all the tables use and stay in sync on.
It certainly feels like some kind of eventual consistency issue. We now insert the workspace only after doing a read-before-write (AFAIK), but perhaps that read returns nil while the write completes eventually, making us do the write again. And as the default ws id can be different, we cannot just do an update.
Is it not possible to add logic to the finish step of the migrations process so that, if the workspaces table already has a default record in it, that record is used to continue processing? Or is the reservation that this could break other enterprise use cases?
Or maybe that's exactly what you meant here?
We now insert the workspace only after doing a read-before-write (AFAIK), but perhaps that read returns nil while the write completes eventually
But I imagine independent executions of the finish command don't check whether a ws was already generated; based on the behavior I have seen, each one probably just generates a fresh one and drops it in?
Maybe the fact that we do full replication and local quorum r/w, and all our Kong C* settings, play a role in this problem, but we do so because it's a must to ensure the OAuth2.0 bearer token plugin functions correctly for us.
To debug the behavior even if you can't reproduce this against your own localhost C* cluster:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to wait for schema consensus: [Read failure] Operation failed - received 0 responses and 1 failures
I imagine you can set manual breakpoints in the code to reproduce the re-entrant injection of ws_ids when finish runs multiple times and stops at a certain portion of its execution, like it did for us in the situations above.
Or maybe that's exactly what you meant here?
Yes, that's what we try to do. During finish, we first try to read the default workspace value and use it, and if we get nothing, we create it. It looks like the SELECT is returning an empty response.
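For anyone following along, that read-before-write boils down to something like the sketch below. This is only an illustration under assumptions (a lua-cassandra-style coordinator exposing execute(query, args), and a hypothetical uuid_fn generator), not Kong's actual migration code:

local cassandra = require "cassandra"

-- Minimal sketch of a read-before-write for the default workspace.
-- `coordinator` is assumed to expose execute(query, args) in the style of
-- lua-cassandra; `uuid_fn` is a hypothetical UUID generator passed in.
local function ensure_default_workspace(coordinator, uuid_fn)
  local rows, err = coordinator:execute(
    "SELECT id FROM workspaces WHERE name = 'default' ALLOW FILTERING")
  if err then
    return nil, err
  end

  if rows and rows[1] then
    return rows[1].id           -- reuse the existing default workspace
  end

  -- nothing came back (possibly a stale read), so insert a fresh one
  local ws_id = uuid_fn()
  local _, ins_err = coordinator:execute(
    "INSERT INTO workspaces (id, name) VALUES (?, 'default')",
    { cassandra.uuid(ws_id) })
  if ins_err then
    return nil, ins_err
  end

  return ws_id
end

The failure mode being discussed is that SELECT coming back empty from a stale replica, which sends this kind of logic down the INSERT path and mints a second default workspace id.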
Maybe the fact that we do full replication and local quorum r/w, and all our Kong C* settings, play a role in this problem, but we do so because it's a must to ensure the OAuth2.0 bearer token plugin functions correctly for us.
If anything, that should have brought stronger consistency, right?
Puzzling.
I'm wondering if chasing after this isn't a more fruitful avenue:
Also find it interesting how I set the C* schema consensus timeouts long but it bounces back instantly saying consensus not reached? Ex:
2020/09/28 17:23:29 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/09/28 17:23:29 [verbose] Cassandra schema consensus: not reached
Hmm, if it were an eventual consistency problem, though, the time between executions of finish should rule that out :/ ... Pretty sure I could wait 10 mins between each run of finish and it would still result in new ws_ids being entered, meaning that first select must always be failing somehow for us... odd.
And yeah, one obvious red flag to me is that we set the schema consensus timeout huge but it does not seem to be honored. It either snaps back right away due to an underlying error (that's not printed and no details are given), or it just straight up won't wait for the given time and the wait is being derived from some other value, either a default or something hardcoded.
I believe that it fails here: https://github.com/thibaultcha/lua-cassandra/blob/master/lib/resty/cassandra/cluster.lua#L755-L764
Which means either of these fail: https://github.com/thibaultcha/lua-cassandra/blob/master/lib/resty/cassandra/cluster.lua#L726-L730
It is busy looping, but it seems like it fails quite quickly on either of those queries.
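For context, the consensus check in that library amounts to reading schema_version from system.local and system.peers through the coordinator and comparing them. A rough, illustrative sketch of the idea (not the library's exact internals):

-- Rough sketch of a schema consensus check, in the spirit of
-- lua-cassandra's cluster:check_schema_consensus(): the local node's
-- schema_version must match every peer's schema_version.
-- `coordinator` is assumed to expose execute(query) returning rows, err.
local function have_schema_consensus(coordinator)
  local local_rows, err = coordinator:execute(
    "SELECT schema_version FROM system.local")
  if not local_rows then
    return nil, err             -- e.g. the "[Read failure] ..." surfaces here
  end

  local peer_rows, perr = coordinator:execute(
    "SELECT schema_version FROM system.peers")
  if not peer_rows then
    return nil, perr
  end

  local local_version = local_rows[1] and local_rows[1].schema_version
  for _, peer in ipairs(peer_rows) do
    if peer.schema_version ~= local_version then
      return false              -- at least one peer disagrees: no consensus yet
    end
  end

  return true
end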
As a workaround, one thing we could do is try to reach schema consensus more than once. But as you said, even if you waited you could still get the issue. I am also thinking that the Cassandra indexes could lie, or that some of the queries are not executed at the right consistency level. Nodes going down while migrating could also explain it, but you got it 4 times in a row, so...
Can you execute those queries:
SELECT schema_version FROM system.local
-- AND
SELECT schema_version FROM system.peers
and share the data when this happens (or is about to happen)?
Feels like this has been discussed on several occasions: https://github.com/Kong/kong/issues/4229 https://github.com/Kong/kong/issues/4228 https://discuss.konghq.com/t/kong-migrations-against-a-cassandra-cluster-fails/2660
I left them here just as pointers.
@jeremyjpj0916, could it be a tombstone-related issue: https://stackoverflow.com/a/37119894
Okay, it might be that we are talking about two different issues.
We might want to ensure that these: https://github.com/Kong/kong/blob/master/kong/db/migrations/operations/200_to_210.lua#L24 https://github.com/Kong/kong/blob/master/kong/db/migrations/operations/200_to_210.lua#L43 https://github.com/Kong/kong/blob/master/kong/db/migrations/operations/212_to_213.lua#L24 https://github.com/Kong/kong/blob/master/kong/db/migrations/operations/212_to_213.lua#L43
run on stricter consistency. At the moment we are not sure if changing the consistency level fixes it, but it is a good guess.
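To picture what "stricter consistency" could look like on the client side: lua-cassandra accepts a per-query consistency option, so a hedged sketch (illustrative names only, not the actual proposed change) would be:

local cassandra = require "cassandra"

-- Sketch: issue the workspace lookup at LOCAL_QUORUM instead of the
-- driver default, so a single stale replica cannot answer alone.
-- `coordinator` is assumed to expose execute(query, args, options)
-- as the lua-cassandra cluster/peer objects do.
local function select_default_workspace(coordinator)
  return coordinator:execute(
    "SELECT id, name FROM workspaces WHERE name = 'default'",
    nil,
    { consistency = cassandra.consistencies.local_quorum })
end

With LOCAL_QUORUM a majority of replicas in the local DC must answer, so one lagging replica can no longer return an empty result on its own.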
It feels like it is failing here: https://github.com/Kong/kong/blob/master/kong/db/init.lua#L577
And Cassandra implements it here: https://github.com/Kong/kong/blob/master/kong/db/strategies/cassandra/connector.lua#L396-L415
For some reason it fails sometimes with:
[Read failure] Operation failed - received 0 responses and 1 failures
This error message is coming from Cassandra itself, and is forwarded from here: https://github.com/thibaultcha/lua-cassandra/blob/master/lib/resty/cassandra/cluster.lua#L726-L730
At the moment we don't fully understand why this error happens (tombstones?), but it seems to happen in a rather non-fatal place. We could retry it, e.g. keep trying for the full max_schema_consensus_wait (120000ms timeout), and not fail on the first [Read failure].
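A hedged sketch of that idea, where check_consensus stands in for whatever the real consensus check is (names and timings here are illustrative only):

-- Illustrative only: keep polling until the deadline instead of aborting
-- on the first transient "[Read failure]". `check_consensus` is a stand-in
-- that returns true, or nil plus an error string.
local function wait_for_schema_consensus(check_consensus, timeout_ms)
  local deadline = ngx.now() * 1000 + timeout_ms

  repeat
    local ok, err = check_consensus()
    if ok then
      return true
    end

    if err then
      -- treat read failures as transient and retry until the deadline
      ngx.log(ngx.DEBUG, "schema consensus check failed, retrying: ", err)
    end

    ngx.sleep(0.5) -- back off briefly between polls
  until ngx.now() * 1000 >= deadline

  return nil, "timeout waiting for schema consensus"
end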
@jeremyjpj0916 there is now #6411; would it be too much to ask for you to try that and report back whether it has any effect on your environment? That branch does not have much more than that compared to the latest 2.1.4, as can be seen here: https://github.com/Kong/kong/compare/2.1.4...fix/cassandra-migrations
Just tried the two commands on a given prod node where I had been testing these migrations on a cloned keyspace. Since this is outside the context of how Kong uses the C* Lua client lib tibo wrote, I'm not sure it helps, but it shows these commands do generally work in a vacuum:
Can you execute those queries:
SELECT schema_version FROM system.local
-- AND
SELECT schema_version FROM system.peers
root@server05502:/root
# cqlsh --ssl
Connected to Stargate at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.2 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
****@cqlsh> SELECT schema_version FROM system.local;
schema_version
--------------------------------------
12b6ca53-8e84-3a66-8793-7ce091b5dcca
(1 rows)
****@cqlsh> SELECT schema_version FROM system.peers;
schema_version
--------------------------------------
12b6ca53-8e84-3a66-8793-7ce091b5dcca
12b6ca53-8e84-3a66-8793-7ce091b5dcca
12b6ca53-8e84-3a66-8793-7ce091b5dcca
12b6ca53-8e84-3a66-8793-7ce091b5dcca
12b6ca53-8e84-3a66-8793-7ce091b5dcca
(5 rows)
****@cqlsh>
As for tombstones, I remember increasing the tombstone failure threshold too, and also making the GC policy more aggressive than the default:
ALTER TABLE kong_stage3.ratelimiting_metrics WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.locks WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.oauth2_tokens WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.locks WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.snis WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.plugins WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.targets WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.consumers WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.upstreams WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.schema_meta WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.response_ratelimiting_metrics WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.tags WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.acls WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.cluster_ca WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.cluster_events WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.jwt_secrets WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.apis WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.basicauth_credentials WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.keyauth_credentials WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.oauth2_credentials WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.certificates WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.services WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.oauth2_authorization_codes WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.routes WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.hmacauth_credentials WITH GC_GRACE_SECONDS = 3600;
Which is 1 hr vs the default 10 days for GC grace. I have forgotten whether the default_time_to_live
field also has an impact when the GC grace is set shorter than it, though. Are you familiar with that? I may need to google around on it if it's not optimal currently either.
Then the fail threshold was set to:
# When executing a scan, within or across a partition, we need to keep the
# tombstones seen in memory so we can return them to the coordinator, which
# will use them to make sure other replicas also know about the deleted rows.
# With workloads that generate a lot of tombstones, this can cause performance
# problems and even exhaust the server heap.
# (http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets)
# Adjust the thresholds here if you understand the dangers and want to
# scan more tombstones anyway. These thresholds may also be adjusted at runtime
# using the StorageService mbean.
tombstone_warn_threshold: 10000
tombstone_failure_threshold: 500000
vs the default of 100,000
What would the migrations be querying each time that would cause that, though? Does the cluster_events table get hit each time? I know the oauth2_tokens table (probably our largest table in terms of records) probably does not get hit on every migrations finish command, especially after it has already been passed through in the list. Could the oauth2_tokens table accumulating millions of tombstones within an hour be causing it? Or the cluster_events table, for that matter? Unsure... not sure how I could tune it more aggressively than that short of setting GC grace to 0, which is not recommended in C* cluster environments because if a node goes down you can start getting "ghost" records, since proper consensus can't be achieved via the tombstone info. Maybe for tables like oauth2_tokens or cluster_events that matters less, because the rows have timestamp values that let Kong decide their validity regardless of whether they show up in a query, so maybe 0 isn't so bad for some of the more intensely written-to tables.
@bungle More than happy to try your latest PR today; I likely won't have results until later this evening, in about 10 or so hours, due to some restrictions on when I can touch some of our hosts for this testing. I will get the PR's files dropped in so the modified Kong is ready to go by then though. Fingers crossed it yields some success.
Appreciate the work you put in trying to resolve this scenario.
Also did a few queries against our live C* db:
****@cqlsh:kong_prod> select count(*) from cluster_events;
count
-------
87271
(1 rows)
Warnings :
Aggregation query used without partition key
Read 100 live rows and 89643 tombstone cells for query SELECT * FROM kong_prod.cluster_events LIMIT 100 (see tombstone_warn_threshold)
Lotta tombstones here for sure.
As for querying the oauth2_tokens table from one of the nodes via cqlsh... uhh, that's probably not good:
****@cqlsh:kong_prod> select count(*) from oauth2_tokens;
OperationTimedOut: errors={'127.0.0.1': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=127.0.0.1
Need a longer timeout just to figure things out I suppose.
Honestly pondering whether not tombstoning the cluster_events and oauth2 tokens at all is the play... since that kind of data should not really need strict consensus (a little mismatched data or a temporary ghost record is okay, right?), given Kong can see timestamps and valid tokens etc. cluster_events may be more important, but I am not sure... pretty sure most of those events are 1-to-1 with oauth token-related invalidations.
Yep, a little select confirms most of those entries are exactly that:
*****@cqlsh:kong_prod> select * from cluster_events limit 10;
channel | at | node_id | id | data | nbf
---------------+---------------------------------+--------------------------------------+--------------------------------------+--------------------------------------------------------+---------------------------------
invalidations | 2020-10-01 19:44:10.123000+0000 | 3ec49684-1e0b-45ac-a13b-33135cc09164 | 6de48d36-18cd-41e9-b431-6f24f0a507f0 | oauth2_tokens:XXX | 2020-10-01 19:44:15.111000+0000
invalidations | 2020-10-01 19:44:10.153000+0000 | 3ec49684-1e0b-45ac-a13b-33135cc09164 | c683d31f-c001-4c45-80ba-47b95780f3f4 | oauth2_tokens:XXX | 2020-10-01 19:44:15.152000+0000
invalidations | 2020-10-01 19:44:10.167000+0000 | 7dccf201-351a-402e-a33f-f1f291063f87 | 407b95f7-05eb-47f8-bc25-689c3065b9bf | oauth2_tokens:XXX | 2020-10-01 19:44:15.167000+0000
invalidations | 2020-10-01 19:44:10.223000+0000 | 21881b66-c99a-470c-ab76-76dcd7fbbc90 | 86180cd0-07ea-43a0-9189-bc63452a817d | oauth2_tokens:XXX | 2020-10-01 19:44:15.222000+0000
invalidations | 2020-10-01 19:44:10.230000+0000 | 7456814f-9464-4602-9a1e-7e2b55e4c4b2 | 1aa7f89d-a98c-4a3b-a62f-cbe8458d2b12 | oauth2_tokens:XXX | 2020-10-01 19:44:15.229000+0000
invalidations | 2020-10-01 19:44:10.256000+0000 | 21881b66-c99a-470c-ab76-76dcd7fbbc90 | 6b5c3f94-db6b-4103-95c1-95b4f1de2700 | oauth2_tokens:XXX | 2020-10-01 19:44:15.244000+0000
..... continued ...
Probably not totally related, but the large number of invalidation events due to oauth2 does start to clog up the cluster_events table in general:
I would think oauth2 client credential grant type tokens don't ever really need an "invalidation" event? They are generated and valid for a given TTL, unless you can revoke them early (is that even a thing in the oauth2 CC pattern - does Kong support revoking client credential bearer tokens? We personally have never used such a feature). I think nodes can query the tokens table and know what's valid/invalid based on the token generation timestamps in the table?? Isn't it just a /token
endpoint producing valid tokens when the right client_id, client_secret, and grant type are provided for oauth2 CC?
Not super familiar with it either, but the nodetool cfstats details on that oauth2_tokens table look like this in one prod env:
Table: oauth2_tokens
SSTable count: 4
Space used (live): 153678228
Space used (total): 153678228
Space used by snapshots (total): 0
Off heap memory used (total): 2694448
SSTable Compression Ratio: 0.5890186627210159
Number of partitions (estimate): 1614912
Memtable cell count: 5138
Memtable data size: 1268839
Memtable off heap memory used: 0
Memtable switch count: 1288
Local read count: 24577751
Local read latency: 0.067 ms
Local write count: 53152217
Local write latency: 0.109 ms
Pending flushes: 0
Percent repaired: 70.18
Bloom filter false positives: 1
Bloom filter false ratio: 0.00000
Bloom filter space used: 2307288
Bloom filter off heap memory used: 2307256
Index summary off heap memory used: 363776
Compression metadata off heap memory used: 23416
Compacted partition minimum bytes: 43
Compacted partition maximum bytes: 179
Compacted partition mean bytes: 128
Average live cells per slice (last five minutes): 1.0
Maximum live cells per slice (last five minutes): 1
Average tombstones per slice (last five minutes): 1.984081041968162
Maximum tombstones per slice (last five minutes): 4
Dropped Mutations: 2052
This is mostly just informational, to give an idea of what it's like running things like the oauth2 plugin in a well-used environment.
@jeremyjpj0916 good feedback as always. I will create cards for us to improve the oauth2 situation. We have two lines in the code:
kong.db.oauth2_authorization_codes:delete({ id = auth_code.id })
and
kong.db.oauth2_tokens:delete({ id = token.id })
There are a couple of updates too. The invalidation event is for nodes to clear caches. But let's see if there is a better way.
@bungle
Files are all prepped; I will be cloning the keyspace again in a few hrs. I noticed you refactored all the core migrations.
Would others, such as the plugins from my initial migration logs, need a refactor too:
jwt: 003_200_to_210
acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
to gain the benefits you are hoping for on the core migration executions?
Such as the connector:query() -> coordinator:execute() change?
Asking this because in my first finish execution the core migrations seemed to go off without a problem, but it was the subsequent individual plugin migrations that seemed to get stuck, one by one, on later finish calls:
For reference again:
2020/09/28 17:19:58 [debug] pending migrations to finish:
core: 009_200_to_210, 010_210_to_211
jwt: 003_200_to_210
acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/09/28 17:19:58 [info] migrating core on keyspace 'kong_prod2'...
2020/09/28 17:19:58 [debug] running migration: 009_200_to_210
2020/09/28 17:20:12 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/09/28 17:20:12 [verbose] Cassandra schema consensus: reached
2020/09/28 17:20:12 [info] core migrated up to: 009_200_to_210 (executed)
2020/09/28 17:20:12 [debug] running migration: 010_210_to_211
2020/09/28 17:20:12 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/09/28 17:20:12 [verbose] Cassandra schema consensus: reached
2020/09/28 17:20:12 [info] core migrated up to: 010_210_to_211 (executed)
2020/09/28 17:20:12 [info] migrating jwt on keyspace 'kong_prod2'...
2020/09/28 17:20:12 [debug] running migration: 003_200_to_210
2020/09/28 17:20:14 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/09/28 17:20:14 [verbose] Cassandra schema consensus: not reached
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to wait for schema consensus: [Read failure] Operation failed - received 0 responses and 1 failures
stack traceback:
[C]: in function 'assert'
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
[C]: in function 'xpcall'
/usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
[C]: in function 'pcall'
/usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
/usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
[C]: in function 'xpcall'
/usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
/usr/bin/kong:9: in function 'file_gen'
init_worker_by_lua:50: in function <init_worker_by_lua:48>
[C]: in function 'xpcall'
init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
[C]: in function 'error'
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
/usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
[C]: in function 'xpcall'
/usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
/usr/bin/kong:9: in function 'file_gen'
init_worker_by_lua:50: in function <init_worker_by_lua:48>
[C]: in function 'xpcall'
init_worker_by_lua:57: in function <init_worker_by_lua:55>
Also worth noting some odd behavior when cloning the keyspaces: I go through and run a script that does a series of commands, and oddly the only table that throws an error is the oauth2 creds table:
error: Unknown column redirect_uri during deserialization
-- StackTrace --
java.lang.RuntimeException: Unknown column redirect_uri during deserialization
at org.apache.cassandra.db.SerializationHeader$Component.toHeader(SerializationHeader.java:321)
at org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:511)
at org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:385)
at org.apache.cassandra.db.ColumnFamilyStore.loadNewSSTables(ColumnFamilyStore.java:771)
at org.apache.cassandra.db.ColumnFamilyStore.loadNewSSTables(ColumnFamilyStore.java:709)
at org.apache.cassandra.service.StorageService.loadNewSSTables(StorageService.java:5110)
at sun.reflect.GeneratedMethodAccessor519.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:71)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:275)
at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:112)
at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:46)
at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:237)
at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:138)
at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:252)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:819)
at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:801)
at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1468)
at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:76)
at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1309)
at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1401)
at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:829)
at sun.reflect.GeneratedMethodAccessor547.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:357)
at sun.rmi.transport.Transport$1.run(Transport.java:200)
at sun.rmi.transport.Transport$1.run(Transport.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at sun.rmi.transport.Transport.serviceCall(Transport.java:196)
at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:573)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:834)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:688)
at java.security.AccessController.doPrivileged(Native Method)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:687)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
So the oauth2 creds don't refresh over right away when running the command below (the error above is what happens when I execute it):
ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} oauth2_credentials"
So I follow up with a COPY workaround (shown further below).
The original shell script doing the keyspace cloning that I am referring to:
#!/bin/bash
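# cloneKeyspace: clone one Cassandra keyspace into another by snapshotting the
# source tables on every node, moving the snapshot files into the destination
# keyspace's table directories, refreshing them, and finally repairing the clone.
# Invoked as: <script> <username> <start_host> <host1,host2,...> <keyspace_source> <keyspace_destination>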
cloneKeyspace()
{
local username="${1}"
local start_host="${2}"
local hostnames="${3}"
local keyspace_source="${4}"
local keyspace_destination="${5}"
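# Drop any previous copy of the destination keyspace before re-cloning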
if ssh ${username}@${start_host} "echo 2>&1"; then
ssh ${username}@${start_host} "cqlsh --ssl -e 'DROP KEYSPACE ${keyspace_destination}'"
fi
sleep 10
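# On every node, remove leftover data files for the destination keyspace and clear old "copy" snapshots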
for hostname in $(echo $hostnames | sed "s/,/ /g")
do
if ssh ${username}@${hostname} "echo 2>&1"; then
ssh ${username}@${hostname} "rm -rf /data/cassandra/data/${keyspace_destination}"
ssh ${username}@${hostname} "nodetool clearsnapshot -t copy"
fi
done
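# Recreate the destination keyspace schema by dumping the source keyspace DDL and renaming it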
if ssh ${username}@${start_host} "echo 2>&1"; then
ssh ${username}@${start_host} "cqlsh --ssl -e 'DESCRIBE KEYSPACE ${keyspace_source}' > ${keyspace_source}.txt"
ssh ${username}@${start_host} "sed -i 's/${keyspace_source}/${keyspace_destination}/g' ${keyspace_source}.txt"
ssh ${username}@${start_host} "cqlsh --ssl -f '${keyspace_source}.txt'"
ssh ${username}@${start_host} "rm ${keyspace_source}.txt"
echo "Success on Keyspace Schema clone: "${start_host}
else
echo "Failed on Keyspace Schema clone: "${start_host}
fi
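# On every node: snapshot the source keyspace, move the snapshot SSTables into the
# destination keyspace's table directories, then nodetool refresh each table so
# Cassandra loads the copied data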
for hostname in $(echo $hostnames | sed "s/,/ /g")
do
if ssh ${username}@${hostname} "echo 2>&1"; then
ssh ${username}@${hostname} "nodetool snapshot -t copy ${keyspace_source}"
echo "Success on Cassandra snapshot: "${hostname} ${keyspace_source}
sleep 20
ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/acls-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/acls-*/"
ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/apis-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/apis-*/"
ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/ca_certificates-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/ca_certificates-*/"
ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/certificates-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/certificates-*/"
ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/cluster_events-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/cluster_events-*/"
ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/consumers-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/consumers-*/"
ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/jwt_secrets-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/jwt_secrets-*/"
ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/keyauth_credentials-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/keyauth_credentials-*/"
ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/locks-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/locks-*/"
ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/oauth2_authorization_codes-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/oauth2_authorization_codes-*/"
ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/oauth2_credentials-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/oauth2_credentials-*/"
ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/oauth2_tokens-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/oauth2_tokens-*/"
ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/plugins-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/plugins-*/"
ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/ratelimiting_metrics-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/ratelimiting_metrics-*/"
ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/routes-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/routes-*/"
ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/schema_meta-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/schema_meta-*/"
ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/services-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/services-*/"
ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/snis-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/snis-*/"
ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/tags-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/tags-*/"
ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/targets-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/targets-*/"
ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/upstreams-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/upstreams-*/"
ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/workspaces-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/workspaces-*/"
ssh ${username}@${hostname} "nodetool clearsnapshot -t copy"
echo "Success on Cassandra mv of snapshot files to destination: "${keyspace_destination}
ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} ratelimiting_metrics"
ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} oauth2_tokens"
ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} locks"
ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} snis"
ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} plugins"
ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} targets"
ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} consumers"
ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} upstreams"
ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} schema_meta"
ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} tags"
ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} acls"
ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} cluster_events"
ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} workspaces"
ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} jwt_secrets"
ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} apis"
ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} keyauth_credentials"
ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} oauth2_credentials"
ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} certificates"
ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} services"
ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} ca_certificates"
ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} oauth2_authorization_codes"
ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} routes"
echo "Success on Cassandra nodetool refresh on destination keyspace: "${keyspace_destination}
else
echo "Failed on Cassandra snapshot: "${hostname}
fi
done
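# Finish with a full repair of the cloned keyspace from the starting node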
if ssh ${username}@${start_host} "echo 2>&1"; then
ssh ${username}@${start_host} "nodetool repair -full ${keyspace_destination}"
echo "Success on Keyspace repair: "${start_host} ${keyspace_destination}
else
echo "Failed on Keyspace repair : "${start_host} ${keyspace_destination}
fi
echo "Script processing completed!"
}
cloneKeyspace "${1}" "${2}" "${3}" "${4}" "${5}"
So to get the creds to import and show up when I query the table properly afterward I do:
***@cqlsh> COPY kong_src.oauth2_credentials TO 'oauth2_credentials.dat';
Using 1 child processes
Starting copy of kong_desto.oauth2_credentials with columns [id, client_id, client_secret, client_type, consumer_id, created_at, hash_secret, name, redirect_uris, tags].
Processed: 81 rows; Rate: 46 rows/s; Avg. rate: 23 rows/s
81 rows exported to 1 files in 3.563 seconds.
****@cqlsh> COPY kong_desto.oauth2_credentials FROM 'oauth2_credentials.dat';
Welp trying it out now as is:
First the migrations up (which so far has never given me grief):
kong migrations up --db-timeout 120 --vv
2020/10/02 02:39:50 [debug] loading subsystems migrations...
2020/10/02 02:39:50 [verbose] retrieving keyspace schema state...
2020/10/02 02:39:50 [verbose] schema state retrieved
2020/10/02 02:39:50 [debug] loading subsystems migrations...
2020/10/02 02:39:50 [verbose] retrieving keyspace schema state...
2020/10/02 02:39:50 [verbose] schema state retrieved
2020/10/02 02:39:50 [debug] migrations to run:
core: 009_200_to_210, 010_210_to_211, 011_212_to_213
jwt: 003_200_to_210
acl: 003_200_to_210, 004_212_to_213
rate-limiting: 004_200_to_210
oauth2: 004_200_to_210, 005_210_to_211
2020/10/02 02:39:51 [info] migrating core on keyspace 'kong_prod2'...
2020/10/02 02:39:51 [debug] running migration: 009_200_to_210
2020/10/02 02:39:53 [info] core migrated up to: 009_200_to_210 (pending)
2020/10/02 02:39:53 [debug] running migration: 010_210_to_211
2020/10/02 02:39:53 [info] core migrated up to: 010_210_to_211 (pending)
2020/10/02 02:39:53 [debug] running migration: 011_212_to_213
2020/10/02 02:39:53 [info] core migrated up to: 011_212_to_213 (executed)
2020/10/02 02:39:53 [info] migrating jwt on keyspace 'kong_prod2'...
2020/10/02 02:39:53 [debug] running migration: 003_200_to_210
2020/10/02 02:39:54 [info] jwt migrated up to: 003_200_to_210 (pending)
2020/10/02 02:39:54 [info] migrating acl on keyspace 'kong_prod2'...
2020/10/02 02:39:54 [debug] running migration: 003_200_to_210
2020/10/02 02:39:54 [info] acl migrated up to: 003_200_to_210 (pending)
2020/10/02 02:39:54 [debug] running migration: 004_212_to_213
2020/10/02 02:39:54 [info] acl migrated up to: 004_212_to_213 (pending)
2020/10/02 02:39:54 [info] migrating rate-limiting on keyspace 'kong_prod2'...
2020/10/02 02:39:54 [debug] running migration: 004_200_to_210
2020/10/02 02:39:54 [info] rate-limiting migrated up to: 004_200_to_210 (executed)
2020/10/02 02:39:54 [info] migrating oauth2 on keyspace 'kong_prod2'...
2020/10/02 02:39:54 [debug] running migration: 004_200_to_210
2020/10/02 02:39:56 [info] oauth2 migrated up to: 004_200_to_210 (pending)
2020/10/02 02:39:56 [debug] running migration: 005_210_to_211
2020/10/02 02:39:56 [info] oauth2 migrated up to: 005_210_to_211 (pending)
2020/10/02 02:39:56 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/10/02 02:39:59 [verbose] Cassandra schema consensus: reached in 3049ms
2020/10/02 02:39:59 [info] 9 migrations processed
2020/10/02 02:39:59 [info] 2 executed
2020/10/02 02:39:59 [info] 7 pending
2020/10/02 02:39:59 [debug] loading subsystems migrations...
2020/10/02 02:39:59 [verbose] retrieving keyspace schema state...
2020/10/02 02:39:59 [verbose] schema state retrieved
2020/10/02 02:39:59 [info]
Database has pending migrations; run 'kong migrations finish' when ready
Welp, migrations finish hit more problems again. The good news is the 120 seconds was respected this time, it just never reached agreement. Look at how fast those core migrations reached consensus too, hah (3ms), while the other one just hangs :/ :
kong migrations finish --db-timeout 120 --vv
The environment variables are the same from my original post so cutting those out for brevity.
2020/10/02 02:42:24 [debug] loading subsystems migrations...
2020/10/02 02:42:24 [verbose] retrieving keyspace schema state...
2020/10/02 02:42:24 [verbose] schema state retrieved
2020/10/02 02:42:24 [debug] loading subsystems migrations...
2020/10/02 02:42:24 [verbose] retrieving keyspace schema state...
2020/10/02 02:42:25 [verbose] schema state retrieved
2020/10/02 02:42:25 [debug] pending migrations to finish:
core: 009_200_to_210, 010_210_to_211
jwt: 003_200_to_210
acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/10/02 02:42:25 [info] migrating core on keyspace 'kong_prod2'...
2020/10/02 02:42:25 [debug] running migration: 009_200_to_210
2020/10/02 02:42:37 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/10/02 02:42:37 [verbose] Cassandra schema consensus: reached in 3ms
2020/10/02 02:42:37 [info] core migrated up to: 009_200_to_210 (executed)
2020/10/02 02:42:37 [debug] running migration: 010_210_to_211
2020/10/02 02:42:37 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/10/02 02:42:37 [verbose] Cassandra schema consensus: reached in 3ms
2020/10/02 02:42:37 [info] core migrated up to: 010_210_to_211 (executed)
2020/10/02 02:42:37 [info] migrating jwt on keyspace 'kong_prod2'...
2020/10/02 02:42:37 [debug] running migration: 003_200_to_210
2020/10/02 02:42:38 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/10/02 02:44:38 [verbose] Cassandra schema consensus: not reached in 120095ms
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to wait for schema consensus: [Read failure] Operation failed - received 0 responses and 1 failures
stack traceback:
[C]: in function 'assert'
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
[C]: in function 'xpcall'
/usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
[C]: in function 'pcall'
/usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
/usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
[C]: in function 'xpcall'
/usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
/usr/bin/kong:9: in function 'file_gen'
init_worker_by_lua:50: in function <init_worker_by_lua:48>
[C]: in function 'xpcall'
init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
[C]: in function 'error'
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
/usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
[C]: in function 'xpcall'
/usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
/usr/bin/kong:9: in function 'file_gen'
init_worker_by_lua:50: in function <init_worker_by_lua:48>
[C]: in function 'xpcall'
init_worker_by_lua:57: in function <init_worker_by_lua:55>
Going to execute kong migrations finish multiple times until finish "completes" and will check the keyspace.
Second finish: again it did wait the 120 seconds, but it didn't seem to successfully reach a consensus:
2020/10/02 02:47:53 [debug] loading subsystems migrations...
2020/10/02 02:47:53 [verbose] retrieving keyspace schema state...
2020/10/02 02:47:53 [verbose] schema state retrieved
2020/10/02 02:47:53 [debug] loading subsystems migrations...
2020/10/02 02:47:53 [verbose] retrieving keyspace schema state...
2020/10/02 02:47:53 [verbose] schema state retrieved
2020/10/02 02:47:53 [debug] pending migrations to finish:
acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/10/02 02:47:53 [info] migrating acl on keyspace 'kong_prod2'...
2020/10/02 02:47:53 [debug] running migration: 003_200_to_210
2020/10/02 02:48:00 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/10/02 02:50:00 [verbose] Cassandra schema consensus: not reached in 120002ms
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to wait for schema consensus: [Read failure] Operation failed - received 0 responses and 1 failures
stack traceback:
[C]: in function 'assert'
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
[C]: in function 'xpcall'
/usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
[C]: in function 'pcall'
/usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
/usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
[C]: in function 'xpcall'
/usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
/usr/bin/kong:9: in function 'file_gen'
init_worker_by_lua:50: in function <init_worker_by_lua:48>
[C]: in function 'xpcall'
init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
[C]: in function 'error'
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
/usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
[C]: in function 'xpcall'
/usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
/usr/bin/kong:9: in function 'file_gen'
init_worker_by_lua:50: in function <init_worker_by_lua:48>
[C]: in function 'xpcall'
init_worker_by_lua:57: in function <init_worker_by_lua:55>
3rd :
2020/10/02 02:51:31 [debug] loading subsystems migrations...
2020/10/02 02:51:31 [verbose] retrieving keyspace schema state...
2020/10/02 02:51:32 [verbose] schema state retrieved
2020/10/02 02:51:32 [debug] loading subsystems migrations...
2020/10/02 02:51:32 [verbose] retrieving keyspace schema state...
2020/10/02 02:51:32 [verbose] schema state retrieved
2020/10/02 02:51:32 [debug] pending migrations to finish:
acl: 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/10/02 02:51:32 [info] migrating acl on keyspace 'kong_prod2'...
2020/10/02 02:51:32 [debug] running migration: 004_212_to_213
2020/10/02 02:51:32 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/10/02 02:53:32 [verbose] Cassandra schema consensus: not reached in 120024ms
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to wait for schema consensus: [Read failure] Operation failed - received 0 responses and 1 failures
stack traceback:
[C]: in function 'assert'
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
[C]: in function 'xpcall'
/usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
[C]: in function 'pcall'
/usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
/usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
[C]: in function 'xpcall'
/usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
/usr/bin/kong:9: in function 'file_gen'
init_worker_by_lua:50: in function <init_worker_by_lua:48>
[C]: in function 'xpcall'
init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
[C]: in function 'error'
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
/usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
[C]: in function 'xpcall'
/usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
/usr/bin/kong:9: in function 'file_gen'
init_worker_by_lua:50: in function <init_worker_by_lua:48>
[C]: in function 'xpcall'
init_worker_by_lua:57: in function <init_worker_by_lua:55>
4th, which oddly succeeded the rest of the way???
2020/10/02 02:54:41 [debug] loading subsystems migrations...
2020/10/02 02:54:41 [verbose] retrieving keyspace schema state...
2020/10/02 02:54:41 [verbose] schema state retrieved
2020/10/02 02:54:41 [debug] loading subsystems migrations...
2020/10/02 02:54:41 [verbose] retrieving keyspace schema state...
2020/10/02 02:54:41 [verbose] schema state retrieved
2020/10/02 02:54:41 [debug] pending migrations to finish:
oauth2: 004_200_to_210, 005_210_to_211
2020/10/02 02:54:41 [info] migrating oauth2 on keyspace 'kong_prod2'...
2020/10/02 02:54:41 [debug] running migration: 004_200_to_210
2020/10/02 02:54:43 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/10/02 02:54:43 [verbose] Cassandra schema consensus: reached in 3ms
2020/10/02 02:54:43 [info] oauth2 migrated up to: 004_200_to_210 (executed)
2020/10/02 02:54:43 [debug] running migration: 005_210_to_211
2020/10/02 02:54:44 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/10/02 02:54:44 [verbose] Cassandra schema consensus: reached in 3ms
2020/10/02 02:54:44 [info] oauth2 migrated up to: 005_210_to_211 (executed)
2020/10/02 02:54:44 [info] 2 migrations processed
2020/10/02 02:54:44 [info] 2 executed
2020/10/02 02:54:44 [debug] loading subsystems migrations...
2020/10/02 02:54:44 [verbose] retrieving keyspace schema state...
2020/10/02 02:54:44 [verbose] schema state retrieved
2020/10/02 02:54:44 [info] No pending migrations to finish
And a migrations list command after:
2020/10/02 02:57:29 [debug] loading subsystems migrations...
2020/10/02 02:57:29 [verbose] retrieving keyspace schema state...
2020/10/02 02:57:30 [verbose] schema state retrieved
2020/10/02 02:57:30 [info] Executed migrations:
core: 000_base, 003_100_to_110, 004_110_to_120, 005_120_to_130, 006_130_to_140, 007_140_to_150, 008_150_to_200, 009_200_to_210, 010_210_to_211, 011_212_to_213
jwt: 000_base_jwt, 002_130_to_140, 003_200_to_210
acl: 000_base_acl, 002_130_to_140, 003_200_to_210, 004_212_to_213
rate-limiting: 000_base_rate_limiting, 003_10_to_112, 004_200_to_210
oauth2: 000_base_oauth2, 003_130_to_140, 004_200_to_210, 005_210_to_211
And a finish command again just to ensure it says no more migrations:
2020/10/02 02:58:12 [debug] loading subsystems migrations...
2020/10/02 02:58:12 [verbose] retrieving keyspace schema state...
2020/10/02 02:58:12 [verbose] schema state retrieved
2020/10/02 02:58:13 [debug] loading subsystems migrations...
2020/10/02 02:58:13 [verbose] retrieving keyspace schema state...
2020/10/02 02:58:13 [verbose] schema state retrieved
2020/10/02 02:58:13 [info] No pending migrations to finish
Time to check the keyspace and see if we got a bunch of workspaces injected in there. It is still concerning that I have to execute multiple times with failures, but ultimately if it actually makes the right changes I am okay running it multiple times 😆.
Well some excitement here! It seems I am no longer getting the duplicated workspace id's in the table!!!
[cqlsh 5.0.1 | Cassandra 3.11.1 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
*****@cqlsh> use kong_prod2;
*****@cqlsh:kong_prod2> describe tables;
ratelimiting_metrics response_ratelimiting_metrics oauth2_credentials
oauth2_tokens tags certificates
locks acls services
snis cluster_events ca_certificates
plugins workspaces oauth2_authorization_codes
targets jwt_secrets routes
consumers apis hmacauth_credentials
upstreams basicauth_credentials
schema_meta keyauth_credentials
*****@cqlsh:kong_prod2> select * from workspaces;
id | comment | config | created_at | meta | name
--------------------------------------+---------+--------+---------------------------------+------+---------
c709a0a2-687f-43c3-a46a-7bd437dc5c24 | null | null | 2020-10-02 02:42:24.000000+0000 | null | default
(1 rows)
kongdba@cqlsh:kong_prod2>
Let's go ahead and verify the other tables are using this id all over the place.
ACLs, LGTM with respect to the ws_id and how it's appended:
***@cqlsh:kong_prod2> select * from acls limit 5;
id | cache_key | consumer_id | created_at | group | tags | ws_id
--------------------------------------+------------------------------------------------------------------------------------------------------------------------+--------------------------------------+---------------------------------+--------------------------------------+------+--------------------------------------
187d0487-176b-454c-8a24-9364eafa13e3 | acls:34fe7bbd-7470-4b02-9886-2f0445900ec6:d4d07b18-35d7-402a-a65a-e0f77950de97::::c709a0a2-687f-43c3-a46a-7bd437dc5c24 | 34fe7bbd-7470-4b02-9886-2f0445900ec6 | 2019-12-16 08:01:59.000000+0000 | d4d07b18-35d7-402a-a65a-e0f77950de97 | null | c709a0a2-687f-43c3-a46a-7bd437dc5c24
753c13ec-e781-4034-9126-3170c10144e9 | acls:97b8b271-da5e-4aff-afca-bfa0c1dad70f:555bd712-a309-414e-9145-1b3af9a79559::::c709a0a2-687f-43c3-a46a-7bd437dc5c24 | 97b8b271-da5e-4aff-afca-bfa0c1dad70f | 2020-06-04 19:56:42.000000+0000 | 555bd712-a309-414e-9145-1b3af9a79559 | null | c709a0a2-687f-43c3-a46a-7bd437dc5c24
4e9a2da9-b453-4c21-899f-3768725bfb6c | acls:aee2dab7-755d-4900-80f6-7c795ae0c51e:a56539e7-b061-48b5-a86b-646aab4548e8::::c709a0a2-687f-43c3-a46a-7bd437dc5c24 | aee2dab7-755d-4900-80f6-7c795ae0c51e | 2020-03-27 20:17:44.000000+0000 | a56539e7-b061-48b5-a86b-646aab4548e8 | null | c709a0a2-687f-43c3-a46a-7bd437dc5c24
ef8a963f-5cbb-4065-9c54-51e85372c50a | acls:39af1af3-4720-46a6-ad3a-13607e835e3f:d7fd7390-d872-47aa-8eba-c1e436ccd08c::::c709a0a2-687f-43c3-a46a-7bd437dc5c24 | 39af1af3-4720-46a6-ad3a-13607e835e3f | 2020-01-07 11:14:18.000000+0000 | d7fd7390-d872-47aa-8eba-c1e436ccd08c | null | c709a0a2-687f-43c3-a46a-7bd437dc5c24
b75330ea-8adc-4124-8406-ed9968115cc3 | acls:191c101d-1394-41c2-8e2a-520cbfb0d505:e2b94d6e-17d3-4757-9022-441a4b28e534::::c709a0a2-687f-43c3-a46a-7bd437dc5c24 | 191c101d-1394-41c2-8e2a-520cbfb0d505 | 2020-01-13 11:44:06.000000+0000 | e2b94d6e-17d3-4757-9022-441a4b28e534 | null | c709a0a2-687f-43c3-a46a-7bd437dc5c24
(5 rows)
kongdba@cqlsh:kong_prod2>
ROUTES lgtm:
*****@cqlsh:kong_prod2> select * from routes limit 1;
partition | id | created_at | destinations | headers | hosts | https_redirect_status_code | methods | name | path_handling | paths | preserve_host | protocols | regex_priority | service_id | snis | sources | strip_path | tags | updated_at | ws_id
-----------+--------------------------------------+---------------------------------+--------------+---------+-------+----------------------------+---------+------+---------------+--------------------------+---------------+-------------------+----------------+--------------------------------------+------+---------+------------+------+---------------------------------+--------------------------------------
routes | 001eeede-9d8c-44fa-8737-e3f07ff9c876 | 2020-01-13 11:44:04.000000+0000 | null | null | null | 426 | null | null | v1 | ['/some/path/v1'] | False | {'http', 'https'} | 0 | 8afa2fa1-29f5-4337-9b92-5d01814916de | null | null | True | null | 2020-01-13 11:44:04.000000+0000 | c709a0a2-687f-43c3-a46a-7bd437dc5c24
(1 rows)
****@cqlsh:kong_prod2>
Services lgtm:
****@cqlsh:kong_prod2> select * from services limit 1;
partition | id | ca_certificates | client_certificate_id | connect_timeout | created_at | host | name | path | port | protocol | read_timeout | retries | tags | tls_verify | tls_verify_depth | updated_at | write_timeout | ws_id
-----------+--------------------------------------+-----------------+-----------------------+-----------------+---------------------------------+----------------+-------------------------------------------------------------------+----------------------------------------------+------+----------+--------------+---------+------+------------+------------------+---------------------------------+---------------+--------------------------------------
services | 005a0644-06b4-4b2d-8eca-6dec316f0113 | null | null | 2000 | 2019-05-23 21:21:04.000000+0000 | service.company.com | c709a0a2-687f-43c3-a46a-7bd437dc5c24:c360.prod.service.here.v5 | /api/my/prod/url/v5.0/search | 443 | https | 9000 | 0 | null | null | null | 2019-05-23 21:21:04.000000+0000 | 9000 | c709a0a2-687f-43c3-a46a-7bd437dc5c24
(1 rows)
******@cqlsh:kong_prod2>
jwt_secrets, LGTM:
****@cqlsh:kong_prod2> select * from jwt_secrets limit 1;
id | algorithm | consumer_id | created_at | key | rsa_public_key | secret | tags | ws_id
--------------------------------------+-----------+--------------------------------------+---------------------------------+-----------------------------------------------------------------------+----------------+----------------------------------+------+--------------------------------------
b8fb26c4-ec3d-4d38-be5b-f1cc0e27aa78 | HS256 | 623021d2-dc20-4402-be44-582607a8fcb3 | 2019-06-05 05:44:10.000000+0000 | c709a0a2-687f-43c3-a46a-7bd437dc5c24:XXXXXXX | null | XXXXXXXX | null | c709a0a2-687f-43c3-a46a-7bd437dc5c24
(1 rows)
*****@cqlsh:kong_prod2>
oauth2_credentials, LGTM:
****@cqlsh:kong_prod2> select * from oauth2_credentials limit 1;
id | client_id | client_secret | client_type | consumer_id | created_at | hash_secret | name | redirect_uris | tags | ws_id
--------------------------------------+-----------------------------------------------------------------------+----------------------------------+-------------+--------------------------------------+---------------------------------+-------------+-------------+----------------------+------+--------------------------------------
cf2f541c-7864-4be8-a9c8-cc1b57d50399 | c709a0a2-687f-43c3-a46a-7bd437dc5c24:XXXXXX | XXXXXX | null | XXXXXXX | 2018-08-23 04:49:23.966000+0000 | null | OAuth_Creds | {'http://company.com'} | null | c709a0a2-687f-43c3-a46a-7bd437dc5c24
(1 rows)
****@cqlsh:kong_prod2>
Consumers, LGTM:
*****@cqlsh:kong_prod2> select * from consumers limit 1;
id | created_at | custom_id | tags | username | ws_id
--------------------------------------+---------------------------------+-----------+------+-------------------------------------------+--------------------------------------
4a7a7568-589a-42f6-a488-639dfb1ec21a | 2020-05-12 21:11:13.000000+0000 | null | null | c709a0a2-687f-43c3-a46a-7bd437dc5c24:username1 | c709a0a2-687f-43c3-a46a-7bd437dc5c24
(1 rows)
*****@cqlsh:kong_prod2>
oauth2_tokens, no example records there sadly; maybe they were already garbage collected during my migration and this detective work:
*****@cqlsh:kong_prod2> select * from oauth2_tokens limit 1;
id | access_token | authenticated_userid | created_at | credential_id | expires_in | refresh_token | scope | service_id | token_type | ws_id
----+--------------+----------------------+------------+---------------+------------+---------------+-------+------------+------------+-------
(0 rows)
*****@cqlsh:kong_prod2>
Will likely run this whole process again and check this specific table by itself afterward to ensure any existing tokens get ws_id set properly ^ .
cluster_events:
****@cqlsh:kong_prod2> select * from cluster_events limit 1;
channel | at | node_id | id | data | nbf
---------+----+---------+----+------+-----
(0 rows)
****@cqlsh:kong_prod2>
Not modified with a ws_id in the table structure, huh? Hmm.
And lastly plugin example, looks good to me:
*****@cqlsh:kong_prod2> select * from plugins limit 1;
id | api_id | cache_key | config | consumer_id | created_at | enabled | name | protocols | route_id | run_on | service_id | tags | ws_id
--------------------------------------+--------+------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------+-------------+---------------------------------+---------+------+-----------+--------------------------------------+--------+------------+------+--------------------------------------
638b8470-fc50-45f7-964e-2a6155f32c72 | null | plugins:acl:fb77c07a-4f9f-408f-969c-bdf447aae0ad::::c709a0a2-687f-43c3-a46a-7bd437dc5c24 | {"deny":null,"allow":["fb77c07a-4f9f-408f-969c-bdf447aae0ad"],"hide_groups_header":true} | null | 2019-01-28 18:24:57.000000+0000 | True | acl | null | fb77c07a-4f9f-408f-969c-bdf447aae0ad | first | null | null | c709a0a2-687f-43c3-a46a-7bd437dc5c24
(1 rows)
*****@cqlsh:kong_prod2>
Edit:
So oauth2_tokens is where I am unsure, because I have no record data to confirm the migration would have worked for that table; likely the tokens expired during that window. If the token data was somehow not cloned over to my temp keyspace I will need to check that too. Also, cluster_events not having a ws_id seems odd to me, but I guess that is how the design is for now (I would think if Kong control planes ever get workspace segregation then ws_id would have to be integrated with cluster_events as well, just my novice speculation 😆 ).
So we still don't seem to have a fix for the finish command sometimes failing to read and reach schema consensus (1 failure this run). Could it be related to tables with thousands of records and paging again? Runtime Kong C* paging was fixed by a contributor a while back; could the migrations process be using the wrong paging calls and hitting failures due to some variable scope problem?
But the good news is there are now no duplicate workspaces AND the tables that needed to be modified were modified correctly, eh? So it seems we are getting positive results here.
I also have 2 environments like this: I tested the migrations on one and am leaving the other's cloned keyspace untouched for now, so if you have any further tweaks or logging you want me to add to the files, I can run a second test with them.
@jeremyjpj0916,
So it really looks like when this happens, it stays like that until you try again. Now I am thinking: should we close the connection before we check schema consensus? The error does not happen on the migrations themselves, but on code that is run between the migrations (which is less worrying). Good to know that the multiple-workspace issue seems to be fixed.
@bungle right, having to execute the finish multiple times is sub-optimal in its current state, at least for our use case. I think honoring the set schema consensus timeout was still a good change, but ultimately I will set that much lower than 120 because it does not seem to be the problem with my cluster; the extra wait isn't helping. Do you have a snippet I can drop into one of these files that will close the connection and re-open it before checking schema consensus, as a test to confirm your suspicions? I can use that on our second prod environment that has a cloned keyspace ready for upgrading 😁 . The perfect fix for me would be one that runs all the way through without me having to execute the finish over and over.
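Something like this is roughly what I was picturing dropping in as a test (only a sketch from my side; I'm assuming the connector's close()/connect() can simply be called around wait_for_schema_consensus like this, since those are the methods that keep coming up in this thread):
-- sketch only: force a fresh connection right before the consensus check
-- (assumes connector:close(), connector:connect() and
--  connector:wait_for_schema_consensus() behave as their names suggest)
local function wait_for_consensus_on_fresh_connection(connector)
  connector:close()                     -- drop whatever the migrations left behind

  local ok, err = connector:connect()   -- open a brand new connection
  if not ok then
    return nil, err
  end

  return connector:wait_for_schema_consensus()
end

return wait_for_consensus_on_fresh_connection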
Then after that I will probably do this clone keyspace and upgrade thing like 10-20 times in a row to feel really good that Kong will be stable on the jump from 2.0.5 to 2.1.x(whatever version this ends up 2.1.5?) .
Also, I am sure it is important that any alterations to the migration process, while making it work better for a multi-DC 6-node C* cluster with high replication and LOCAL_QUORUM read/write, still need to function okay against a single node etc. As shown in these latest 2.1.x upgrades, it can sometimes be a challenge to make something that fits the best of both worlds. Just for stats, these environments have a few thousand service/route resources, roughly 5k plugins, plenty of oauth2 tokens and a number of ACLs. Not super large DBs in terms of data size, but decent at this point.
But hey at least the changes thus far made the finish command idempotent (@rsbrisci fancy big word, not mine). So the ws id alterations look to be correct!
@jeremyjpj0916, are there any errors logged by Cassandra when this happens? Or close to when it happens?
@bungle will check on this during the work day and get back to you on whether I see anything interesting in the C* logs. I probably should have been doing exactly that while performing these upgrades, but it slipped my mind.
Just going to break down the logs I can see on the node I believe we would have hit for migrations.
First up:
From server05503 looking on the 10/1 date.
/var/log/cassandra/cassandra.log:
The only error on 10/1 came from the keyspace cloning step, where my shell script copied oauth2_credentials over (which I ended up doing manually using COPY table commands as described earlier):
ERROR [SSTableBatchOpen:1] 2020-10-01 22:29:35,338 CassandraDaemon.java:235 - Exception in thread Thread[SSTableBatchOpen:1,5,main]
java.lang.RuntimeException: Unknown column redirect_uri during deserialization
at org.apache.cassandra.db.SerializationHeader$Component.toHeader(SerializationHeader.java:338) ~[apache-cassandra-3.11.8.jar:3.11.8]
at org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:522) ~[apache-cassandra-3.11.8.jar:3.11.8]
at org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:385) ~[apache-cassandra-3.11.8.jar:3.11.8]
at org.apache.cassandra.io.sstable.format.SSTableReader$3.run(SSTableReader.java:570) ~[apache-cassandra-3.11.8.jar:3.11.8]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_232]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_232]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_232]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_232]
at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84) [apache-cassandra-3.11.8.jar:3.11.8]
at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_232]
Next up:
/var/log/cassandra/system.log
ERROR [Reference-Reaper:1] 2020-10-01 20:36:03,866 Ref.java:224 - LEAK DETECTED: a reference (org.apache.cassandra.utils.concurrent.Ref$State@69a76445) to class org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier@1838738857:/data/cassandra/data/kong_prod2/oauth2_credentials-feb3d290044e11ebafdba3cdf7a167d7/mc-1-big was not released before the reference was garbage collected
ERROR [SSTableBatchOpen:1] 2020-10-01 22:29:35,338 CassandraDaemon.java:235 - Exception in thread Thread[SSTableBatchOpen:1,5,main]
java.lang.RuntimeException: Unknown column redirect_uri during deserialization
at org.apache.cassandra.db.SerializationHeader$Component.toHeader(SerializationHeader.java:338) ~[apache-cassandra-3.11.8.jar:3.11.8]
at org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:522) ~[apache-cassandra-3.11.8.jar:3.11.8]
at org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:385) ~[apache-cassandra-3.11.8.jar:3.11.8]
at org.apache.cassandra.io.sstable.format.SSTableReader$3.run(SSTableReader.java:570) ~[apache-cassandra-3.11.8.jar:3.11.8]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_232]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_232]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_232]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_232]
at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84) [apache-cassandra-3.11.8.jar:3.11.8]
at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_232]
But both of those log entries were from before the kong migrations finish command would have run... the second error was during the keyspace copy (similar to the one found in cassandra.log), and the first error was two hours before that and unrelated to the migrations effort. Unless the migrations hit a different node and I need to check each node independently; it doesn't seem like I can tell from the debug prints which node was used to handle the migrations work (unless I removed those lines from the logs above, thinking they would not be relevant, prior to posting).
Gonna check the other nodes too.
server05504 had this in /var/log/cassandra/system.log
ERROR [MessagingService-Incoming-/10.204.90.234] 2020-10-01 22:30:41,625 CassandraDaemon.java:228 - Exception in thread Thread[MessagingService-Incoming-/10.204.90.234,5,main]
java.lang.RuntimeException: Unknown column kind during deserialization
at org.apache.cassandra.db.Columns$Serializer.deserialize(Columns.java:452) ~[apache-cassandra-3.11.2.jar:3.11.2]
at org.apache.cassandra.db.SerializationHeader$Serializer.deserializeForMessaging(SerializationHeader.java:412) ~[apache-cassandra-3.11.2.jar:3.11.2]
at org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer.deserializeHeader(UnfilteredRowIteratorSerializer.java:195) ~[apache-cassandra-3.11.2.jar:3.11.2]
at org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.deserialize30(PartitionUpdate.java:835) ~[apache-cassandra-3.11.2.jar:3.11.2]
at org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.deserialize(PartitionUpdate.java:823) ~[apache-cassandra-3.11.2.jar:3.11.2]
at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:425) ~[apache-cassandra-3.11.2.jar:3.11.2]
at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:434) ~[apache-cassandra-3.11.2.jar:3.11.2]
at org.apache.cassandra.service.MigrationManager$MigrationsSerializer.deserialize(MigrationManager.java:669) ~[apache-cassandra-3.11.2.jar:3.11.2]
at org.apache.cassandra.service.MigrationManager$MigrationsSerializer.deserialize(MigrationManager.java:652) ~[apache-cassandra-3.11.2.jar:3.11.2]
at org.apache.cassandra.net.MessageIn.read(MessageIn.java:123) ~[apache-cassandra-3.11.2.jar:3.11.2]
at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:192) ~[apache-cassandra-3.11.2.jar:3.11.2]
at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:180) ~[apache-cassandra-3.11.2.jar:3.11.2]
at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:94) ~[apache-cassandra-3.11.2.jar:3.11.2]
ERROR [MessagingService-Incoming-/10.204.90.234] 2020-10-01 22:30:42,569 CassandraDaemon.java:228 - Exception in thread Thread[MessagingService-Incoming-/10.204.90.234,5,main]
java.lang.RuntimeException: Unknown column kind during deserialization
at org.apache.cassandra.db.Columns$Serializer.deserialize(Columns.java:452) ~[apache-cassandra-3.11.2.jar:3.11.2]
at org.apache.cassandra.db.SerializationHeader$Serializer.deserializeForMessaging(SerializationHeader.java:412) ~[apache-cassandra-3.11.2.jar:3.11.2]
at org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer.deserializeHeader(UnfilteredRowIteratorSerializer.java:195) ~[apache-cassandra-3.11.2.jar:3.11.2]
at org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.deserialize30(PartitionUpdate.java:835) ~[apache-cassandra-3.11.2.jar:3.11.2]
at org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.deserialize(PartitionUpdate.java:823) ~[apache-cassandra-3.11.2.jar:3.11.2]
at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:425) ~[apache-cassandra-3.11.2.jar:3.11.2]
at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:434) ~[apache-cassandra-3.11.2.jar:3.11.2]
at org.apache.cassandra.service.MigrationManager$MigrationsSerializer.deserialize(MigrationManager.java:669) ~[apache-cassandra-3.11.2.jar:3.11.2]
at org.apache.cassandra.service.MigrationManager$MigrationsSerializer.deserialize(MigrationManager.java:652) ~[apache-cassandra-3.11.2.jar:3.11.2]
at org.apache.cassandra.net.MessageIn.read(MessageIn.java:123) ~[apache-cassandra-3.11.2.jar:3.11.2]
at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:192) ~[apache-cassandra-3.11.2.jar:3.11.2]
at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:180) ~[apache-cassandra-3.11.2.jar:3.11.2]
at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:94) ~[apache-cassandra-3.11.2.jar:3.11.2]
But again, I believe that is not even during the migration executions but likely during the keyspace-clone-related work, because the timestamp aligns with the redirect_uri error from the other server above (granted, I don't remember the "kind" error showing up in output before, only the redirect_uri one associated with oauth2 creds during the keyspace clone and data copy)...
Checking the final node in this DC for local quorum oddities. EDIT - Nope, no extra logs there....
Some extra coordinator settings in my cassandra.yaml from a given node:
# How long the coordinator should wait for read operations to complete
read_request_timeout_in_ms: 5000
# How long the coordinator should wait for seq or index scans to complete
range_request_timeout_in_ms: 10000
# How long the coordinator should wait for writes to complete
write_request_timeout_in_ms: 5000
# How long the coordinator should wait for counter writes to complete
counter_write_request_timeout_in_ms: 5000
# How long a coordinator should continue to retry a CAS operation
# that contends with other proposals for the same row
cas_contention_timeout_in_ms: 1000
# How long the coordinator should wait for truncates to complete
# (This can be much longer, because unless auto_snapshot is disabled
# we need to flush first so we can snapshot before removing the data.)
truncate_request_timeout_in_ms: 60000
# The default timeout for other, miscellaneous operations
request_timeout_in_ms: 10000
@bungle so the fix that prevents the duplicate workspaces is going into 2.2.0 now and not a 2.1.5 patch, correct? So I should wait and potentially upgrade my prod nodes from 2.0.5 to 2.2.0 to avoid the ws migration errors I have been seeing (even if we can't get to the bottom of why the finish does not execute smoothly all the way through for us and we have to run finish multiple times).
@jeremyjpj0916 the removal of the migration helpers and the passing of the connection was considered breaking, so it could not go into 2.1.5 anymore. In my opinion that is overly cautious, but I guess better safe than sorry. Overly cautious in that it could only break custom plugins that use the undocumented helpers (from which we have already removed functions in the past without notice... I just saw that they were not used anywhere, thus removed them and instead passed the connection/coordinator to teardown). The signature before was (connector, undocumented_helpers), the signature now is (connector, connection/coordinator).
Perhaps instead of (connector, connection/coordinator) we should just pass a migration_executor. But that is a far bigger change and even more breaking.
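To make that concrete, here is roughly the shape of a migration file under the new signature (illustrative only, trimmed to the cassandra part; the teardown body is just a placeholder query, not the code from the PR):
-- rough illustration of the teardown signature change described above
return {
  cassandra = {
    up = [[]],

    -- before: teardown = function(connector, undocumented_helpers) ... end
    -- now: the framework hands the connection/coordinator straight in
    teardown = function(connector, coordinator)
      -- placeholder body: teardown work would use the coordinator directly
      local rows, err = coordinator:execute(
        "SELECT id FROM workspaces WHERE name = 'default'")
      if err then
        return nil, err
      end
      return rows ~= nil
    end,
  },
}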
These commits are the problematic ones for a patch release: https://github.com/Kong/kong/pull/6411/commits/998d13efb14b11290abd411397ed051d2208e26b https://github.com/Kong/kong/pull/6411/commits/ed9f373d94ea5ce98d250319bd390093812368ed
We could perhaps try to backport or somehow modify the PR so that the duplicate-workspace prevention code can go into a patch release. The wait for schema consensus is something I am still not sure gets fixed by, e.g., passing the coordinator. It somehow feels like the connection becomes unstable after running some migrations, and then it cannot be used anymore to call wait for schema consensus. Perhaps we should not use the connector there either, but just call the coordinator. But then, what if that coordinator becomes unstable too? I have tried to track down whether any of our code calls :setkeepalive or :close on the connection that is grabbed before we call wait_for_schema_consensus, but I could not find anything.
Restarting finish surely cleans up all the internal structures and grabs a new connection... but I still do not understand why the connection goes bad (can it become bad because of a network issue or something, or after running too many migrations?).
@bungle
Gotcha. I suppose I can also just roll with 2.1.5 plus the exact files you had me test from your pending PR, roll that up through prod, and then move to 2.2.x at a later time; otherwise the jump gets bigger and I would need to vet 2.2.x for problems, and I think 2.0.5 to 2.1.5 is a big enough jump as is for prod.
Yeah, I was hoping the C* logs might reveal a problem, but so far no dice. If the Kong team gets any ideas for a debug print line, either in the Kong layer or in the underlying C* client lib, that would be insightful for an RCA, let me know and I will give that a test run too.
@jeremyjpj0916, the PR I did was not mergeable anywhere ;-(. So back to the drawing board; here is a fix that can go to 2.1.5 and to 2.2.0: https://github.com/Kong/kong/pull/6439
Were these ultimately the same changes, or would a retest be helpful to you here?
@jeremyjpj0916,
If you have time to try out #6439 and give us feedback, that would be great. I do expect the multiple default workspace issue to be solved (there are two fixes for that: 1. try to create it only once, 2. use proper consistency when creating and reading). The other issue is trickier. I found a couple of new places, but ultimately I don't think I found anything that could solve it. I didn't do a reconnect because it feels like a workaround that just hides a possible issue that needs to be fixed. The biggest change is that we never call connect_migrations on migrations anymore (only get_stored_connection), and let the framework handle the connections. As always, thanks for cooperating on this; all feedback is welcome. I had to rework the PR a bit, so it is also good to give us feedback if something went wrong with the re-implementation (I tried to be careful).
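In other words, the teardown side now just reuses whatever connection the framework already holds, roughly like this (a sketch of the idea only, not the code in the PR; the helper name is mine):
-- sketch of the idea only: migrations no longer open their own connection
-- via connect_migrations(); they reuse the one the framework stored
-- (assumes get_stored_connection() returns the live coordinator or nil)
local function coordinator_for_teardown(connector)
  local coordinator = connector:get_stored_connection()
  if not coordinator then
    return nil, "no stored connection; the framework should have connected already"
  end
  return coordinator
end

return coordinator_for_teardown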
Still, it feels like the connection gets corrupted at some point. But it is strange where it does so: migration ok, recording the migration ok, but waiting for schema consensus not ok. Perhaps it is the Cassandra server that closes the connection. Something along the lines of this: https://stackoverflow.com/questions/39343475/how-does-cassandra-server-identifies-crashed-client-connections. Could it be that there is some migration that makes Cassandra think the client has crashed, or something?
One thing to note: on your last try the jwt migration failed once and the acl plugin migration failed twice. Some notes about the ACL migration:
uniques = {}
Anything else in your migrations went ok (on that last try).
The only thing I think those migrations do differently is that they run UPDATEs while iterating the table. So could it be that iterating goes wrong when you have enough entities and at the same time modify that table a lot? @jeremyjpj0916, I'll create you a separate branch that splits the iterate+update into two loops, so let's see if that makes a difference.
@jeremyjpj0916,
this branch has one additional commit: https://github.com/Kong/kong/commits/fix/migrations-master-iteration-separated
That moves the updates out of the iteration. It does this only for the migrations you saw as problematic; there are older migrations with a similar pattern of updates inside the iterator. If you could do a test run with both #6439 and fix/migrations-master-iteration-separated, that would help. All results are good as they close off possibilities.
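Conceptually the difference is just this (a sketch only, not the actual branch code; I am using a hypothetical table t and assuming the coordinator exposes lua-cassandra style iterate()/execute()):
local cassandra = require "cassandra"

-- before: run the UPDATEs while still paging through the table
local function update_while_iterating(coordinator, ws_id)
  for rows, err in coordinator:iterate("SELECT id FROM t") do
    if err then return nil, err end
    for _, row in ipairs(rows) do
      local _, uerr = coordinator:execute(
        "UPDATE t SET ws_id = ? WHERE id = ?",
        { cassandra.uuid(ws_id), cassandra.uuid(row.id) })
      if uerr then return nil, uerr end
    end
  end
  return true
end

-- after: collect the ids first, then run the UPDATEs in a second loop
local function update_after_iterating(coordinator, ws_id)
  local ids = {}
  for rows, err in coordinator:iterate("SELECT id FROM t") do
    if err then return nil, err end
    for _, row in ipairs(rows) do
      ids[#ids + 1] = row.id
    end
  end

  for _, id in ipairs(ids) do
    local _, uerr = coordinator:execute(
      "UPDATE t SET ws_id = ? WHERE id = ?",
      { cassandra.uuid(ws_id), cassandra.uuid(id) })
    if uerr then return nil, uerr end
  end
  return true
end

return {
  update_while_iterating = update_while_iterating,
  update_after_iterating = update_after_iterating,
}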
Gotcha, will give both of those a go and let you know what I see. I am revamping our keyspace migration scripting to be able to pull down that keyspace and its data and clone it onto a different environment's infrastructure (I have been cloning on prod infra, and we are in peak season so that seems wrong to keep doing, haha; going to use Ansible to push and pull some files around for my v2 iteration). Should have some info later this week or early next week from testing.
Thanks again for your diligence @bungle .
Rewrote my keyspace copy as an Ansible playbook that can send a keyspace from a source C* cluster node to a totally different C* cluster node (assuming the keyspace does not already exist on the destination; no checks or anything fancy here):
---
- hosts: src_cassandra_node
  become: true
  tasks:
    - name: Make data directory for the clone files
      file:
        path: /data/cassandra/clone
        state: directory
    - name: Move into the clone directory and export the keyspace schema
      command: bash -c "cd /data/cassandra/clone && cqlsh --ssl -e 'DESCRIBE KEYSPACE {{ KEYSPACE_SOURCE_NAME }}' > {{ KEYSPACE_SOURCE_NAME }}.txt"
    - name: Alter keyspace schema name for later import in other cluster
      command: bash -c "cd /data/cassandra/clone && sed -i 's/{{ KEYSPACE_SOURCE_NAME }}/{{ KEYSPACE_DESTINATION_NAME }}/g' {{ KEYSPACE_SOURCE_NAME }}.txt"
    - name: Export table data into logical files for import later
      command: bash -c "cd /data/cassandra/clone && cqlsh --ssl -e \"use {{ KEYSPACE_SOURCE_NAME }}; describe tables;\" | tr \" \" \"\\n\" | sort | sed -r '/^\s*$/d' | xargs -I % cqlsh --ssl -e \"COPY {{ KEYSPACE_SOURCE_NAME }}.% TO '%.dat'\""
      register: export_table_data
    - debug: msg="{{ export_table_data.stdout_lines }}"
    - name: Make sure local directory is empty
      local_action: file path=/data/cassandra/clone state=absent
    - name: Make temp local directory for Cassandra files on controller
      local_action: command mkdir -p /data/cassandra/clone
    - shell: (cd /data/cassandra/clone; find . -maxdepth 1 -type f) | cut -d'/' -f2
      register: files_to_copy
    - fetch: src=/data/cassandra/clone/{{ item }} dest=/data/cassandra/clone/ flat=yes
      with_items: "{{ files_to_copy.stdout_lines }}"
    - name: Remove files and directory from remote host
      command: rm -rf /data/cassandra/clone

- hosts: dest_cassandra_node
  become: true
  tasks:
    - name: Remove files and directory from remote host
      command: rm -rf /data/cassandra/clone
    - name: Make data directory for the clone files
      file:
        path: /data/cassandra/clone
        state: directory
    - name: Move Cassandra files to destination C* node
      copy:
        src: /data/cassandra/clone/
        dest: /data/cassandra/clone
        owner: cassandra
        group: cassandra
        mode: 0755
    - name: Make sure local Ansible controller directory is emptied afterward
      local_action: file path=/data/cassandra/clone state=absent
    - name: Build new keyspace and schema
      command: bash -c "cd /data/cassandra/clone && cqlsh --ssl -f '{{ KEYSPACE_SOURCE_NAME }}.txt'"
    - name: Import data to new keyspace
      command: bash -c "cd /data/cassandra/clone && cqlsh --ssl -e \"use {{ KEYSPACE_DESTINATION_NAME }}; describe tables;\" | tr \" \" \"\\n\" | sort | sed -r '/^\s*$/d' | xargs -I % cqlsh --ssl -e \"COPY {{ KEYSPACE_DESTINATION_NAME }}.% FROM '%.dat'\""
      register: import_table_data
    - debug: msg="{{ import_table_data.stdout_lines }}"
    - name: Repair newly made clone keyspace for good measure
      command: bash -c "nodetool repair -full {{ KEYSPACE_DESTINATION_NAME }}"
    - name: Clean out temp keyspace files and data on remote C* destination
      command: rm -rf /data/cassandra/clone
I will be resuming the new bungle PR testing shortly, tomorrow and into the weekend/Monday!
Much faster execution than my first-go shell script hack, too.
First go at #6439 PR against my cloned prod 2.0.5 keyspace and its data. Made sure I had some oauth2 tokens present and valid in the db this time too.
kong migrations list --vv
2020/10/13 05:12:30 [debug] loading subsystems migrations...
2020/10/13 05:12:30 [verbose] retrieving keyspace schema state...
2020/10/13 05:12:30 [verbose] schema state retrieved
2020/10/13 05:12:30 [info] Executed migrations:
core: 000_base, 003_100_to_110, 004_110_to_120, 005_120_to_130, 006_130_to_140, 007_140_to_150, 008_150_to_200
jwt: 000_base_jwt, 002_130_to_140
acl: 000_base_acl, 002_130_to_140
rate-limiting: 000_base_rate_limiting, 003_10_to_112
oauth2: 000_base_oauth2, 003_130_to_140
2020/10/13 05:12:30 [info]
New migrations available:
core: 009_200_to_210, 010_210_to_211, 011_212_to_213
jwt: 003_200_to_210
acl: 003_200_to_210, 004_212_to_213
rate-limiting: 004_200_to_210
oauth2: 004_200_to_210, 005_210_to_211
2020/10/13 05:12:30 [info]
Run 'kong migrations up' to proceed
kong migrations up --db-timeout 20 --vv
2020/10/13 05:13:55 [debug] loading subsystems migrations...
2020/10/13 05:13:55 [verbose] retrieving keyspace schema state...
2020/10/13 05:13:55 [verbose] schema state retrieved
2020/10/13 05:13:56 [debug] loading subsystems migrations...
2020/10/13 05:13:56 [verbose] retrieving keyspace schema state...
2020/10/13 05:13:56 [verbose] schema state retrieved
2020/10/13 05:13:56 [debug] migrations to run:
core: 009_200_to_210, 010_210_to_211, 011_212_to_213
jwt: 003_200_to_210
acl: 003_200_to_210, 004_212_to_213
rate-limiting: 004_200_to_210
oauth2: 004_200_to_210, 005_210_to_211
2020/10/13 05:13:56 [info] migrating core on keyspace 'kong_prod2'...
2020/10/13 05:13:56 [debug] running migration: 009_200_to_210
2020/10/13 05:14:02 [info] core migrated up to: 009_200_to_210 (pending)
2020/10/13 05:14:02 [debug] running migration: 010_210_to_211
2020/10/13 05:14:02 [info] core migrated up to: 010_210_to_211 (pending)
2020/10/13 05:14:02 [debug] running migration: 011_212_to_213
2020/10/13 05:14:02 [info] core migrated up to: 011_212_to_213 (executed)
2020/10/13 05:14:02 [info] migrating jwt on keyspace 'kong_prod2'...
2020/10/13 05:14:02 [debug] running migration: 003_200_to_210
2020/10/13 05:14:03 [info] jwt migrated up to: 003_200_to_210 (pending)
2020/10/13 05:14:03 [info] migrating acl on keyspace 'kong_prod2'...
2020/10/13 05:14:03 [debug] running migration: 003_200_to_210
2020/10/13 05:14:03 [info] acl migrated up to: 003_200_to_210 (pending)
2020/10/13 05:14:03 [debug] running migration: 004_212_to_213
2020/10/13 05:14:03 [info] acl migrated up to: 004_212_to_213 (pending)
2020/10/13 05:14:03 [info] migrating rate-limiting on keyspace 'kong_prod2'...
2020/10/13 05:14:03 [debug] running migration: 004_200_to_210
2020/10/13 05:14:03 [info] rate-limiting migrated up to: 004_200_to_210 (executed)
2020/10/13 05:14:03 [info] migrating oauth2 on keyspace 'kong_prod2'...
2020/10/13 05:14:03 [debug] running migration: 004_200_to_210
2020/10/13 05:14:05 [info] oauth2 migrated up to: 004_200_to_210 (pending)
2020/10/13 05:14:05 [debug] running migration: 005_210_to_211
2020/10/13 05:14:05 [info] oauth2 migrated up to: 005_210_to_211 (pending)
2020/10/13 05:14:05 [verbose] waiting for Cassandra schema consensus (20000ms timeout)...
2020/10/13 05:14:07 [verbose] Cassandra schema consensus: reached in 2537ms
2020/10/13 05:14:07 [info] 9 migrations processed
2020/10/13 05:14:07 [info] 2 executed
2020/10/13 05:14:07 [info] 7 pending
2020/10/13 05:14:07 [debug] loading subsystems migrations...
2020/10/13 05:14:07 [verbose] retrieving keyspace schema state...
2020/10/13 05:14:08 [verbose] schema state retrieved
2020/10/13 05:14:08 [info]
Database has pending migrations; run 'kong migrations finish' when ready
/ $
First run of finish: kong migrations finish --db-timeout 20 --vv
2020/10/13 05:15:11 [debug] loading subsystems migrations...
2020/10/13 05:15:11 [verbose] retrieving keyspace schema state...
2020/10/13 05:15:11 [verbose] schema state retrieved
2020/10/13 05:15:11 [debug] loading subsystems migrations...
2020/10/13 05:15:11 [verbose] retrieving keyspace schema state...
2020/10/13 05:15:11 [verbose] schema state retrieved
2020/10/13 05:15:11 [debug] pending migrations to finish:
core: 009_200_to_210, 010_210_to_211
jwt: 003_200_to_210
acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/10/13 05:15:12 [info] migrating core on keyspace 'kong_prod2'...
2020/10/13 05:15:12 [debug] running migration: 009_200_to_210
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to run migration '009_200_to_210' teardown: unable to find a default workspace
stack traceback:
[C]: in function 'assert'
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
[C]: in function 'xpcall'
/usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
[C]: in function 'pcall'
/usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
/usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
[C]: in function 'xpcall'
/usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
/usr/bin/kong:9: in function 'file_gen'
init_worker_by_lua:50: in function <init_worker_by_lua:48>
[C]: in function 'xpcall'
init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
[C]: in function 'error'
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
/usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
[C]: in function 'xpcall'
/usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
/usr/bin/kong:9: in function 'file_gen'
init_worker_by_lua:50: in function <init_worker_by_lua:48>
[C]: in function 'xpcall'
init_worker_by_lua:57: in function <init_worker_by_lua:55>
Second run of finish: kong migrations finish --db-timeout 20 --vv
2020/10/13 05:15:53 [debug] loading subsystems migrations...
2020/10/13 05:15:53 [verbose] retrieving keyspace schema state...
2020/10/13 05:15:53 [verbose] schema state retrieved
2020/10/13 05:15:53 [debug] loading subsystems migrations...
2020/10/13 05:15:53 [verbose] retrieving keyspace schema state...
2020/10/13 05:15:54 [verbose] schema state retrieved
2020/10/13 05:15:54 [debug] pending migrations to finish:
core: 009_200_to_210, 010_210_to_211
jwt: 003_200_to_210
acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/10/13 05:15:54 [info] migrating core on keyspace 'kong_prod2'...
2020/10/13 05:15:54 [debug] running migration: 009_200_to_210
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to run migration '009_200_to_210' teardown: unable to find a default workspace
stack traceback:
[C]: in function 'assert'
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
[C]: in function 'xpcall'
/usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
[C]: in function 'pcall'
/usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
/usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
[C]: in function 'xpcall'
/usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
/usr/bin/kong:9: in function 'file_gen'
init_worker_by_lua:50: in function <init_worker_by_lua:48>
[C]: in function 'xpcall'
init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
[C]: in function 'error'
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
/usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
[C]: in function 'xpcall'
/usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
/usr/bin/kong:9: in function 'file_gen'
init_worker_by_lua:50: in function <init_worker_by_lua:48>
[C]: in function 'xpcall'
init_worker_by_lua:57: in function <init_worker_by_lua:55>
Third run of finish: kong migrations finish --db-timeout 20 --vv
2020/10/13 05:16:29 [debug] loading subsystems migrations...
2020/10/13 05:16:29 [verbose] retrieving keyspace schema state...
2020/10/13 05:16:30 [verbose] schema state retrieved
2020/10/13 05:16:30 [debug] loading subsystems migrations...
2020/10/13 05:16:30 [verbose] retrieving keyspace schema state...
2020/10/13 05:16:30 [verbose] schema state retrieved
2020/10/13 05:16:30 [debug] pending migrations to finish:
core: 009_200_to_210, 010_210_to_211
jwt: 003_200_to_210
acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/10/13 05:16:31 [info] migrating core on keyspace 'kong_prod2'...
2020/10/13 05:16:31 [debug] running migration: 009_200_to_210
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to run migration '009_200_to_210' teardown: unable to find a default workspace
stack traceback:
[C]: in function 'assert'
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
[C]: in function 'xpcall'
/usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
[C]: in function 'pcall'
/usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
/usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
[C]: in function 'xpcall'
/usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
/usr/bin/kong:9: in function 'file_gen'
init_worker_by_lua:50: in function <init_worker_by_lua:48>
[C]: in function 'xpcall'
init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
[C]: in function 'error'
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
/usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
[C]: in function 'xpcall'
/usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
/usr/bin/kong:9: in function 'file_gen'
init_worker_by_lua:50: in function <init_worker_by_lua:48>
[C]: in function 'xpcall'
init_worker_by_lua:57: in function <init_worker_by_lua:55>
Fourth run of finish: kong migrations finish --db-timeout 20 --vv
2020/10/13 05:17:11 [debug] loading subsystems migrations...
2020/10/13 05:17:11 [verbose] retrieving keyspace schema state...
2020/10/13 05:17:12 [verbose] schema state retrieved
2020/10/13 05:17:12 [debug] loading subsystems migrations...
2020/10/13 05:17:12 [verbose] retrieving keyspace schema state...
2020/10/13 05:17:12 [verbose] schema state retrieved
2020/10/13 05:17:12 [debug] pending migrations to finish:
core: 009_200_to_210, 010_210_to_211
jwt: 003_200_to_210
acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/10/13 05:17:12 [info] migrating core on keyspace 'kong_prod2'...
2020/10/13 05:17:12 [debug] running migration: 009_200_to_210
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to run migration '009_200_to_210' teardown: unable to find a default workspace
stack traceback:
[C]: in function 'assert'
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
[C]: in function 'xpcall'
/usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
[C]: in function 'pcall'
/usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
/usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
[C]: in function 'xpcall'
/usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
/usr/bin/kong:9: in function 'file_gen'
init_worker_by_lua:50: in function <init_worker_by_lua:48>
[C]: in function 'xpcall'
init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
[C]: in function 'error'
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
/usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
[C]: in function 'xpcall'
/usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
/usr/bin/kong:9: in function 'file_gen'
init_worker_by_lua:50: in function <init_worker_by_lua:48>
[C]: in function 'xpcall'
init_worker_by_lua:57: in function <init_worker_by_lua:55>
Umm @bungle, it's now erroring every time on trying to find a default workspace. Well, let's check the keyspace:
# cqlsh --ssl
Connected to Stargate at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.8 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
*****@cqlsh> use kong_prod2;
*****@cqlsh:kong_prod2> select * from workspaces;
id | comment | config | created_at | meta | name
--------------------------------------+---------+--------+---------------------------------+------+---------
926762fb-7c62-4ddb-8535-5577af0cefac | null | null | 2020-10-13 05:15:11.000000+0000 | null | default
(1 rows)
*****@cqlsh:kong_prod2>
The default workspace was inserted at 05:15:11 by the first migrations finish, I imagine; no idea why the rest of the processing and the subsequent finish commands are not seeing it.
Did a diff on the 009_200_to_210 core migration file just to make sure your patch file was indeed in there and identical, and it was. It seems whatever logic was dropped in this smaller PR either made it less effective or introduced a new bug preventing the injected workspace from being found.
Will hold off on the other pending branch you have until you tell me to test it, in case you feel #6439 needs more attention first. If you want me to go ahead with your other experimental branch as well I can do so; let me know how you prefer to proceed.
Edit - Looks like we land here: https://github.com/Kong/kong/blob/0d64a67d34d40e0facd66b015c46f8a6ba994f71/kong/db/migrations/core/009_200_to_210.lua#L251 . I wonder why the prior local default_ws, err = operations.cassandra_ensure_default_ws(coordinator) isn't returning an err; it just has default_ws as nil, I suppose... when I am able to query and see the workspace just fine in the C* db itself on all nodes in that datacenter it could possibly be using.
@jeremyjpj0916, the query is here: https://github.com/Kong/kong/blob/0d64a67d34d40e0facd66b015c46f8a6ba994f71/kong/db/migrations/operations/200_to_210.lua#L24
I like that we got a new problem, as that might make us find the real issue. It seems like the query did not return rows. It also seems like that query does not return rows even when you start the migration again.
Is this query wrong:
SELECT id FROM workspaces WHERE name='default'
Can you execute it in cqlsh?
@bungle, Yep, works like a charm, so I guess the question is why the CQL statement does not work in the context of Kong for me, even across multiple runs of migrations finish. And what is different about your first big PR that got scratched, which did make the migrations seemingly work and get past this step, versus what we have now in PR 6439 that gets stuck on it:
***@cqlsh:kong_prod2> select * from workspaces;
id | comment | config | created_at | meta | name
--------------------------------------+---------+--------+---------------------------------+------+---------
926762fb-7c62-4ddb-8535-5577af0cefac | null | null | 2020-10-13 05:15:11.000000+0000 | null | default
(1 rows)
***@cqlsh:kong_prod2> SELECT id FROM workspaces WHERE name='default';
id
--------------------------------------
926762fb-7c62-4ddb-8535-5577af0cefac
(1 rows)
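If it helps narrow things down, this is the kind of thing I could try patching in locally: forcing the read to an explicit LOCAL_QUORUM instead of whatever the driver defaults to (just a sketch on my end; find_default_ws is my name, not Kong's, and I'm assuming the coordinator accepts lua-cassandra's per-query consistency option):
local cassandra = require "cassandra"

-- hypothetical helper (not Kong's code): read the default workspace with an
-- explicit consistency level instead of relying on the driver default
local function find_default_ws(coordinator)
  local rows, err = coordinator:execute(
    "SELECT id FROM workspaces WHERE name = 'default'",
    nil,
    { consistency = cassandra.consistencies.local_quorum })
  if err then
    return nil, err
  end
  if rows and rows[1] then
    return rows[1].id
  end
  return nil, "default workspace not found"
end

return find_default_ws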
Summary
Thought we were out of the woods here. But I just tested a 2.0.5 to 2.1.4 DB upgrade with our prod keyspace cloned (decided dry runs of this migrations stuff are best done with real keyspace data too) and things went sideways again. Captured logs of each migrations execution, from the up through having to run finish multiple times to reach a conclusion, detailed below:
First the Kong migrations up:
So far so good, right? Well, it's after the up that the finish starts to see problems (removed the env variable dump from the verbose output for brevity):
Second run of finish to try again:
Third Finish....:
4th Finish...... :
Lastly confirmed via list and finish again that there was nothing left to do:
Looks like re-entrancy is definitely still a problem with the workspaces table, likely correlating with how many times I re-executed the finish command trying to get through this upgrade, since finish failed a few times in a row and seemed to progress one table at a time with each failure:
I also find it interesting that I set the C* schema consensus timeout long, yet it bounces back instantly saying consensus was not reached. Ex:
cc @bungle @locao @hishamhm
Edit:
Yeah I see some tables like oauth2_credentials using ws_id: 38321a07-1d68-4aa8-b01f-8d64d5bf665a and jwt_secrets using ws_id: 8f1d01e1-0609-4bda-a021-a01e619d1590
Additional Details & Logs