Kong / kong


Kong 2.0.5 -> 2.1.4 Migrations: C* issues still possible. #6403

Closed. jeremyjpj0916 closed this issue 3 years ago.

jeremyjpj0916 commented 4 years ago

Summary

Thought we were out of the woods here, but I just tested a 2.0.5 to 2.1.4 DB migration with our prod keyspace cloned (I decided dry runs of this migration stuff are best done with real keyspace data too) and things went sideways again. I captured logs of each migration execution, from the up through having to run finish multiple times to reach a conclusion, detailed below:

First the Kong migrations up:

/ $ kong migrations up --db-timeout 120 --vv
2020/09/28 17:17:57 [verbose] Kong: 2.1.4
2020/09/28 17:17:57 [debug] ngx_lua: 10015
2020/09/28 17:17:57 [debug] nginx: 1015008
2020/09/28 17:17:57 [debug] Lua: LuaJIT 2.1.0-beta3
2020/09/28 17:17:57 [verbose] no config file found at /etc/kong/kong.conf
2020/09/28 17:17:57 [verbose] no config file found at /etc/kong.conf
2020/09/28 17:17:57 [verbose] no config file, skip loading
2020/09/28 17:17:57 [debug] reading environment variables
2020/09/28 17:17:57 [debug] KONG_PLUGINS ENV found with "kong-siteminder-auth,kong-kafka-log,stargate-waf-error-log,mtls,stargate-oidc-token-revoke,kong-tx-debugger,kong-plugin-oauth,zipkin,kong-error-log,kong-oidc-implicit-token,kong-response-size-limiting,request-transformer,kong-service-virtualization,kong-cluster-drain,kong-upstream-jwt,kong-splunk-log,kong-spec-expose,kong-path-based-routing,kong-oidc-multi-idp,correlation-id,oauth2,statsd,jwt,rate-limiting,acl,request-size-limiting,request-termination,cors"
2020/09/28 17:17:57 [debug] KONG_ADMIN_LISTEN ENV found with "0.0.0.0:8001 deferred reuseport"
2020/09/28 17:17:57 [debug] KONG_PROXY_ACCESS_LOG ENV found with "off"
2020/09/28 17:17:57 [debug] KONG_CASSANDRA_USERNAME ENV found with "kongdba"
2020/09/28 17:17:57 [debug] KONG_CASSANDRA_PASSWORD ENV found with "******"
2020/09/28 17:17:57 [debug] KONG_PROXY_LISTEN ENV found with "0.0.0.0:8000, 0.0.0.0:8443 ssl http2 deferred reuseport"
2020/09/28 17:17:57 [debug] KONG_DNS_NO_SYNC ENV found with "off"
2020/09/28 17:17:57 [debug] KONG_DB_UPDATE_PROPAGATION ENV found with "5"
2020/09/28 17:17:57 [debug] KONG_CASSANDRA_PORT ENV found with "9042"
2020/09/28 17:17:57 [debug] KONG_HEADERS ENV found with "latency_tokens"
2020/09/28 17:17:57 [debug] KONG_DNS_STALE_TTL ENV found with "4"
2020/09/28 17:17:57 [debug] KONG_WAF_DEBUG_LEVEL ENV found with "0"
2020/09/28 17:17:57 [debug] KONG_WAF_PARANOIA_LEVEL ENV found with "1"
2020/09/28 17:17:57 [debug] KONG_CASSANDRA_REFRESH_FREQUENCY ENV found with "0"
2020/09/28 17:17:57 [debug] KONG_CASSANDRA_CONTACT_POINTS ENV found with "server05503,server05505,server05507,server05502,server05504,server05506"
2020/09/28 17:17:57 [debug] KONG_DB_CACHE_WARMUP_ENTITIES ENV found with "services,plugins,consumers"
2020/09/28 17:17:57 [debug] KONG_NGINX_HTTP_SSL_PROTOCOLS ENV found with "TLSv1.2 TLSv1.3"
2020/09/28 17:17:57 [debug] KONG_CASSANDRA_LOCAL_DATACENTER ENV found with "DC1"
2020/09/28 17:17:57 [debug] KONG_UPSTREAM_KEEPALIVE_IDLE_TIMEOUT ENV found with "30"
2020/09/28 17:17:57 [debug] KONG_DB_CACHE_TTL ENV found with "0"
2020/09/28 17:17:57 [debug] KONG_PG_SSL ENV found with "off"
2020/09/28 17:17:57 [debug] KONG_WAF_REQUEST_NO_FILE_SIZE_LIMIT ENV found with "50000000"
2020/09/28 17:17:57 [debug] KONG_WAF_PCRE_MATCH_LIMIT_RECURSION ENV found with "10000"
2020/09/28 17:17:57 [debug] KONG_CASSANDRA_SCHEMA_CONSENSUS_TIMEOUT ENV found with "30000"
2020/09/28 17:17:57 [debug] KONG_LOG_LEVEL ENV found with "notice"
2020/09/28 17:17:57 [debug] KONG_CASSANDRA_TIMEOUT ENV found with "5000"
2020/09/28 17:17:57 [debug] KONG_NGINX_MAIN_WORKER_PROCESSES ENV found with "6"
2020/09/28 17:17:57 [debug] KONG_CASSANDRA_KEYSPACE ENV found with "kong_prod2"
2020/09/28 17:17:57 [debug] KONG_WAF ENV found with "off"
2020/09/28 17:17:57 [debug] KONG_ERROR_DEFAULT_TYPE ENV found with "text/plain"
2020/09/28 17:17:57 [debug] KONG_UPSTREAM_KEEPALIVE_POOL_SIZE ENV found with "400"
2020/09/28 17:17:57 [debug] KONG_WORKER_CONSISTENCY ENV found with "eventual"
2020/09/28 17:17:57 [debug] KONG_CLIENT_SSL ENV found with "off"
2020/09/28 17:17:57 [debug] KONG_TRUSTED_IPS ENV found with "0.0.0.0/0,::/0"
2020/09/28 17:17:57 [debug] KONG_SSL_CERT_KEY ENV found with "/usr/local/kong/ssl/kongprivatekey.key"
2020/09/28 17:17:57 [debug] KONG_MEM_CACHE_SIZE ENV found with "1024m"
2020/09/28 17:17:57 [debug] KONG_NGINX_PROXY_REAL_IP_HEADER ENV found with "X-Forwarded-For"
2020/09/28 17:17:57 [debug] KONG_DB_UPDATE_FREQUENCY ENV found with "5"
2020/09/28 17:17:57 [debug] KONG_DNS_ORDER ENV found with "LAST,SRV,A,CNAME"
2020/09/28 17:17:57 [debug] KONG_DNS_ERROR_TTL ENV found with "1"
2020/09/28 17:17:57 [debug] KONG_DATABASE ENV found with "cassandra"
2020/09/28 17:17:57 [debug] KONG_CASSANDRA_DATA_CENTERS ENV found with "DC1:3,DC2:3"
2020/09/28 17:17:57 [debug] KONG_WORKER_STATE_UPDATE_FREQUENCY ENV found with "5"
2020/09/28 17:17:57 [debug] KONG_LUA_SSL_VERIFY_DEPTH ENV found with "3"
2020/09/28 17:17:57 [debug] KONG_LUA_SOCKET_POOL_SIZE ENV found with "30"
2020/09/28 17:17:57 [debug] KONG_UPSTREAM_KEEPALIVE_MAX_REQUESTS ENV found with "50000"
2020/09/28 17:17:57 [debug] KONG_CASSANDRA_CONSISTENCY ENV found with "LOCAL_QUORUM"
2020/09/28 17:17:57 [debug] KONG_CLIENT_MAX_BODY_SIZE ENV found with "50m"
2020/09/28 17:17:57 [debug] KONG_ADMIN_ERROR_LOG ENV found with "/dev/stderr"
2020/09/28 17:17:57 [debug] KONG_DNS_NOT_FOUND_TTL ENV found with "30"
2020/09/28 17:17:57 [debug] KONG_PROXY_ERROR_LOG ENV found with "/dev/stderr"
2020/09/28 17:17:57 [debug] KONG_CASSANDRA_REPL_STRATEGY ENV found with "NetworkTopologyStrategy"
2020/09/28 17:17:57 [debug] KONG_CASSANDRA_SSL_VERIFY ENV found with "on"
2020/09/28 17:17:57 [debug] KONG_ADMIN_ACCESS_LOG ENV found with "off"
2020/09/28 17:17:57 [debug] KONG_DNS_HOSTSFILE ENV found with "/etc/hosts"
2020/09/28 17:17:57 [debug] KONG_WAF_REQUEST_FILE_SIZE_LIMIT ENV found with "50000000"
2020/09/28 17:17:57 [debug] KONG_SSL_CERT ENV found with "/usr/local/kong/ssl/kongcert.crt"
2020/09/28 17:17:57 [debug] KONG_NGINX_PROXY_REAL_IP_RECURSIVE ENV found with "on"
2020/09/28 17:17:57 [debug] KONG_SSL_CIPHER_SUITE ENV found with "intermediate"
2020/09/28 17:17:57 [debug] KONG_CASSANDRA_SSL ENV found with "on"
2020/09/28 17:17:57 [debug] KONG_ANONYMOUS_REPORTS ENV found with "off"
2020/09/28 17:17:57 [debug] KONG_WAF_MODE ENV found with "On"
2020/09/28 17:17:57 [debug] KONG_CLIENT_BODY_BUFFER_SIZE ENV found with "50m"
2020/09/28 17:17:57 [debug] KONG_WAF_PCRE_MATCH_LIMIT ENV found with "10000"
2020/09/28 17:17:57 [debug] KONG_LUA_SSL_TRUSTED_CERTIFICATE ENV found with "/usr/local/kong/ssl/kongcert.pem"
2020/09/28 17:17:57 [debug] KONG_CASSANDRA_LB_POLICY ENV found with "RequestDCAwareRoundRobin"
2020/09/28 17:17:57 [debug] KONG_WAF_AUDIT ENV found with "RelevantOnly"
2020/09/28 17:17:57 [debug] admin_access_log = "off"
2020/09/28 17:17:57 [debug] admin_error_log = "/dev/stderr"
2020/09/28 17:17:57 [debug] admin_listen = {"0.0.0.0:8001 deferred reuseport"}
2020/09/28 17:17:57 [debug] anonymous_reports = false
2020/09/28 17:17:57 [debug] cassandra_consistency = "LOCAL_QUORUM"
2020/09/28 17:17:57 [debug] cassandra_contact_points = {"apsrp05503","apsrp05505","apsrp05507","apsrp05502","apsrp05504","apsrp05506"}
2020/09/28 17:17:57 [debug] cassandra_data_centers = {"DC1:3","DC2:3"}
2020/09/28 17:17:57 [debug] cassandra_keyspace = "kong_prod2"
2020/09/28 17:17:57 [debug] cassandra_lb_policy = "RequestDCAwareRoundRobin"
2020/09/28 17:17:57 [debug] cassandra_local_datacenter = "DC1"
2020/09/28 17:17:57 [debug] cassandra_password = "******"
2020/09/28 17:17:57 [debug] cassandra_port = 9042
2020/09/28 17:17:57 [debug] cassandra_read_consistency = "LOCAL_QUORUM"
2020/09/28 17:17:57 [debug] cassandra_refresh_frequency = 0
2020/09/28 17:17:57 [debug] cassandra_repl_factor = 1
2020/09/28 17:17:57 [debug] cassandra_repl_strategy = "NetworkTopologyStrategy"
2020/09/28 17:17:57 [debug] cassandra_schema_consensus_timeout = 30000
2020/09/28 17:17:57 [debug] cassandra_ssl = true
2020/09/28 17:17:57 [debug] cassandra_ssl_verify = true
2020/09/28 17:17:57 [debug] cassandra_timeout = 5000
2020/09/28 17:17:57 [debug] cassandra_username = "kongdba"
2020/09/28 17:17:57 [debug] cassandra_write_consistency = "LOCAL_QUORUM"
2020/09/28 17:17:57 [debug] client_body_buffer_size = "50m"
2020/09/28 17:17:57 [debug] client_max_body_size = "50m"
2020/09/28 17:17:57 [debug] client_ssl = false
2020/09/28 17:17:57 [debug] cluster_control_plane = "127.0.0.1:8005"
2020/09/28 17:17:57 [debug] cluster_listen = {"0.0.0.0:8005"}
2020/09/28 17:17:57 [debug] cluster_mtls = "shared"
2020/09/28 17:17:57 [debug] database = "cassandra"
2020/09/28 17:17:57 [debug] db_cache_ttl = 0
2020/09/28 17:17:57 [debug] db_cache_warmup_entities = {"services","plugins","consumers"}
2020/09/28 17:17:57 [debug] db_resurrect_ttl = 30
2020/09/28 17:17:57 [debug] db_update_frequency = 5
2020/09/28 17:17:57 [debug] db_update_propagation = 5
2020/09/28 17:17:57 [debug] dns_error_ttl = 1
2020/09/28 17:17:57 [debug] dns_hostsfile = "/etc/hosts"
2020/09/28 17:17:57 [debug] dns_no_sync = false
2020/09/28 17:17:57 [debug] dns_not_found_ttl = 30
2020/09/28 17:17:57 [debug] dns_order = {"LAST","SRV","A","CNAME"}
2020/09/28 17:17:57 [debug] dns_resolver = {}
2020/09/28 17:17:57 [debug] dns_stale_ttl = 4
2020/09/28 17:17:57 [debug] error_default_type = "text/plain"
2020/09/28 17:17:57 [debug] go_plugins_dir = "off"
2020/09/28 17:17:57 [debug] go_pluginserver_exe = "/usr/local/bin/go-pluginserver"
2020/09/28 17:17:57 [debug] headers = {"latency_tokens"}
2020/09/28 17:17:57 [debug] host_ports = {}
2020/09/28 17:17:57 [debug] kic = false
2020/09/28 17:17:57 [debug] log_level = "notice"
2020/09/28 17:17:57 [debug] lua_package_cpath = ""
2020/09/28 17:17:57 [debug] lua_package_path = "./?.lua;./?/init.lua;"
2020/09/28 17:17:57 [debug] lua_socket_pool_size = 30
2020/09/28 17:17:57 [debug] lua_ssl_trusted_certificate = "/usr/local/kong/ssl/kongcert.pem"
2020/09/28 17:17:57 [debug] lua_ssl_verify_depth = 3
2020/09/28 17:17:57 [debug] mem_cache_size = "1024m"
2020/09/28 17:17:57 [debug] nginx_admin_directives = {}
2020/09/28 17:17:57 [debug] nginx_daemon = "on"
2020/09/28 17:17:57 [debug] nginx_events_directives = {{name="multi_accept",value="on"},{name="worker_connections",value="auto"}}
2020/09/28 17:17:57 [debug] nginx_events_multi_accept = "on"
2020/09/28 17:17:57 [debug] nginx_events_worker_connections = "auto"
2020/09/28 17:17:57 [debug] nginx_http_client_body_buffer_size = "50m"
2020/09/28 17:17:57 [debug] nginx_http_client_max_body_size = "50m"
2020/09/28 17:17:57 [debug] nginx_http_directives = {{name="client_max_body_size",value="50m"},{name="ssl_prefer_server_ciphers",value="off"},{name="client_body_buffer_size",value="50m"},{name="ssl_protocols",value="TLSv1.2 TLSv1.3"},{name="ssl_session_timeout",value="1d"},{name="ssl_session_tickets",value="on"}}
2020/09/28 17:17:57 [debug] nginx_http_ssl_prefer_server_ciphers = "off"
2020/09/28 17:17:57 [debug] nginx_http_ssl_protocols = "TLSv1.2 TLSv1.3"
2020/09/28 17:17:57 [debug] nginx_http_ssl_session_tickets = "on"
2020/09/28 17:17:57 [debug] nginx_http_ssl_session_timeout = "1d"
2020/09/28 17:17:57 [debug] nginx_http_status_directives = {}
2020/09/28 17:17:57 [debug] nginx_http_upstream_directives = {{name="keepalive_requests",value="100"},{name="keepalive_timeout",value="60s"},{name="keepalive",value="60"}}
2020/09/28 17:17:57 [debug] nginx_http_upstream_keepalive = "60"
2020/09/28 17:17:57 [debug] nginx_http_upstream_keepalive_requests = "100"
2020/09/28 17:17:57 [debug] nginx_http_upstream_keepalive_timeout = "60s"
2020/09/28 17:17:57 [debug] nginx_main_daemon = "on"
2020/09/28 17:17:57 [debug] nginx_main_directives = {{name="daemon",value="on"},{name="worker_rlimit_nofile",value="auto"},{name="worker_processes",value="6"}}
2020/09/28 17:17:57 [debug] nginx_main_worker_processes = "6"
2020/09/28 17:17:57 [debug] nginx_main_worker_rlimit_nofile = "auto"
2020/09/28 17:17:57 [debug] nginx_optimizations = true
2020/09/28 17:17:57 [debug] nginx_proxy_directives = {{name="real_ip_header",value="X-Forwarded-For"},{name="real_ip_recursive",value="on"}}
2020/09/28 17:17:57 [debug] nginx_proxy_real_ip_header = "X-Forwarded-For"
2020/09/28 17:17:57 [debug] nginx_proxy_real_ip_recursive = "on"
2020/09/28 17:17:57 [debug] nginx_sproxy_directives = {}
2020/09/28 17:17:57 [debug] nginx_status_directives = {}
2020/09/28 17:17:57 [debug] nginx_stream_directives = {{name="ssl_session_tickets",value="on"},{name="ssl_protocols",value="TLSv1.2 TLSv1.3"},{name="ssl_session_timeout",value="1d"},{name="ssl_prefer_server_ciphers",value="off"}}
2020/09/28 17:17:57 [debug] nginx_stream_ssl_prefer_server_ciphers = "off"
2020/09/28 17:17:57 [debug] nginx_stream_ssl_protocols = "TLSv1.2 TLSv1.3"
2020/09/28 17:17:57 [debug] nginx_stream_ssl_session_tickets = "on"
2020/09/28 17:17:57 [debug] nginx_stream_ssl_session_timeout = "1d"
2020/09/28 17:17:57 [debug] nginx_supstream_directives = {}
2020/09/28 17:17:57 [debug] nginx_upstream_directives = {{name="keepalive_requests",value="100"},{name="keepalive_timeout",value="60s"},{name="keepalive",value="60"}}
2020/09/28 17:17:57 [debug] nginx_upstream_keepalive = "60"
2020/09/28 17:17:57 [debug] nginx_upstream_keepalive_requests = "100"
2020/09/28 17:17:57 [debug] nginx_upstream_keepalive_timeout = "60s"
2020/09/28 17:17:57 [debug] nginx_worker_processes = "auto"
2020/09/28 17:17:57 [debug] pg_database = "kong"
2020/09/28 17:17:57 [debug] pg_host = "127.0.0.1"
2020/09/28 17:17:57 [debug] pg_max_concurrent_queries = 0
2020/09/28 17:17:57 [debug] pg_port = 5432
2020/09/28 17:17:57 [debug] pg_ro_ssl = false
2020/09/28 17:17:57 [debug] pg_ro_ssl_verify = false
2020/09/28 17:17:57 [debug] pg_semaphore_timeout = 60000
2020/09/28 17:17:57 [debug] pg_ssl = false
2020/09/28 17:17:57 [debug] pg_ssl_verify = false
2020/09/28 17:17:57 [debug] pg_timeout = 5000
2020/09/28 17:17:57 [debug] pg_user = "kong"
2020/09/28 17:17:57 [debug] plugins = {"kong-siteminder-auth","kong-kafka-log","stargate-waf-error-log","mtls","stargate-oidc-token-revoke","kong-tx-debugger","kong-plugin-oauth","zipkin","kong-error-log","kong-oidc-implicit-token","kong-response-size-limiting","request-transformer","kong-service-virtualization","kong-cluster-drain","kong-upstream-jwt","kong-splunk-log","kong-spec-expose","kong-path-based-routing","kong-oidc-multi-idp","correlation-id","oauth2","statsd","jwt","rate-limiting","acl","request-size-limiting","request-termination","cors"}
2020/09/28 17:17:57 [debug] port_maps = {}
2020/09/28 17:17:57 [debug] prefix = "/usr/local/kong/"
2020/09/28 17:17:57 [debug] proxy_access_log = "off"
2020/09/28 17:17:57 [debug] proxy_error_log = "/dev/stderr"
2020/09/28 17:17:57 [debug] proxy_listen = {"0.0.0.0:8000","0.0.0.0:8443 ssl http2 deferred reuseport"}
2020/09/28 17:17:57 [debug] real_ip_header = "X-Real-IP"
2020/09/28 17:17:57 [debug] real_ip_recursive = "off"
2020/09/28 17:17:57 [debug] role = "traditional"
2020/09/28 17:17:57 [debug] ssl_cert = "/usr/local/kong/ssl/kongcert.crt"
2020/09/28 17:17:57 [debug] ssl_cert_key = "/usr/local/kong/ssl/kongprivatekey.key"
2020/09/28 17:17:57 [debug] ssl_cipher_suite = "intermediate"
2020/09/28 17:17:57 [debug] ssl_ciphers = "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384"
2020/09/28 17:17:57 [debug] ssl_prefer_server_ciphers = "on"
2020/09/28 17:17:57 [debug] ssl_protocols = "TLSv1.1 TLSv1.2 TLSv1.3"
2020/09/28 17:17:57 [debug] ssl_session_tickets = "on"
2020/09/28 17:17:57 [debug] ssl_session_timeout = "1d"
2020/09/28 17:17:57 [debug] status_access_log = "off"
2020/09/28 17:17:57 [debug] status_error_log = "logs/status_error.log"
2020/09/28 17:17:57 [debug] status_listen = {"off"}
2020/09/28 17:17:57 [debug] stream_listen = {"off"}
2020/09/28 17:17:57 [debug] trusted_ips = {"0.0.0.0/0","::/0"}
2020/09/28 17:17:57 [debug] upstream_keepalive = 60
2020/09/28 17:17:57 [debug] upstream_keepalive_idle_timeout = 30
2020/09/28 17:17:57 [debug] upstream_keepalive_max_requests = 50000
2020/09/28 17:17:57 [debug] upstream_keepalive_pool_size = 400
2020/09/28 17:17:57 [debug] waf = "off"
2020/09/28 17:17:57 [debug] waf_audit = "RelevantOnly"
2020/09/28 17:17:57 [debug] waf_debug_level = "0"
2020/09/28 17:17:57 [debug] waf_mode = "On"
2020/09/28 17:17:57 [debug] waf_paranoia_level = "1"
2020/09/28 17:17:57 [debug] waf_pcre_match_limit = "10000"
2020/09/28 17:17:57 [debug] waf_pcre_match_limit_recursion = "10000"
2020/09/28 17:17:57 [debug] waf_request_file_size_limit = "50000000"
2020/09/28 17:17:57 [debug] waf_request_no_file_size_limit = "50000000"
2020/09/28 17:17:57 [debug] worker_consistency = "eventual"
2020/09/28 17:17:57 [debug] worker_state_update_frequency = 5
2020/09/28 17:17:57 [verbose] prefix in use: /usr/local/kong
2020/09/28 17:17:57 [debug] resolved Cassandra contact point 'server05503' to: 10.204.90.234
2020/09/28 17:17:57 [debug] resolved Cassandra contact point 'server05505' to: 10.204.90.235
2020/09/28 17:17:57 [debug] resolved Cassandra contact point 'server05507' to: 10.86.173.32
2020/09/28 17:17:57 [debug] resolved Cassandra contact point 'server05502' to: 10.106.184.179
2020/09/28 17:17:57 [debug] resolved Cassandra contact point 'server05504' to: 10.106.184.193
2020/09/28 17:17:57 [debug] resolved Cassandra contact point 'server05506' to: 10.87.52.252
2020/09/28 17:17:58 [debug] loading subsystems migrations...
2020/09/28 17:17:58 [verbose] retrieving keyspace schema state...
2020/09/28 17:17:58 [verbose] schema state retrieved
2020/09/28 17:17:58 [debug] loading subsystems migrations...
2020/09/28 17:17:58 [verbose] retrieving keyspace schema state...
2020/09/28 17:17:58 [verbose] schema state retrieved
2020/09/28 17:17:58 [debug] migrations to run:
         core: 009_200_to_210, 010_210_to_211, 011_212_to_213
          jwt: 003_200_to_210
          acl: 003_200_to_210, 004_212_to_213
rate-limiting: 004_200_to_210
       oauth2: 004_200_to_210, 005_210_to_211
2020/09/28 17:17:58 [info] migrating core on keyspace 'kong_prod2'...
2020/09/28 17:17:58 [debug] running migration: 009_200_to_210
2020/09/28 17:18:02 [info] core migrated up to: 009_200_to_210 (pending)
2020/09/28 17:18:02 [debug] running migration: 010_210_to_211
2020/09/28 17:18:02 [info] core migrated up to: 010_210_to_211 (pending)
2020/09/28 17:18:02 [debug] running migration: 011_212_to_213
2020/09/28 17:18:02 [info] core migrated up to: 011_212_to_213 (executed)
2020/09/28 17:18:02 [info] migrating jwt on keyspace 'kong_prod2'...
2020/09/28 17:18:02 [debug] running migration: 003_200_to_210
2020/09/28 17:18:02 [info] jwt migrated up to: 003_200_to_210 (pending)
2020/09/28 17:18:02 [info] migrating acl on keyspace 'kong_prod2'...
2020/09/28 17:18:02 [debug] running migration: 003_200_to_210
2020/09/28 17:18:02 [info] acl migrated up to: 003_200_to_210 (pending)
2020/09/28 17:18:02 [debug] running migration: 004_212_to_213
2020/09/28 17:18:02 [info] acl migrated up to: 004_212_to_213 (pending)
2020/09/28 17:18:02 [info] migrating rate-limiting on keyspace 'kong_prod2'...
2020/09/28 17:18:02 [debug] running migration: 004_200_to_210
2020/09/28 17:18:02 [info] rate-limiting migrated up to: 004_200_to_210 (executed)
2020/09/28 17:18:02 [info] migrating oauth2 on keyspace 'kong_prod2'...
2020/09/28 17:18:02 [debug] running migration: 004_200_to_210
2020/09/28 17:18:03 [info] oauth2 migrated up to: 004_200_to_210 (pending)
2020/09/28 17:18:03 [debug] running migration: 005_210_to_211
2020/09/28 17:18:03 [info] oauth2 migrated up to: 005_210_to_211 (pending)
2020/09/28 17:18:03 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/09/28 17:18:09 [verbose] Cassandra schema consensus: reached
2020/09/28 17:18:09 [info] 9 migrations processed
2020/09/28 17:18:09 [info] 2 executed
2020/09/28 17:18:09 [info] 7 pending
2020/09/28 17:18:09 [debug] loading subsystems migrations...
2020/09/28 17:18:09 [verbose] retrieving keyspace schema state...
2020/09/28 17:18:09 [verbose] schema state retrieved
2020/09/28 17:18:09 [info]
Database has pending migrations; run 'kong migrations finish' when ready

So far so good, right? Well, it's after the up that the finish starts to see problems (I removed the env variable dump from the verbose output for brevity):

/ $ kong migrations finish --db-timeout 120 --vv
2020/09/28 17:19:57 [debug] loading subsystems migrations...
2020/09/28 17:19:57 [verbose] retrieving keyspace schema state...
2020/09/28 17:19:57 [verbose] schema state retrieved
2020/09/28 17:19:57 [debug] loading subsystems migrations...
2020/09/28 17:19:57 [verbose] retrieving keyspace schema state...
2020/09/28 17:19:58 [verbose] schema state retrieved
2020/09/28 17:19:58 [debug] pending migrations to finish:
  core: 009_200_to_210, 010_210_to_211
   jwt: 003_200_to_210
   acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/09/28 17:19:58 [info] migrating core on keyspace 'kong_prod2'...
2020/09/28 17:19:58 [debug] running migration: 009_200_to_210
2020/09/28 17:20:12 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/09/28 17:20:12 [verbose] Cassandra schema consensus: reached
2020/09/28 17:20:12 [info] core migrated up to: 009_200_to_210 (executed)
2020/09/28 17:20:12 [debug] running migration: 010_210_to_211
2020/09/28 17:20:12 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/09/28 17:20:12 [verbose] Cassandra schema consensus: reached
2020/09/28 17:20:12 [info] core migrated up to: 010_210_to_211 (executed)
2020/09/28 17:20:12 [info] migrating jwt on keyspace 'kong_prod2'...
2020/09/28 17:20:12 [debug] running migration: 003_200_to_210
2020/09/28 17:20:14 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/09/28 17:20:14 [verbose] Cassandra schema consensus: not reached
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to wait for schema consensus: [Read failure] Operation failed - received 0 responses and 1 failures
stack traceback:
        [C]: in function 'assert'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
        [C]: in function 'pcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
        [C]: in function 'error'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>

Second run of finish to try again:

2020/09/28 17:22:14 [debug] loading subsystems migrations...
2020/09/28 17:22:14 [verbose] retrieving keyspace schema state...
2020/09/28 17:22:14 [verbose] schema state retrieved
2020/09/28 17:22:15 [debug] loading subsystems migrations...
2020/09/28 17:22:15 [verbose] retrieving keyspace schema state...
2020/09/28 17:22:15 [verbose] schema state retrieved
2020/09/28 17:22:15 [debug] pending migrations to finish:
   acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/09/28 17:22:15 [info] migrating acl on keyspace 'kong_prod2'...
2020/09/28 17:22:15 [debug] running migration: 003_200_to_210
2020/09/28 17:22:28 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/09/28 17:22:28 [verbose] Cassandra schema consensus: not reached
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to wait for schema consensus: [Read failure] Operation failed - received 0 responses and 1 failures
stack traceback:
        [C]: in function 'assert'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
        [C]: in function 'pcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
        [C]: in function 'error'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>

Third Finish....:

2020/09/28 17:23:28 [verbose] schema state retrieved
2020/09/28 17:23:28 [debug] loading subsystems migrations...
2020/09/28 17:23:28 [verbose] retrieving keyspace schema state...
2020/09/28 17:23:29 [verbose] schema state retrieved
2020/09/28 17:23:29 [debug] pending migrations to finish:
   acl: 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/09/28 17:23:29 [info] migrating acl on keyspace 'kong_prod2'...
2020/09/28 17:23:29 [debug] running migration: 004_212_to_213
2020/09/28 17:23:29 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/09/28 17:23:29 [verbose] Cassandra schema consensus: not reached
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to wait for schema consensus: [Read failure] Operation failed - received 0 responses and 1 failures
stack traceback:
        [C]: in function 'assert'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
        [C]: in function 'pcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
        [C]: in function 'error'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>

4th Finish...... :

2020/09/28 17:24:15 [debug] loading subsystems migrations...
2020/09/28 17:24:15 [verbose] retrieving keyspace schema state...
2020/09/28 17:24:15 [verbose] schema state retrieved
2020/09/28 17:24:15 [debug] loading subsystems migrations...
2020/09/28 17:24:15 [verbose] retrieving keyspace schema state...
2020/09/28 17:24:16 [verbose] schema state retrieved
2020/09/28 17:24:16 [debug] pending migrations to finish:
oauth2: 004_200_to_210, 005_210_to_211
2020/09/28 17:24:16 [info] migrating oauth2 on keyspace 'kong_prod2'...
2020/09/28 17:24:16 [debug] running migration: 004_200_to_210
2020/09/28 17:24:18 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/09/28 17:24:18 [verbose] Cassandra schema consensus: reached
2020/09/28 17:24:18 [info] oauth2 migrated up to: 004_200_to_210 (executed)
2020/09/28 17:24:18 [debug] running migration: 005_210_to_211
2020/09/28 17:24:18 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/09/28 17:24:18 [verbose] Cassandra schema consensus: reached
2020/09/28 17:24:18 [info] oauth2 migrated up to: 005_210_to_211 (executed)
2020/09/28 17:24:18 [info] 2 migrations processed
2020/09/28 17:24:18 [info] 2 executed
2020/09/28 17:24:18 [debug] loading subsystems migrations...
2020/09/28 17:24:18 [verbose] retrieving keyspace schema state...
2020/09/28 17:24:18 [verbose] schema state retrieved
2020/09/28 17:24:18 [info] No pending migrations to finish

Lastly, I confirmed via list and another finish run that there was nothing left to do:

2020/09/28 17:25:35 [debug] loading subsystems migrations...
2020/09/28 17:25:35 [verbose] retrieving keyspace schema state...
2020/09/28 17:25:35 [verbose] schema state retrieved
2020/09/28 17:25:35 [info] Executed migrations:
         core: 000_base, 003_100_to_110, 004_110_to_120, 005_120_to_130, 006_130_to_140, 007_140_to_150, 008_150_to_200, 009_200_to_210, 010_210_to_211, 011_212_to_213
          jwt: 000_base_jwt, 002_130_to_140, 003_200_to_210
          acl: 000_base_acl, 002_130_to_140, 003_200_to_210, 004_212_to_213
rate-limiting: 000_base_rate_limiting, 003_10_to_112, 004_200_to_210
       oauth2: 000_base_oauth2, 003_130_to_140, 004_200_to_210, 005_210_to_211

Looks like re-entrancy is still a problem with the workspace tables. The number of distinct ws_ids likely correlates with how many times I re-executed the finish command trying to get through this upgrade, since finish failed a few times in a row and seemed to progress one table at a time with each failure:

(screenshot omitted)

Also, I find it interesting that I set the C* schema consensus timeout long but it bounces back instantly saying consensus was not reached? Ex:

2020/09/28 17:23:29 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/09/28 17:23:29 [verbose] Cassandra schema consensus: not reached

cc @bungle @locao @hishamhm

Edit:

Yeah I see some tables like oauth2_credentials using ws_id: 38321a07-1d68-4aa8-b01f-8d64d5bf665a and jwt_secrets using ws_id: 8f1d01e1-0609-4bda-a021-a01e619d1590

Additional Details & Logs

hishamhm commented 4 years ago

Thank you for the detailed report as always! We'll be looking into it.

jeremyjpj0916 commented 4 years ago

Yeah, I think it would have worked fine if the finish had not died waiting on schema consensus and had continued processing. But it looks like that can't always be the case for us, so it can't be assumed in code. I tried this same thing in another of our prod environments and the same errors happened too. I also checked the C* nodes looking for any schema mismatches or cluster health discrepancies, since this may point to C* issues as well, alongside Kong not being able to handle all situations (such as restarting a migrations finish because the first execution didn't work out), but nothing shows up there either.

root@server05502:/root
# nodetool describecluster
Cluster Information:
    Name: Stargate
    Snitch: org.apache.cassandra.locator.GossipingPropertyFileSnitch
    DynamicEndPointSnitch: enabled
    Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
    Schema versions:
        12b6ca53-8e84-3a66-8793-7ce091b5dcca: [10.204.90.234, 10.86.173.32, 10.204.90.235, 10.106.184.179, 10.106.184.193, 10.87.52.252]

root@server05502:/root
# nodetool status
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens       Owns (effective)  Host ID                               Rack
UN  10.204.90.234   113.89 MiB  256          100.0%            5be9bdf7-1337-42ac-8177-341044a4758b  RACK2
UN  10.86.173.32    117.4 MiB  256          100.0%            0f8d693c-4d99-453b-826e-6e5c68538037  RACK6
UN  10.204.90.235   335.54 MiB  256          100.0%            8e3192a7-e60c-41fb-8ca6-ce809440d982  RACK4
Datacenter: DC2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens       Owns (effective)  Host ID                               Rack
UN  10.106.184.179  117.48 MiB  256          100.0%            33747998-70fd-42c4-87cb-c7c49a182c3e  RACK1
UN  10.87.52.252    104.64 MiB  256          100.0%            74a60b30-c83c-4aa8-b3ae-7ccb16fb817f  RACK5
UN  10.106.184.193  107.06 MiB  256          100.0%            61320d76-aa12-43f0-9684-b1413e57c90d  RACK3

So that wasn't any help. I wonder if there are any analytic health checks I can run on the cloned keyspace itself to check for weirdness or mismatches too.

Edit - Used another node in the other DC with the same commands above and they checked out fine too. 😢 So no smoking gun there as far as why C* could be the culprit in some of this. Regardless, I hope Kong can make the finish safer when it needs to execute multiple times, so that all the tables still end up using a single, in-sync ws_id.

bungle commented 4 years ago

It certainly feels like some kind of eventual consistency issue. We now insert the workspace only after doing a read-before-write (AFAIK), but perhaps that read returns nil while the earlier write is still propagating, making us do the write again. And as the default ws id can be different each time, we cannot just do an update.
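
Roughly the shape of that finish-time logic, as a simplified sketch (not the literal migration code; coordinator is assumed to be an already-connected lua-cassandra peer, and kong.tools.utils is assumed here just for uuid()):

local cassandra = require "cassandra"
local utils     = require "kong.tools.utils"   -- assumed here just for uuid()

-- read-before-write: reuse the default workspace if one is already visible,
-- otherwise create it
local function ensure_default_workspace(coordinator)
  local rows, err = coordinator:execute("SELECT id, name FROM workspaces")
  if err then
    return nil, err
  end

  for _, row in ipairs(rows or {}) do
    if row.name == "default" then
      return row.id                     -- reuse what a previous run created
    end
  end

  -- if the earlier write simply has not propagated to this coordinator yet,
  -- we fall through here and mint a brand new uuid -> the duplicated ws_id
  local ws_id = utils.uuid()
  local _, err2 = coordinator:execute(
    "INSERT INTO workspaces (id, name) VALUES (?, 'default')",
    { cassandra.uuid(ws_id) })
  if err2 then
    return nil, err2
  end

  return ws_id
end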

jeremyjpj0916 commented 4 years ago

Is it not possible to add logic to the finish command so that, if the workspace table already has a default record in it, that record is used to continue processing? Or is the reservation that this would potentially break other enterprise use cases?

Or maybe that's exactly what you meant here?

We now insert the workspace only when after doing read-before-write (AFAIK), but perhaps that read will return nil when the write is done eventually

But I imagine independent executions of the finish command don't check whether a ws was already generated; based on the behavior I have seen, each one probably just generates a fresh one and drops it in?

Maybe the fact that we do full replication and local quorum reads/writes, and all our Kong C* settings, plays a role in this problem, but we do so because it's a must to ensure the OAuth2.0 bearer token plugin functions correctly for us.

To debug the behavior even if you can't reproduce this against your own localhost C* cluster:

...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to wait for schema consensus: [Read failure] Operation failed - received 0 responses and 1 failures

I imagine you can set manual breakpoints in the code to reproduce the re-entrance of injecting ws_ids when finish runs multiple times and stops at a certain point in its execution, like the situations above that happened for us.

hishamhm commented 4 years ago

Or maybe that's exactly what you meant here?

Yes, that's what we try to do. During finish, we first try to read the default workspace value to use it, and if we get nothing, we create it. Looks like the SELECT is returning an empty response.

Maybe the fact that we do full replication and local quorum reads/writes, and all our Kong C* settings, plays a role in this problem, but we do so because it's a must to ensure the OAuth2.0 bearer token plugin functions correctly for us.

If anything, that should have brought stronger consistency, right?

Puzzling.

I'm wondering if chasing after this isn't a more fruitful avenue:

Also, I find it interesting that I set the C* schema consensus timeout long but it bounces back instantly saying consensus was not reached? Ex:

2020/09/28 17:23:29 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/09/28 17:23:29 [verbose] Cassandra schema consensus: not reached

jeremyjpj0916 commented 4 years ago

Hmm, if it was an eventual consistency problem, though, the time between executions of finish should rule that out :/ ... Pretty sure I could wait 10 minutes between each finish run and it would still result in new ws_ids being inserted, meaning that first SELECT must always be failing somehow for us... odd.

And yeah, one obvious red flag to me is that we set the schema consensus timeout huge but it does not seem to be honored. It either snaps back right away due to an underlying error (that isn't printed, and no details are given), or it just straight up won't wait for the given time and the wait is being derived from some other value, either a default or something hardcoded.

bungle commented 4 years ago

I believe that it fails here: https://github.com/thibaultcha/lua-cassandra/blob/master/lib/resty/cassandra/cluster.lua#L755-L764

Which means either of these fail: https://github.com/thibaultcha/lua-cassandra/blob/master/lib/resty/cassandra/cluster.lua#L726-L730

It is busy looping, but it seems like it fails quite quickly on one of those queries.
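
For context, what that loop is effectively checking boils down to comparing schema versions across the cluster, something like this simplified sketch (not the actual cluster.lua implementation; coordinator is again an assumed lua-cassandra peer):

-- consensus means: every peer reports the same schema_version as the
-- coordinator's local node
local function check_schema_consensus(coordinator)
  local local_rows, err = coordinator:execute(
    "SELECT schema_version FROM system.local")
  if not local_rows then
    -- this is where "[Read failure] ... 0 responses and 1 failures" surfaces
    return nil, err
  end

  local peer_rows, err2 = coordinator:execute(
    "SELECT schema_version FROM system.peers")
  if not peer_rows then
    return nil, err2
  end

  local version = local_rows[1].schema_version
  for _, row in ipairs(peer_rows) do
    if row.schema_version ~= version then
      return false                      -- versions differ: consensus not reached yet
    end
  end

  return true
end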

As a workaround, one thing we could do is try to reach schema consensus more than once. But as you said, even if you waited you could still get the issue. I'm also thinking that Cassandra indexes could lie, or that some of the queries are not executed at the right consistency level. Nodes going down while migrating could also explain it, but you got it 4 times in a row, so...

Can you execute those queries:

SELECT schema_version FROM system.local
-- AND
SELECT schema_version FROM system.peers

and share the data when this happens (or is about to happen)?

bungle commented 4 years ago

Feels like this has been discussed on several occasions: https://github.com/Kong/kong/issues/4229 https://github.com/Kong/kong/issues/4228 https://discuss.konghq.com/t/kong-migrations-against-a-cassandra-cluster-fails/2660

I left them here just as pointers.

bungle commented 4 years ago

@jeremyjpj0916, could it be a tombstone-related issue: https://stackoverflow.com/a/37119894

bungle commented 4 years ago

Okay, it might be that we are talking about two different issues.

  1. duplicated default workspaces
  2. waiting for schema consensus fails

For 1: duplicated default workspaces:

We might want to ensure that these: https://github.com/Kong/kong/blob/master/kong/db/migrations/operations/200_to_210.lua#L24 https://github.com/Kong/kong/blob/master/kong/db/migrations/operations/200_to_210.lua#L43 https://github.com/Kong/kong/blob/master/kong/db/migrations/operations/212_to_213.lua#L24 https://github.com/Kong/kong/blob/master/kong/db/migrations/operations/212_to_213.lua#L43

run at a stricter consistency level. At the moment we are not sure if changing the consistency level fixes it, but it is a good guess.
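
With lua-cassandra that would mean passing an explicit consistency option on those statements, roughly like this sketch (with_quorum here is just an illustrative wrapper, not anything in the migration code; ws_id is whatever uuid is being written):

local cassandra = require "cassandra"

-- run the default-workspace read and write at a stricter, cluster-wide
-- consistency level instead of the coordinator's default
local function with_quorum(coordinator, ws_id)
  local opts = { consistency = cassandra.consistencies.quorum }

  local rows, err = coordinator:execute(
    "SELECT id, name FROM workspaces", nil, opts)
  if err then
    return nil, err
  end

  -- ... and, only if no default row came back:
  local _, err2 = coordinator:execute(
    "INSERT INTO workspaces (id, name) VALUES (?, 'default')",
    { cassandra.uuid(ws_id) }, opts)
  if err2 then
    return nil, err2
  end

  return rows
end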

For 2: waiting for schema consensus fails:

It feels like it is failing here: https://github.com/Kong/kong/blob/master/kong/db/init.lua#L577

And Cassandra implements it here: https://github.com/Kong/kong/blob/master/kong/db/strategies/cassandra/connector.lua#L396-L415

For some reason it fails sometimes with:

[Read failure] Operation failed - received 0 responses and 1 failures

This error message is coming from Cassandra itself, and is forwarded from here: https://github.com/thibaultcha/lua-cassandra/blob/master/lib/resty/cassandra/cluster.lua#L726-L730

At the moment we don't fully understand why this error happens (tombstones?), but it seems to happen in a rather non-fatal place. We could retry more, e.g. keep trying for the full max_schema_consensus_wait (120000ms timeout) and not fail on the first [Read failure].
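
In other words, instead of bailing out on the first failed read, keep polling until the deadline, roughly like this sketch (reusing the hypothetical check_schema_consensus from the earlier sketch, not the actual connector code):

-- poll for consensus until the configured timeout expires, treating a
-- transient [Read failure] as "not yet" rather than as a fatal error
local function wait_for_schema_consensus(coordinator, timeout_ms)
  local deadline = ngx.now() + timeout_ms / 1000

  repeat
    local ok, err = check_schema_consensus(coordinator)
    if ok then
      return true
    end
    -- err ~= nil is the transient read-failure case: log it and retry
    ngx.sleep(1)
    ngx.update_time()
  until ngx.now() >= deadline

  return nil, "timeout waiting for schema consensus"
end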

bungle commented 4 years ago

@jeremyjpj0916 there is now #6411; would it be too much to ask if you could try that and report back whether it has any effect in your environment? That branch does not have much more than that compared to the latest 2.1.4, as can be seen here: https://github.com/Kong/kong/compare/2.1.4...fix/cassandra-migrations

jeremyjpj0916 commented 4 years ago

Just tried the two commands on a given prod node where I had been testing these migrations on a cloned keyspace. Outside of the context of how Kong uses the C* Lua client lib tibo wrote, I'm not sure this helps; it just shows these commands do generally work in a vacuum:

Can you execute those queries:

SELECT schema_version FROM system.local
-- AND
SELECT schema_version FROM system.peers

root@server05502:/root
# cqlsh --ssl
Connected to Stargate at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.2 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
****@cqlsh> SELECT schema_version FROM system.local;

 schema_version
--------------------------------------
 12b6ca53-8e84-3a66-8793-7ce091b5dcca

(1 rows)
****@cqlsh> SELECT schema_version FROM system.peers;

 schema_version
--------------------------------------
 12b6ca53-8e84-3a66-8793-7ce091b5dcca
 12b6ca53-8e84-3a66-8793-7ce091b5dcca
 12b6ca53-8e84-3a66-8793-7ce091b5dcca
 12b6ca53-8e84-3a66-8793-7ce091b5dcca
 12b6ca53-8e84-3a66-8793-7ce091b5dcca

(5 rows)
****@cqlsh>

As for tombstones, I remember increasing the tombstone failure threshold too, and also making the GC grace policy more aggressive than the default:

ALTER TABLE kong_stage3.ratelimiting_metrics WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.locks WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.oauth2_tokens WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.locks WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.snis WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.plugins WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.targets WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.consumers WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.upstreams WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.schema_meta WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.response_ratelimiting_metrics WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.tags WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.acls WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.cluster_ca WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.cluster_events WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.jwt_secrets WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.apis WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.basicauth_credentials WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.keyauth_credentials WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.oauth2_credentials WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.certificates WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.services WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.oauth2_authorization_codes WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.routes WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.hmacauth_credentials WITH GC_GRACE_SECONDS = 3600;

Which is 1 hour vs the default 10 days for GC grace. I have forgotten whether the default_time_to_live field impacts anything as well when the GC grace is set shorter than it, though. Are you familiar? I may need to google around on that if it's not optimal currently too.

Ex:

(table property screenshots omitted)

Then the failure threshold was set to:

# When executing a scan, within or across a partition, we need to keep the
# tombstones seen in memory so we can return them to the coordinator, which
# will use them to make sure other replicas also know about the deleted rows.
# With workloads that generate a lot of tombstones, this can cause performance
# problems and even exhaust the server heap.
# (http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets)
# Adjust the thresholds here if you understand the dangers and want to
# scan more tombstones anyway.  These thresholds may also be adjusted at runtime
# using the StorageService mbean.
tombstone_warn_threshold: 10000
tombstone_failure_threshold: 500000

vs the default of 100,000

What would the migrations be querying each time that would cause that, though? Does the cluster_events table get hit each time? I know the oauth2_tokens table (probably our largest table in terms of records) probably does not get hit on every migrations finish command, especially after it's been passed through in the list. Could the oauth2_tokens table accumulating millions of tombstones within an hour be causing it? Or the cluster_events table for that matter? Unsure... not sure how I could ever tune it more than that unless we set GC grace to 0, which is not recommended in C* cluster environments because, from what I have heard, if a node goes down you start getting "ghost" records since proper consensus can't be achieved via the tombstone info. Maybe on a table like oauth2_tokens or cluster_events it's less important, because the rows have timestamp values that dictate their validity to Kong whether or not they show up in a query, so maybe 0 isn't so bad for some of the more heavily written tables.

jeremyjpj0916 commented 4 years ago

@bungle More than happy to try your latest PR today. I likely won't have results until later this evening, so in about 10 or so hours, due to some restrictions on when I can touch some of our hosts for this testing. I will get the PR's files dropped in so the modified Kong is ready to go then, though. Fingers crossed it yields some success.

Appreciate the work you put in trying to resolve this scenario.

jeremyjpj0916 commented 4 years ago

Also did do a few queries against our live C* db:

****@cqlsh:kong_prod> select count(*) from cluster_events;

 count
-------
 87271

(1 rows)

Warnings :
Aggregation query used without partition key

Read 100 live rows and 89643 tombstone cells for query SELECT * FROM kong_prod.cluster_events LIMIT 100 (see tombstone_warn_threshold)

Lotta tombstones here for sure.

As for querying the oauth2_tokens table from one of the nodes via cqlsh... uhh, that's probably not good:

****@cqlsh:kong_prod> select count(*) from oauth2_tokens;
OperationTimedOut: errors={'127.0.0.1': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=127.0.0.1

Need a longer timeout just to figure things out I suppose.

Honestly pondering if not even tombstoning the cluster events and oauth tokens is the play... since that kind of data should not really need to be in that strict a consensus (okay for a little mismatched data or temporary ghost records, right?) because Kong can see timestamps and valid tokens etc. cluster_events may be more important, but I am not sure... pretty sure most of those events are 1 to 1 with oauth-token-related invalidations.

Yep, a little select confirms most of those entries too:

*****@cqlsh:kong_prod> select * from cluster_events limit 10;

 channel       | at                              | node_id                              | id                                   | data                                                   | nbf
---------------+---------------------------------+--------------------------------------+--------------------------------------+--------------------------------------------------------+---------------------------------
 invalidations | 2020-10-01 19:44:10.123000+0000 | 3ec49684-1e0b-45ac-a13b-33135cc09164 | 6de48d36-18cd-41e9-b431-6f24f0a507f0 | oauth2_tokens:XXX | 2020-10-01 19:44:15.111000+0000
 invalidations | 2020-10-01 19:44:10.153000+0000 | 3ec49684-1e0b-45ac-a13b-33135cc09164 | c683d31f-c001-4c45-80ba-47b95780f3f4 | oauth2_tokens:XXX  | 2020-10-01 19:44:15.152000+0000
 invalidations | 2020-10-01 19:44:10.167000+0000 | 7dccf201-351a-402e-a33f-f1f291063f87 | 407b95f7-05eb-47f8-bc25-689c3065b9bf | oauth2_tokens:XXX  | 2020-10-01 19:44:15.167000+0000
 invalidations | 2020-10-01 19:44:10.223000+0000 | 21881b66-c99a-470c-ab76-76dcd7fbbc90 | 86180cd0-07ea-43a0-9189-bc63452a817d | oauth2_tokens:XXX  | 2020-10-01 19:44:15.222000+0000
 invalidations | 2020-10-01 19:44:10.230000+0000 | 7456814f-9464-4602-9a1e-7e2b55e4c4b2 | 1aa7f89d-a98c-4a3b-a62f-cbe8458d2b12 | oauth2_tokens:XXX  | 2020-10-01 19:44:15.229000+0000
 invalidations | 2020-10-01 19:44:10.256000+0000 | 21881b66-c99a-470c-ab76-76dcd7fbbc90 | 6b5c3f94-db6b-4103-95c1-95b4f1de2700 | oauth2_tokens:XXX  | 2020-10-01 19:44:15.244000+0000

..... continued ... 

Probably not totally related, but the large number of invalidation events due to oauth2 does start to clog up the cluster_events table in general:

I would think oauth2 client credentials grant tokens don't ever really need an "invalidation" event? They are generated and valid for a given TTL, unless you can revoke them early (is that even a thing in the oauth2 CC pattern; does Kong support revoking client credentials bearer tokens? we personally have never used such a feature). I think nodes can query the tokens table and know what's valid/invalid based on the token generation timestamps in the table? Isn't it just a /token endpoint that produces valid tokens with the right client_id and client_secret and bearer type provided for oauth2 CC?

jeremyjpj0916 commented 4 years ago

Not super familiar either, but the nodetool cfstats details on that oauth2_tokens table look like this in one prod env:

        Table: oauth2_tokens
        SSTable count: 4
        Space used (live): 153678228
        Space used (total): 153678228
        Space used by snapshots (total): 0
        Off heap memory used (total): 2694448
        SSTable Compression Ratio: 0.5890186627210159
        Number of partitions (estimate): 1614912
        Memtable cell count: 5138
        Memtable data size: 1268839
        Memtable off heap memory used: 0
        Memtable switch count: 1288
        Local read count: 24577751
        Local read latency: 0.067 ms
        Local write count: 53152217
        Local write latency: 0.109 ms
        Pending flushes: 0
        Percent repaired: 70.18
        Bloom filter false positives: 1
        Bloom filter false ratio: 0.00000
        Bloom filter space used: 2307288
        Bloom filter off heap memory used: 2307256
        Index summary off heap memory used: 363776
        Compression metadata off heap memory used: 23416
        Compacted partition minimum bytes: 43
        Compacted partition maximum bytes: 179
        Compacted partition mean bytes: 128
        Average live cells per slice (last five minutes): 1.0
        Maximum live cells per slice (last five minutes): 1
        Average tombstones per slice (last five minutes): 1.984081041968162
        Maximum tombstones per slice (last five minutes): 4
        Dropped Mutations: 2052

This is mostly just informational, to give an idea of what it's like running things like the oauth2 plugin in a well-used environment.

bungle commented 4 years ago

@jeremyjpj0916 good feedback as always. I will create cards for us to improve the oauth2 situation. We have two lines in code:

kong.db.oauth2_authorization_codes:delete({ id = auth_code.id })

and

kong.db.oauth2_tokens:delete({ id = token.id })

There are a couple of updates too. The invalidation event is for nodes to clear their caches. But let's see if there is a better way.

jeremyjpj0916 commented 4 years ago

@bungle

Files are all prepped, and I will be cloning the keyspace again in a few hours. I noticed you refactored all the core migrations.

Would others, such as the plugins from my initial migration logs, need a refactor too:

 jwt: 003_200_to_210
 acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211

To gain the benefits you are hoping will occur on the core migration executions?

Such as the connector:query() -> coordinator:execute() ?

Asking this because in my first finish execution the core migrations seemed to go off without a problem, but it was the subsequent individual plugin migrations in later finish calls that seemed to get stuck one by one:

For reference again:

2020/09/28 17:19:58 [debug] pending migrations to finish:
  core: 009_200_to_210, 010_210_to_211
   jwt: 003_200_to_210
   acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/09/28 17:19:58 [info] migrating core on keyspace 'kong_prod2'...
2020/09/28 17:19:58 [debug] running migration: 009_200_to_210
2020/09/28 17:20:12 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/09/28 17:20:12 [verbose] Cassandra schema consensus: reached
2020/09/28 17:20:12 [info] core migrated up to: 009_200_to_210 (executed)
2020/09/28 17:20:12 [debug] running migration: 010_210_to_211
2020/09/28 17:20:12 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/09/28 17:20:12 [verbose] Cassandra schema consensus: reached
2020/09/28 17:20:12 [info] core migrated up to: 010_210_to_211 (executed)
2020/09/28 17:20:12 [info] migrating jwt on keyspace 'kong_prod2'...
2020/09/28 17:20:12 [debug] running migration: 003_200_to_210
2020/09/28 17:20:14 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/09/28 17:20:14 [verbose] Cassandra schema consensus: not reached
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to wait for schema consensus: [Read failure] Operation failed - received 0 responses and 1 failures
stack traceback:
        [C]: in function 'assert'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
        [C]: in function 'pcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
        [C]: in function 'error'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>
jeremyjpj0916 commented 4 years ago

Also worth noting some odd behavior: when cloning the keyspaces I run a script that executes a series of commands, and oddly the only table that throws an error is the oauth2 creds table:

error: Unknown column redirect_uri during deserialization
-- StackTrace --
java.lang.RuntimeException: Unknown column redirect_uri during deserialization
    at org.apache.cassandra.db.SerializationHeader$Component.toHeader(SerializationHeader.java:321)
    at org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:511)
    at org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:385)
    at org.apache.cassandra.db.ColumnFamilyStore.loadNewSSTables(ColumnFamilyStore.java:771)
    at org.apache.cassandra.db.ColumnFamilyStore.loadNewSSTables(ColumnFamilyStore.java:709)
    at org.apache.cassandra.service.StorageService.loadNewSSTables(StorageService.java:5110)
    at sun.reflect.GeneratedMethodAccessor519.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:71)
    at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:275)
    at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:112)
    at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:46)
    at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:237)
    at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:138)
    at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:252)
    at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:819)
    at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:801)
    at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1468)
    at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:76)
    at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1309)
    at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1401)
    at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:829)
    at sun.reflect.GeneratedMethodAccessor547.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:357)
    at sun.rmi.transport.Transport$1.run(Transport.java:200)
    at sun.rmi.transport.Transport$1.run(Transport.java:197)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.Transport.serviceCall(Transport.java:196)
    at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:573)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:834)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:688)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:687)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

So the oauth2 creds don't refresh over right away when running the command below (the error above is what happens when I execute it):

ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} oauth2_credentials"

So I follow up with a manual export/import of that table using COPY (shown after the script below).

This is the original shell script doing the keyspace cloning that I am referring to:

#!/bin/bash
cloneKeyspace()
{
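    # Clone a Cassandra keyspace by:
    #   1. dropping any pre-existing destination keyspace and wiping leftover data dirs / stale 'copy' snapshots
    #   2. recreating the destination schema from a CQL dump of the source keyspace
    #   3. snapshotting the source tables and moving the snapshot files into the destination table dirs
    #   4. running nodetool refresh on each destination table, then a full repair of the destination keyspace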
    local username="${1}"
    local start_host="${2}"
    local hostnames="${3}"
    local keyspace_source="${4}"
    local keyspace_destination="${5}"

    if ssh ${username}@${start_host} "echo 2>&1"; then
      ssh ${username}@${start_host} "cqlsh --ssl -e 'DROP KEYSPACE ${keyspace_destination}'"
    fi

    sleep 10

    for hostname in $(echo $hostnames | sed "s/,/ /g")
    do
      if ssh ${username}@${hostname} "echo 2>&1"; then
        ssh ${username}@${hostname} "rm -rf /data/cassandra/data/${keyspace_destination}"
        ssh ${username}@${hostname} "nodetool clearsnapshot -t copy"
      fi
    done

    if ssh ${username}@${start_host} "echo 2>&1"; then
     ssh ${username}@${start_host} "cqlsh --ssl -e 'DESCRIBE KEYSPACE ${keyspace_source}' > ${keyspace_source}.txt"
     ssh ${username}@${start_host} "sed -i 's/${keyspace_source}/${keyspace_destination}/g' ${keyspace_source}.txt"
     ssh ${username}@${start_host} "cqlsh --ssl -f '${keyspace_source}.txt'"
     ssh ${username}@${start_host} "rm ${keyspace_source}.txt"      
     echo "Success on Keyspace Schema clone: "${start_host}
    else
      echo "Failed on Keyspace Schema clone: "${start_host}
    fi

    for hostname in $(echo $hostnames | sed "s/,/ /g")
    do
       if ssh ${username}@${hostname} "echo 2>&1"; then
        ssh ${username}@${hostname} "nodetool snapshot -t copy ${keyspace_source}"
        echo "Success on Cassandra snapshot: "${hostname} ${keyspace_source}

        sleep 20

        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/acls-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/acls-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/apis-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/apis-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/ca_certificates-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/ca_certificates-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/certificates-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/certificates-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/cluster_events-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/cluster_events-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/consumers-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/consumers-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/jwt_secrets-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/jwt_secrets-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/keyauth_credentials-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/keyauth_credentials-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/locks-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/locks-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/oauth2_authorization_codes-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/oauth2_authorization_codes-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/oauth2_credentials-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/oauth2_credentials-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/oauth2_tokens-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/oauth2_tokens-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/plugins-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/plugins-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/ratelimiting_metrics-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/ratelimiting_metrics-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/routes-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/routes-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/schema_meta-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/schema_meta-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/services-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/services-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/snis-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/snis-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/tags-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/tags-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/targets-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/targets-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/upstreams-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/upstreams-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/workspaces-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/workspaces-*/"

        ssh ${username}@${hostname} "nodetool clearsnapshot -t copy"

        echo "Success on Cassandra mv of snapshot files to destination: "${keyspace_destination}

        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} ratelimiting_metrics"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} oauth2_tokens"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} locks"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} snis"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} plugins"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} targets"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} consumers"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} upstreams"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} schema_meta"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} tags"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} acls"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} cluster_events"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} workspaces"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} jwt_secrets"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} apis"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} keyauth_credentials"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} oauth2_credentials"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} certificates"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} services"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} ca_certificates"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} oauth2_authorization_codes"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} routes"

        echo "Success on Cassandra nodetool refresh on destination keyspace: "${keyspace_destination}
       else
        echo "Failed on Cassandra snapshot: "${hostname}
       fi
    done

    if ssh ${username}@${start_host} "echo 2>&1"; then
     ssh ${username}@${start_host} "nodetool repair -full ${keyspace_destination}"
     echo "Success on Keyspace repair: "${start_host} ${keyspace_destination}
    else
     echo "Failed on Keyspace repair : "${start_host} ${keyspace_destination}
    fi

    echo "Script processing completed!"
}

cloneKeyspace "${1}" "${2}" "${3}" "${4}" "${5}"

So, to get the creds to import and show up properly when I query the table afterward, I do:

***@cqlsh> COPY kong_src.oauth2_credentials TO 'oauth2_credentials.dat';
Using 1 child processes

Starting copy of kong_desto.oauth2_credentials with columns [id, client_id, client_secret, client_type, consumer_id, created_at, hash_secret, name, redirect_uris, tags].
Processed: 81 rows; Rate:      46 rows/s; Avg. rate:      23 rows/s
81 rows exported to 1 files in 3.563 seconds.
****@cqlsh> COPY kong_desto.oauth2_credentials FROM 'oauth2_credentials.dat';
jeremyjpj0916 commented 4 years ago

Welp, trying it out now as is:

First the migrations up (which so far has never given me grief):

kong migrations up --db-timeout 120 --vv

2020/10/02 02:39:50 [debug] loading subsystems migrations...
2020/10/02 02:39:50 [verbose] retrieving keyspace schema state...
2020/10/02 02:39:50 [verbose] schema state retrieved
2020/10/02 02:39:50 [debug] loading subsystems migrations...
2020/10/02 02:39:50 [verbose] retrieving keyspace schema state...
2020/10/02 02:39:50 [verbose] schema state retrieved
2020/10/02 02:39:50 [debug] migrations to run:
         core: 009_200_to_210, 010_210_to_211, 011_212_to_213
          jwt: 003_200_to_210
          acl: 003_200_to_210, 004_212_to_213
rate-limiting: 004_200_to_210
       oauth2: 004_200_to_210, 005_210_to_211
2020/10/02 02:39:51 [info] migrating core on keyspace 'kong_prod2'...
2020/10/02 02:39:51 [debug] running migration: 009_200_to_210
2020/10/02 02:39:53 [info] core migrated up to: 009_200_to_210 (pending)
2020/10/02 02:39:53 [debug] running migration: 010_210_to_211
2020/10/02 02:39:53 [info] core migrated up to: 010_210_to_211 (pending)
2020/10/02 02:39:53 [debug] running migration: 011_212_to_213
2020/10/02 02:39:53 [info] core migrated up to: 011_212_to_213 (executed)
2020/10/02 02:39:53 [info] migrating jwt on keyspace 'kong_prod2'...
2020/10/02 02:39:53 [debug] running migration: 003_200_to_210
2020/10/02 02:39:54 [info] jwt migrated up to: 003_200_to_210 (pending)
2020/10/02 02:39:54 [info] migrating acl on keyspace 'kong_prod2'...
2020/10/02 02:39:54 [debug] running migration: 003_200_to_210
2020/10/02 02:39:54 [info] acl migrated up to: 003_200_to_210 (pending)
2020/10/02 02:39:54 [debug] running migration: 004_212_to_213
2020/10/02 02:39:54 [info] acl migrated up to: 004_212_to_213 (pending)
2020/10/02 02:39:54 [info] migrating rate-limiting on keyspace 'kong_prod2'...
2020/10/02 02:39:54 [debug] running migration: 004_200_to_210
2020/10/02 02:39:54 [info] rate-limiting migrated up to: 004_200_to_210 (executed)
2020/10/02 02:39:54 [info] migrating oauth2 on keyspace 'kong_prod2'...
2020/10/02 02:39:54 [debug] running migration: 004_200_to_210
2020/10/02 02:39:56 [info] oauth2 migrated up to: 004_200_to_210 (pending)
2020/10/02 02:39:56 [debug] running migration: 005_210_to_211
2020/10/02 02:39:56 [info] oauth2 migrated up to: 005_210_to_211 (pending)
2020/10/02 02:39:56 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/10/02 02:39:59 [verbose] Cassandra schema consensus: reached in 3049ms
2020/10/02 02:39:59 [info] 9 migrations processed
2020/10/02 02:39:59 [info] 2 executed
2020/10/02 02:39:59 [info] 7 pending
2020/10/02 02:39:59 [debug] loading subsystems migrations...
2020/10/02 02:39:59 [verbose] retrieving keyspace schema state...
2020/10/02 02:39:59 [verbose] schema state retrieved
2020/10/02 02:39:59 [info]
Database has pending migrations; run 'kong migrations finish' when ready
jeremyjpj0916 commented 4 years ago

Welp, migrations finish hit more problems again. The good news is the 120-second timeout was respected, it seems, but consensus just never got reached there. Look at how fast those core migrations reached consensus too, hah (3ms), while the other one just hangs :/ :

kong migrations finish --db-timeout 120 --vv

The environment variables are the same as in my original post, so I am cutting those out for brevity.

2020/10/02 02:42:24 [debug] loading subsystems migrations...
2020/10/02 02:42:24 [verbose] retrieving keyspace schema state...
2020/10/02 02:42:24 [verbose] schema state retrieved
2020/10/02 02:42:24 [debug] loading subsystems migrations...
2020/10/02 02:42:24 [verbose] retrieving keyspace schema state...
2020/10/02 02:42:25 [verbose] schema state retrieved
2020/10/02 02:42:25 [debug] pending migrations to finish:
  core: 009_200_to_210, 010_210_to_211
   jwt: 003_200_to_210
   acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/10/02 02:42:25 [info] migrating core on keyspace 'kong_prod2'...
2020/10/02 02:42:25 [debug] running migration: 009_200_to_210
2020/10/02 02:42:37 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/10/02 02:42:37 [verbose] Cassandra schema consensus: reached in 3ms
2020/10/02 02:42:37 [info] core migrated up to: 009_200_to_210 (executed)
2020/10/02 02:42:37 [debug] running migration: 010_210_to_211
2020/10/02 02:42:37 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/10/02 02:42:37 [verbose] Cassandra schema consensus: reached in 3ms
2020/10/02 02:42:37 [info] core migrated up to: 010_210_to_211 (executed)
2020/10/02 02:42:37 [info] migrating jwt on keyspace 'kong_prod2'...
2020/10/02 02:42:37 [debug] running migration: 003_200_to_210
2020/10/02 02:42:38 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/10/02 02:44:38 [verbose] Cassandra schema consensus: not reached in 120095ms
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to wait for schema consensus: [Read failure] Operation failed - received 0 responses and 1 failures
stack traceback:
        [C]: in function 'assert'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
        [C]: in function 'pcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
        [C]: in function 'error'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>

Going to execute kong migrations finish multiple times until finish "completes", and then I will check the keyspace.

jeremyjpj0916 commented 4 years ago

Second finish: again it did wait the 120 seconds, but it seems it didn't successfully reach a consensus:

2020/10/02 02:47:53 [debug] loading subsystems migrations...
2020/10/02 02:47:53 [verbose] retrieving keyspace schema state...
2020/10/02 02:47:53 [verbose] schema state retrieved
2020/10/02 02:47:53 [debug] loading subsystems migrations...
2020/10/02 02:47:53 [verbose] retrieving keyspace schema state...
2020/10/02 02:47:53 [verbose] schema state retrieved
2020/10/02 02:47:53 [debug] pending migrations to finish:
   acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/10/02 02:47:53 [info] migrating acl on keyspace 'kong_prod2'...
2020/10/02 02:47:53 [debug] running migration: 003_200_to_210
2020/10/02 02:48:00 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/10/02 02:50:00 [verbose] Cassandra schema consensus: not reached in 120002ms
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to wait for schema consensus: [Read failure] Operation failed - received 0 responses and 1 failures
stack traceback:
        [C]: in function 'assert'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
        [C]: in function 'pcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
        [C]: in function 'error'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>
jeremyjpj0916 commented 4 years ago

3rd:

2020/10/02 02:51:31 [debug] loading subsystems migrations...
2020/10/02 02:51:31 [verbose] retrieving keyspace schema state...
2020/10/02 02:51:32 [verbose] schema state retrieved
2020/10/02 02:51:32 [debug] loading subsystems migrations...
2020/10/02 02:51:32 [verbose] retrieving keyspace schema state...
2020/10/02 02:51:32 [verbose] schema state retrieved
2020/10/02 02:51:32 [debug] pending migrations to finish:
   acl: 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/10/02 02:51:32 [info] migrating acl on keyspace 'kong_prod2'...
2020/10/02 02:51:32 [debug] running migration: 004_212_to_213
2020/10/02 02:51:32 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/10/02 02:53:32 [verbose] Cassandra schema consensus: not reached in 120024ms
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to wait for schema consensus: [Read failure] Operation failed - received 0 responses and 1 failures
stack traceback:
        [C]: in function 'assert'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
        [C]: in function 'pcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
        [C]: in function 'error'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>
jeremyjpj0916 commented 4 years ago

4th, which oddly succeeded the rest of the way???

2020/10/02 02:54:41 [debug] loading subsystems migrations...
2020/10/02 02:54:41 [verbose] retrieving keyspace schema state...
2020/10/02 02:54:41 [verbose] schema state retrieved
2020/10/02 02:54:41 [debug] loading subsystems migrations...
2020/10/02 02:54:41 [verbose] retrieving keyspace schema state...
2020/10/02 02:54:41 [verbose] schema state retrieved
2020/10/02 02:54:41 [debug] pending migrations to finish:
oauth2: 004_200_to_210, 005_210_to_211
2020/10/02 02:54:41 [info] migrating oauth2 on keyspace 'kong_prod2'...
2020/10/02 02:54:41 [debug] running migration: 004_200_to_210
2020/10/02 02:54:43 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/10/02 02:54:43 [verbose] Cassandra schema consensus: reached in 3ms
2020/10/02 02:54:43 [info] oauth2 migrated up to: 004_200_to_210 (executed)
2020/10/02 02:54:43 [debug] running migration: 005_210_to_211
2020/10/02 02:54:44 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/10/02 02:54:44 [verbose] Cassandra schema consensus: reached in 3ms
2020/10/02 02:54:44 [info] oauth2 migrated up to: 005_210_to_211 (executed)
2020/10/02 02:54:44 [info] 2 migrations processed
2020/10/02 02:54:44 [info] 2 executed
2020/10/02 02:54:44 [debug] loading subsystems migrations...
2020/10/02 02:54:44 [verbose] retrieving keyspace schema state...
2020/10/02 02:54:44 [verbose] schema state retrieved
2020/10/02 02:54:44 [info] No pending migrations to finish

And a migrations list command after:

2020/10/02 02:57:29 [debug] loading subsystems migrations...
2020/10/02 02:57:29 [verbose] retrieving keyspace schema state...
2020/10/02 02:57:30 [verbose] schema state retrieved
2020/10/02 02:57:30 [info] Executed migrations:
         core: 000_base, 003_100_to_110, 004_110_to_120, 005_120_to_130, 006_130_to_140, 007_140_to_150, 008_150_to_200, 009_200_to_210, 010_210_to_211, 011_212_to_213
          jwt: 000_base_jwt, 002_130_to_140, 003_200_to_210
          acl: 000_base_acl, 002_130_to_140, 003_200_to_210, 004_212_to_213
rate-limiting: 000_base_rate_limiting, 003_10_to_112, 004_200_to_210
       oauth2: 000_base_oauth2, 003_130_to_140, 004_200_to_210, 005_210_to_211

And a finish command again just to ensure it says no more migrations:

2020/10/02 02:58:12 [debug] loading subsystems migrations...
2020/10/02 02:58:12 [verbose] retrieving keyspace schema state...
2020/10/02 02:58:12 [verbose] schema state retrieved
2020/10/02 02:58:13 [debug] loading subsystems migrations...
2020/10/02 02:58:13 [verbose] retrieving keyspace schema state...
2020/10/02 02:58:13 [verbose] schema state retrieved
2020/10/02 02:58:13 [info] No pending migrations to finish

Time to check the keyspace and see if we got a bunch of workspaces injected in there. It is still concerning that I have to execute multiple times with failures, but ultimately, if it actually made the right changes, I am okay running it multiple times 😆 .

jeremyjpj0916 commented 4 years ago

Well, some excitement here! It seems I am no longer getting the duplicated workspace IDs in the table!!!

[cqlsh 5.0.1 | Cassandra 3.11.1 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
*****@cqlsh> use kong_prod2;
*****@cqlsh:kong_prod2> describe tables;

ratelimiting_metrics  response_ratelimiting_metrics  oauth2_credentials
oauth2_tokens         tags                           certificates
locks                 acls                           services
snis                  cluster_events                 ca_certificates
plugins               workspaces                     oauth2_authorization_codes
targets               jwt_secrets                    routes
consumers             apis                           hmacauth_credentials
upstreams             basicauth_credentials
schema_meta           keyauth_credentials

*****@cqlsh:kong_prod2> select * from workspaces;

 id                                   | comment | config | created_at                      | meta | name
--------------------------------------+---------+--------+---------------------------------+------+---------
 c709a0a2-687f-43c3-a46a-7bd437dc5c24 |    null |   null | 2020-10-02 02:42:24.000000+0000 | null | default

(1 rows)
kongdba@cqlsh:kong_prod2>

Let's go ahead and verify the other tables are using this id all over the place.

jeremyjpj0916 commented 4 years ago

ACLs, LGTM with respect to the ws_id and how it's appended:

***@cqlsh:kong_prod2> select * from acls limit 5;

 id                                   | cache_key                                                                                                              | consumer_id                          | created_at                      | group                                | tags | ws_id
--------------------------------------+------------------------------------------------------------------------------------------------------------------------+--------------------------------------+---------------------------------+--------------------------------------+------+--------------------------------------
 187d0487-176b-454c-8a24-9364eafa13e3 | acls:34fe7bbd-7470-4b02-9886-2f0445900ec6:d4d07b18-35d7-402a-a65a-e0f77950de97::::c709a0a2-687f-43c3-a46a-7bd437dc5c24 | 34fe7bbd-7470-4b02-9886-2f0445900ec6 | 2019-12-16 08:01:59.000000+0000 | d4d07b18-35d7-402a-a65a-e0f77950de97 | null | c709a0a2-687f-43c3-a46a-7bd437dc5c24
 753c13ec-e781-4034-9126-3170c10144e9 | acls:97b8b271-da5e-4aff-afca-bfa0c1dad70f:555bd712-a309-414e-9145-1b3af9a79559::::c709a0a2-687f-43c3-a46a-7bd437dc5c24 | 97b8b271-da5e-4aff-afca-bfa0c1dad70f | 2020-06-04 19:56:42.000000+0000 | 555bd712-a309-414e-9145-1b3af9a79559 | null | c709a0a2-687f-43c3-a46a-7bd437dc5c24
 4e9a2da9-b453-4c21-899f-3768725bfb6c | acls:aee2dab7-755d-4900-80f6-7c795ae0c51e:a56539e7-b061-48b5-a86b-646aab4548e8::::c709a0a2-687f-43c3-a46a-7bd437dc5c24 | aee2dab7-755d-4900-80f6-7c795ae0c51e | 2020-03-27 20:17:44.000000+0000 | a56539e7-b061-48b5-a86b-646aab4548e8 | null | c709a0a2-687f-43c3-a46a-7bd437dc5c24
 ef8a963f-5cbb-4065-9c54-51e85372c50a | acls:39af1af3-4720-46a6-ad3a-13607e835e3f:d7fd7390-d872-47aa-8eba-c1e436ccd08c::::c709a0a2-687f-43c3-a46a-7bd437dc5c24 | 39af1af3-4720-46a6-ad3a-13607e835e3f | 2020-01-07 11:14:18.000000+0000 | d7fd7390-d872-47aa-8eba-c1e436ccd08c | null | c709a0a2-687f-43c3-a46a-7bd437dc5c24
 b75330ea-8adc-4124-8406-ed9968115cc3 | acls:191c101d-1394-41c2-8e2a-520cbfb0d505:e2b94d6e-17d3-4757-9022-441a4b28e534::::c709a0a2-687f-43c3-a46a-7bd437dc5c24 | 191c101d-1394-41c2-8e2a-520cbfb0d505 | 2020-01-13 11:44:06.000000+0000 | e2b94d6e-17d3-4757-9022-441a4b28e534 | null | c709a0a2-687f-43c3-a46a-7bd437dc5c24

(5 rows)
kongdba@cqlsh:kong_prod2>
jeremyjpj0916 commented 4 years ago

Routes, LGTM:

*****@cqlsh:kong_prod2> select * from routes limit 1;

 partition | id                                   | created_at                      | destinations | headers | hosts | https_redirect_status_code | methods | name | path_handling | paths                    | preserve_host | protocols         | regex_priority | service_id                           | snis | sources | strip_path | tags | updated_at                      | ws_id
-----------+--------------------------------------+---------------------------------+--------------+---------+-------+----------------------------+---------+------+---------------+--------------------------+---------------+-------------------+----------------+--------------------------------------+------+---------+------------+------+---------------------------------+--------------------------------------
    routes | 001eeede-9d8c-44fa-8737-e3f07ff9c876 | 2020-01-13 11:44:04.000000+0000 |         null |    null |  null |                        426 |    null | null |            v1 | ['/some/path/v1'] |         False | {'http', 'https'} |              0 | 8afa2fa1-29f5-4337-9b92-5d01814916de | null |    null |       True | null | 2020-01-13 11:44:04.000000+0000 | c709a0a2-687f-43c3-a46a-7bd437dc5c24

(1 rows)
****@cqlsh:kong_prod2>

Services, LGTM:

****@cqlsh:kong_prod2> select * from services limit 1;

 partition | id                                   | ca_certificates | client_certificate_id | connect_timeout | created_at                      | host           | name                                                              | path                                         | port | protocol | read_timeout | retries | tags | tls_verify | tls_verify_depth | updated_at                      | write_timeout | ws_id
-----------+--------------------------------------+-----------------+-----------------------+-----------------+---------------------------------+----------------+-------------------------------------------------------------------+----------------------------------------------+------+----------+--------------+---------+------+------------+------------------+---------------------------------+---------------+--------------------------------------
  services | 005a0644-06b4-4b2d-8eca-6dec316f0113 |            null |                  null |            2000 | 2019-05-23 21:21:04.000000+0000 | service.company.com | c709a0a2-687f-43c3-a46a-7bd437dc5c24:c360.prod.service.here.v5 | /api/my/prod/url/v5.0/search |  443 |    https |         9000 |       0 | null |       null |             null | 2019-05-23 21:21:04.000000+0000 |          9000 | c709a0a2-687f-43c3-a46a-7bd437dc5c24

(1 rows)
******@cqlsh:kong_prod2>
jeremyjpj0916 commented 4 years ago

jwt_secrets, LGTM:

****@cqlsh:kong_prod2> select * from jwt_secrets limit 1;

 id                                   | algorithm | consumer_id                          | created_at                      | key                                                                   | rsa_public_key | secret                           | tags | ws_id
--------------------------------------+-----------+--------------------------------------+---------------------------------+-----------------------------------------------------------------------+----------------+----------------------------------+------+--------------------------------------
 b8fb26c4-ec3d-4d38-be5b-f1cc0e27aa78 |     HS256 | 623021d2-dc20-4402-be44-582607a8fcb3 | 2019-06-05 05:44:10.000000+0000 | c709a0a2-687f-43c3-a46a-7bd437dc5c24:XXXXXXX |           null | XXXXXXXX | null | c709a0a2-687f-43c3-a46a-7bd437dc5c24

(1 rows)
*****@cqlsh:kong_prod2>

oauth2_credentials, LGTM:


****@cqlsh:kong_prod2> select * from oauth2_credentials limit 1;

 id                                   | client_id                                                             | client_secret                    | client_type | consumer_id                          | created_at                      | hash_secret | name        | redirect_uris        | tags | ws_id
--------------------------------------+-----------------------------------------------------------------------+----------------------------------+-------------+--------------------------------------+---------------------------------+-------------+-------------+----------------------+------+--------------------------------------
 cf2f541c-7864-4be8-a9c8-cc1b57d50399 | c709a0a2-687f-43c3-a46a-7bd437dc5c24:XXXXXX | XXXXXX |        null | XXXXXXX | 2018-08-23 04:49:23.966000+0000 |        null | OAuth_Creds | {'http://company.com'} | null | c709a0a2-687f-43c3-a46a-7bd437dc5c24

(1 rows)
****@cqlsh:kong_prod2>
jeremyjpj0916 commented 4 years ago

Consumers, LGTM:

*****@cqlsh:kong_prod2> select * from consumers limit 1;

 id                                   | created_at                      | custom_id | tags | username                                  | ws_id
--------------------------------------+---------------------------------+-----------+------+-------------------------------------------+--------------------------------------
 4a7a7568-589a-42f6-a488-639dfb1ec21a | 2020-05-12 21:11:13.000000+0000 |      null | null | c709a0a2-687f-43c3-a46a-7bd437dc5c24:username1 | c709a0a2-687f-43c3-a46a-7bd437dc5c24

(1 rows)
*****@cqlsh:kong_prod2>

oauth2_tokens: no example records there, sadly. Hmm, maybe they were already garbage collected during my migration and this detective work:

*****@cqlsh:kong_prod2> select * from oauth2_tokens limit 1;

 id | access_token | authenticated_userid | created_at | credential_id | expires_in | refresh_token | scope | service_id | token_type | ws_id
----+--------------+----------------------+------------+---------------+------------+---------------+-------+------------+------------+-------

(0 rows)
*****@cqlsh:kong_prod2>

I will likely run this whole process over again and check this specific table by itself afterward, to ensure any existing tokens get set properly ^ .

jeremyjpj0916 commented 4 years ago

cluster_events:

****@cqlsh:kong_prod2> select * from cluster_events limit 1;

 channel | at | node_id | id | data | nbf
---------+----+---------+----+------+-----

(0 rows)
****@cqlsh:kong_prod2>

Not modified with a ws_id in the table structure, huh???? Hmm.

jeremyjpj0916 commented 4 years ago

And lastly a plugins example; looks good to me:

*****@cqlsh:kong_prod2> select * from plugins limit 1;

 id                                   | api_id | cache_key                                                                                | config                                                                                   | consumer_id | created_at                      | enabled | name | protocols | route_id                             | run_on | service_id | tags | ws_id
--------------------------------------+--------+------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------+-------------+---------------------------------+---------+------+-----------+--------------------------------------+--------+------------+------+--------------------------------------
 638b8470-fc50-45f7-964e-2a6155f32c72 |   null | plugins:acl:fb77c07a-4f9f-408f-969c-bdf447aae0ad::::c709a0a2-687f-43c3-a46a-7bd437dc5c24 | {"deny":null,"allow":["fb77c07a-4f9f-408f-969c-bdf447aae0ad"],"hide_groups_header":true} |        null | 2019-01-28 18:24:57.000000+0000 |    True |  acl |      null | fb77c07a-4f9f-408f-969c-bdf447aae0ad |  first |       null | null | c709a0a2-687f-43c3-a46a-7bd437dc5c24

(1 rows)
*****@cqlsh:kong_prod2>

Edit:

So oauth2_tokens is where I am unsure, because I just have no record data to confirm the migration would have worked right for that table, likely because the tokens expired during that window I guess. If the token data was somehow not cloned over to my temp keyspace, I'll need to check that too. Also, cluster_events not having a ws_id seems odd to me, but I guess that's how the design is for now (I would think if Kong CPs ever have segregation then the ws_id has to be integrated with cluster_events as well, just my novice speculating thoughts 😆 ).

So we still don't seem to have a fix for some of the finish commands being unable to read and reach a consensus (1 failure). Could it be related to some tables with 1000's of records and paging again??? Runtime Kong C* paging was fixed by that one contributor, but could the migrations process be using the wrong paging calls and hitting the failures due to some variable scope problem?

But the good news is it seems there are now no dup workspaces AND the tables that needed to be modified got modified, and modified correctly, eh? So it seems we are getting positive results here.

I also have 2 environments like this, so I tested the migrations on one and am leaving the other untouched on its cloned keyspace for now. If you have any further tweaks or logging you want me to add to the files, I can do that for a second test as well.

bungle commented 4 years ago

@jeremyjpj0916,

So it really looks like when this happens, it stays like that until you try again. Now I am thinking: should we close the connection before we check schema consensus? The error does not happen in the migrations themselves, but in code that is run between the migrations (which is less worrying). Good to know that the multiple-ws issue seems to be fixed.

jeremyjpj0916 commented 4 years ago

@bungle right, having to execute the finish multiple times is sub-optimal in its current state, at least for our use case. Yeah, I think honoring the set schema consensus timeout was still a good change, but ultimately I will set that way lower than 120, because it seems not to be the problem with my cluster; the wait isn't helping any. Do you have a snippet to drop into one of these files that will close the connection and re-open it before checking schema consensus, as a test to confirm your suspicions? Something like the sketch below is what I have in mind. I can use that on our second prod environment that has a cloned keyspace ready for upgrading 😁 . The perfect fix for me would be one that runs all the way through without me having to execute finish over and over.
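Purely as a sketch of the kind of test patch I mean (untested on my side, and the exact method names — :close(), :connect_migrations(), :wait_for_schema_consensus() — are my guesses based on the errors and this thread):

-- hypothetical tweak around the consensus wait in the finish path:
-- drop the connection that ran the migrations, grab a fresh one,
-- and only then wait for schema consensus
connector:close()

local conn, err = connector:connect_migrations()
if not conn then
  return nil, err
end

local ok, cerr = connector:wait_for_schema_consensus()
if not ok then
  return nil, cerr
end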

Then after that I will probably do this clone-keyspace-and-upgrade run like 10-20 times in a row to feel really good that Kong will be stable on the jump from 2.0.5 to 2.1.x (whatever version this ends up being, 2.1.5?).

Also, I am sure it is important that any alterations to the migration process, while making it work better for a multi-DC 6-node C* cluster with high replication and LOCAL_QUORUM reads/writes, still need to function okay against a single node etc. As shown in these latest 2.1.x upgrades, it can sometimes be a challenge to make something that fits the best of both worlds. Just for stats, these environments have a few thousand service/route resources, around 5k plugins, and then plenty of oauth2 tokens and a number of acls. Not super large DBs in terms of data size, but decent at this point.

But hey, at least the changes thus far made the finish command idempotent (@rsbrisci fancy big word, not mine). So the ws_id alterations look to be correct!

bungle commented 4 years ago

@jeremyjpj0916, are there any errors logged by Cassandra when this happens? Or close to when it happens?

jeremyjpj0916 commented 4 years ago

@bungle will check on this during the work day and get back to you on whether I see anything interesting in the C* logs. I probably should have been doing exactly that when performing these upgrades, but it slipped my mind.

jeremyjpj0916 commented 4 years ago

Just gonna break down the logs I can see on the node I believe we would have hit for migrations.

First up:

From server05503, looking at the 10/1 date.

/var/log/cassandra/cassandra.log:

The only error on 10/1 came from the keyspace cloning error in my shell script copying oauth2_credentials over (which I end up doing manually using COPY table commands, as described earlier):

ERROR [SSTableBatchOpen:1] 2020-10-01 22:29:35,338 CassandraDaemon.java:235 - Exception in thread Thread[SSTableBatchOpen:1,5,main]
java.lang.RuntimeException: Unknown column redirect_uri during deserialization
        at org.apache.cassandra.db.SerializationHeader$Component.toHeader(SerializationHeader.java:338) ~[apache-cassandra-3.11.8.jar:3.11.8]
        at org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:522) ~[apache-cassandra-3.11.8.jar:3.11.8]
        at org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:385) ~[apache-cassandra-3.11.8.jar:3.11.8]
        at org.apache.cassandra.io.sstable.format.SSTableReader$3.run(SSTableReader.java:570) ~[apache-cassandra-3.11.8.jar:3.11.8]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_232]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_232]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_232]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_232]
        at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84) [apache-cassandra-3.11.8.jar:3.11.8]
        at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_232]

Next up:

/var/log/cassandra/system.log

ERROR [Reference-Reaper:1] 2020-10-01 20:36:03,866 Ref.java:224 - LEAK DETECTED: a reference (org.apache.cassandra.utils.concurrent.Ref$State@69a76445) to class org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier@1838738857:/data/cassandra/data/kong_prod2/oauth2_credentials-feb3d290044e11ebafdba3cdf7a167d7/mc-1-big was not released before the reference was garbage collected

ERROR [SSTableBatchOpen:1] 2020-10-01 22:29:35,338 CassandraDaemon.java:235 - Exception in thread Thread[SSTableBatchOpen:1,5,main]
java.lang.RuntimeException: Unknown column redirect_uri during deserialization
        at org.apache.cassandra.db.SerializationHeader$Component.toHeader(SerializationHeader.java:338) ~[apache-cassandra-3.11.8.jar:3.11.8]
        at org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:522) ~[apache-cassandra-3.11.8.jar:3.11.8]
        at org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:385) ~[apache-cassandra-3.11.8.jar:3.11.8]
        at org.apache.cassandra.io.sstable.format.SSTableReader$3.run(SSTableReader.java:570) ~[apache-cassandra-3.11.8.jar:3.11.8]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_232]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_232]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_232]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_232]
        at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84) [apache-cassandra-3.11.8.jar:3.11.8]
        at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_232]

But both of those logs are from before the kong migrations finish command would have run... the 2nd error statement was during the keyspace copy, similar to the one found in cassandra.log, and the first error was 2 hours before that and unrelated to the migrations effort. Unless the migrations hit a different node and I need to check each of the nodes independently; it doesn't seem like I can tell from the debug prints which node was used to handle the migrations work (unless I removed those lines from the logs up above thinking they would not be relevant prior to posting).

Gonna check the other nodes too.

server05504 had this in /var/log/cassandra/system.log:

ERROR [MessagingService-Incoming-/10.204.90.234] 2020-10-01 22:30:41,625 CassandraDaemon.java:228 - Exception in thread Thread[MessagingService-Incoming-/10.204.90.234,5,main]
java.lang.RuntimeException: Unknown column kind during deserialization
        at org.apache.cassandra.db.Columns$Serializer.deserialize(Columns.java:452) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.db.SerializationHeader$Serializer.deserializeForMessaging(SerializationHeader.java:412) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer.deserializeHeader(UnfilteredRowIteratorSerializer.java:195) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.deserialize30(PartitionUpdate.java:835) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.deserialize(PartitionUpdate.java:823) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:425) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:434) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.service.MigrationManager$MigrationsSerializer.deserialize(MigrationManager.java:669) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.service.MigrationManager$MigrationsSerializer.deserialize(MigrationManager.java:652) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.net.MessageIn.read(MessageIn.java:123) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:192) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:180) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:94) ~[apache-cassandra-3.11.2.jar:3.11.2]
ERROR [MessagingService-Incoming-/10.204.90.234] 2020-10-01 22:30:42,569 CassandraDaemon.java:228 - Exception in thread Thread[MessagingService-Incoming-/10.204.90.234,5,main]
java.lang.RuntimeException: Unknown column kind during deserialization
        at org.apache.cassandra.db.Columns$Serializer.deserialize(Columns.java:452) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.db.SerializationHeader$Serializer.deserializeForMessaging(SerializationHeader.java:412) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer.deserializeHeader(UnfilteredRowIteratorSerializer.java:195) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.deserialize30(PartitionUpdate.java:835) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.deserialize(PartitionUpdate.java:823) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:425) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:434) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.service.MigrationManager$MigrationsSerializer.deserialize(MigrationManager.java:669) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.service.MigrationManager$MigrationsSerializer.deserialize(MigrationManager.java:652) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.net.MessageIn.read(MessageIn.java:123) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:192) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:180) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:94) ~[apache-cassandra-3.11.2.jar:3.11.2]

But again, I believe that's not even during the migrations executions, but likely during keyspace-clone-related activity, because the timestamp aligns with the redirect_uri error from the other server above (granted, I don't remember the "kind" error showing up in output prior, only the redirect_uri one associated with oauth2 creds on keyspace clone and data copy over)...

Checking the final node in this DC for local quorum oddities. EDIT: Nope, 0 extra logs there....

jeremyjpj0916 commented 4 years ago

Some extra coordinator settings in my cassandra.yaml from a given node:

# How long the coordinator should wait for read operations to complete
read_request_timeout_in_ms: 5000
# How long the coordinator should wait for seq or index scans to complete
range_request_timeout_in_ms: 10000
# How long the coordinator should wait for writes to complete
write_request_timeout_in_ms: 5000
# How long the coordinator should wait for counter writes to complete
counter_write_request_timeout_in_ms: 5000
# How long a coordinator should continue to retry a CAS operation
# that contends with other proposals for the same row
cas_contention_timeout_in_ms: 1000
# How long the coordinator should wait for truncates to complete
# (This can be much longer, because unless auto_snapshot is disabled
# we need to flush first so we can snapshot before removing the data.)
truncate_request_timeout_in_ms: 60000
# The default timeout for other, miscellaneous operations
request_timeout_in_ms: 10000
jeremyjpj0916 commented 4 years ago

@bungle so the fix that prevents the dup workspaces is going into 2.2.0 now and not a 2.1.5 patch, correct? So I should wait and potentially upgrade my prod nodes from 2.0.5 to 2.2.0 to not run into the ws migration errors I have been seeing (even if we can't get to the bottom of why finish does not execute smoothly all the way through for us and we have to run finish multiple times).

bungle commented 4 years ago

@jeremyjpj0916 the removal of migration helpers and the passing of the connection was considered breaking, so it could not go into 2.1.5 anymore. In my opinion that is overly cautious, but I guess better safe than sorry. Overly cautious in that it could only break custom plugins that use the undocumented helpers (from which we have already removed functions in the past without notice... I just saw that it was not used anywhere, thus removed it and instead passed the connection/coordinator to teardown). Signature before: (connector, undocumented_helpers); signature now: (connector, connection/coordinator).

Perhaps instead of connector, connection/coordinator we should just pass migration_executor. But that is a far bigger change and even more breaking.
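
For illustration only, a rough sketch of what that signature change means for a Cassandra migration teardown (this is not Kong's actual migration code; the query and body are invented):

local cassandra = require "cassandra"

return {
  cassandra = {
    up = [[ -- DDL applied during 'kong migrations up' ]],

    -- before the change: teardown(connector, undocumented_helpers)
    -- after the change:  teardown(connector, coordinator) -- the framework hands
    -- over the connection/coordinator it already holds instead of a helpers table
    teardown = function(connector, coordinator)
      local rows, err = coordinator:execute(
        "SELECT id FROM workspaces WHERE name = 'default'")
      if err then
        return nil, err
      end
      -- ... data fix-ups would go here, reusing the same coordinator ...
      return true
    end,
  },
}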

These commits are the problematic ones for a patch release: https://github.com/Kong/kong/pull/6411/commits/998d13efb14b11290abd411397ed051d2208e26b https://github.com/Kong/kong/pull/6411/commits/ed9f373d94ea5ce98d250319bd390093812368ed

We could perhaps try to backport, or somehow modify the PR, to get the workspace-duplication prevention code into a patch. The wait for schema consensus is still something I am not sure is fixed by e.g. passing the coordinator. It still somehow feels like the connection becomes unstable after running some migrations, and then it cannot be used anymore to call wait for schema consensus. Perhaps we should not use the connector there either, but just call the coordinator. But then, what if that coordinator becomes unstable? I have tried to track down whether any of our code calls :setkeepalive or :close on the connection that is grabbed before we call wait_for_schema_consensus, but I could not find anything.

Restarting finish surely makes all the internal structures clean and grabs a new connection... but I still do not understand why the connection goes bad (can it become bad because of a network issue or something, or after running too many migrations?).

jeremyjpj0916 commented 4 years ago

@bungle

Gotcha. I suppose I can also just roll with 2.1.5 plus the exact files you had me test from your pending PR, roll that up through prod, and then move onto 2.2.x at a later time; otherwise the jump will only get bigger and I would need to vet 2.2.x for problems, and I think 2.0.5 to 2.1.5 is a big enough jump as is for prod.

Yeah, was hoping the C* logs might reveal a problem, but so far no dice. If the Kong team gets any ideas for a snippet of a debug print line, either in the Kong layer or in the underlying C* client lib, that would be insightful for getting to an RCA, let me know and I will give that a test run too.
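
For example (purely hypothetical, not an existing Kong or lua-cassandra helper), a debug print along these lines could be dropped in right before the consensus wait, reusing the coordinator the migration already holds and only reading Cassandra's system tables:

local function log_schema_versions(coordinator)
  -- what the contacted node thinks the schema version is
  local local_rows = coordinator:execute("SELECT schema_version FROM system.local")
  -- what it has heard from its peers
  local peer_rows = coordinator:execute("SELECT peer, schema_version FROM system.peers")

  if local_rows and local_rows[1] then
    print("local schema_version: " .. tostring(local_rows[1].schema_version))
  end
  for _, row in ipairs(peer_rows or {}) do
    print("peer " .. tostring(row.peer) ..
          " schema_version: " .. tostring(row.schema_version))
  end
end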

bungle commented 4 years ago

@jeremyjpj0916, the PR I did was not mergeable anywhere ;-(. So back to the drawing board; here is a fix that can go into 2.1.5 and 2.2.0: https://github.com/Kong/kong/pull/6439

jeremyjpj0916 commented 4 years ago

Were these ultimately the same changes, or would a retest be helpful to you here?

bungle commented 4 years ago

@jeremyjpj0916,

If you have time to try out #6439 and give us feedback, it would be great. I do expect the multiple-default-workspace issue to be solved (there are two fixes for that: 1. try to create it only once, 2. use proper consistency when creating and reading it). The other issue is trickier. I found a couple of new places, but ultimately I don't think I found anything that could solve it. I didn't add a reconnect because it feels like a workaround that just hides a possible issue that needs to be fixed. The biggest change is that we never call connect_migrations in migrations anymore (only get_stored_connection), and let the framework handle the connections. As always, thanks for cooperating on this; all feedback is welcome. I had to rework the PR a bit, so it is also good to hear if something went wrong with the re-implementation (I tried to be careful).
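
As a rough illustration of fix 2 only (assuming the lua-cassandra driver's options table; the helper name and wiring here are invented, not the PR's code), reading the default workspace at a stronger consistency would look something like:

local cassandra = require "cassandra"

local function find_default_ws(coordinator)
  local rows, err = coordinator:execute(
    "SELECT id FROM workspaces WHERE name = 'default'", nil,
    { consistency = cassandra.consistencies.local_quorum })
  if err then
    return nil, err
  end
  -- returns nil if the row is still not visible at LOCAL_QUORUM
  return rows and rows[1] and rows[1].id
end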

Still, it feels like the connection gets corrupted at some point. But it is strange where it does so: migration ok, recording the migration ok, but waiting for schema consensus not ok. Perhaps it is the Cassandra server that closes the connection. Something along the lines of this: https://stackoverflow.com/questions/39343475/how-does-cassandra-server-identifies-crashed-client-connections. Could it be that there are some migrations that make Cassandra think the client has crashed or something?

bungle commented 4 years ago

One thing to note: on your last try the jwt migration failed once and the acl plugin failed twice. Some notes about the ACL migration:

  1. acl has composite cache key
  2. the acl has foreign key
  3. the acl uniques is uniques = {}

Everything else in your migrations went ok (on that last try).

bungle commented 4 years ago

The only distinctive thing I can see those migrations doing is that they run UPDATEs while iterating the table. So could it be that iteration goes wrong, once you have enough entities, if you modify that table a lot at the same time? @jeremyjpj0916, I'll create you a separate branch that splits the iterate-and-update into two loops (roughly as sketched below), so let's see if that makes a difference.
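
A minimal sketch of that two-loop shape (assuming the coordinator:iterate/coordinator:execute pattern; the table, columns and function name are illustrative only, not the actual branch):

local cassandra = require "cassandra"

local function backfill_ws_id(coordinator, default_ws_id)
  -- pass 1: only read, buffering the primary keys that need a change
  local ids = {}
  for rows, err in coordinator:iterate("SELECT id, ws_id FROM acls") do
    if err then
      return nil, err
    end
    for _, row in ipairs(rows) do
      if not row.ws_id then
        ids[#ids + 1] = row.id
      end
    end
  end

  -- pass 2: run the UPDATEs only after the paging above has finished
  for _, id in ipairs(ids) do
    local _, err = coordinator:execute(
      "UPDATE acls SET ws_id = ? WHERE id = ?",
      { cassandra.uuid(default_ws_id), cassandra.uuid(id) })
    if err then
      return nil, err
    end
  end

  return true
end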

bungle commented 4 years ago

@jeremyjpj0916,

this branch has one additional commit: https://github.com/Kong/kong/commits/fix/migrations-master-iteration-separated

That moves the updates out of the iteration. It does this only for the migrations that you saw as problematic; there are older migrations with a similar pattern of updating inside the iterator. If you could do a test run with both #6439 and fix/migrations-master-iteration-separated, that would help. All results are good, as they close off possibilities.

jeremyjpj0916 commented 4 years ago

Gotcha, will give both of those a go and let you know what I see. I am revamping our keyspace migration scripting to be able to pull down that keyspace and data and clone it onto a different env of infra (I've been cloning on prod infra and we are in peak season, so it seems wrong to do so haha. Gonna use Ansible and push and pull some files around for my v2 iteration). Should have some info later this week or early next week from testing.

Thanks again for your diligence @bungle .

jeremyjpj0916 commented 4 years ago

Rewrote my copy-keyspace hack as an Ansible script that can send a keyspace from a src C* cluster node to a totally different C* cluster node (assuming the keyspace does not already exist; no checks or anything fancy here):

---
- hosts: src_cassandra_node
  become: true
  tasks:
  - name: Make data directory for the clone files
    file:
      path: /data/cassandra/clone
      state: directory
  - name: Move into the C* folder we intend to put all our files and Export keyspace schema
    command: bash -c "cd /data/cassandra/clone && cqlsh --ssl -e 'DESCRIBE KEYSPACE {{ KEYSPACE_SOURCE_NAME }}' > {{ KEYSPACE_SOURCE_NAME }}.txt"
  - name: Alter keyspace schema name for later import in other cluster
    command: bash -c "cd /data/cassandra/clone && sed -i 's/{{ KEYSPACE_SOURCE_NAME }}/{{ KEYSPACE_DESTINATION_NAME }}/g' {{ KEYSPACE_SOURCE_NAME }}.txt"
  - name: Export table data into logical files for import later
    command: bash -c "cd /data/cassandra/clone && cqlsh --ssl -e \"use {{ KEYSPACE_SOURCE_NAME }}; describe tables;\" |  tr \" \" \"\\n\" | sort | sed -r '/^\s*$/d' | xargs -I % cqlsh --ssl -e \"COPY {{ KEYSPACE_SOURCE_NAME }}.% TO '%.dat'\""
    register: export_table_data
  - debug: msg="{{ export_table_data.stdout_lines }}"
  - name: Make sure local directory is empty
    local_action: file path=/data/cassandra/clone state=absent
  - name: Make temp local directory for cassandra files on controller
    local_action: command mkdir -p /data/cassandra/clone
  - shell: (cd /data/cassandra/clone; find . -maxdepth 1 -type f) | cut -d'/' -f2
    register: files_to_copy
  - fetch: src=/data/cassandra/clone/{{ item }} dest=/data/cassandra/clone/ flat=yes
    with_items: "{{ files_to_copy.stdout_lines }}"
  - name: Remove files and directory from remote host
    command: rm -rf /data/cassandra/clone

- hosts: dest_cassandra_node
  become: true
  tasks:
  - name: Remove files and directory from remote host
    command: rm -rf /data/cassandra/clone
  - name: Make data directory for the clone files
    file:
      path: /data/cassandra/clone
      state: directory
  - name: Move Cassandra files to destination C* node
    copy:
      src: /data/cassandra/clone/
      dest: /data/cassandra/clone
      owner: cassandra
      group: cassandra
      mode: 0755
  - name: Make sure local ansible controller directory is emptied afterward
    local_action: file path=/data/cassandra/clone state=absent
  - name: Build new keyspace and schema
    command: bash -c "cd /data/cassandra/clone && cqlsh --ssl -f '{{ KEYSPACE_SOURCE_NAME }}.txt'"
  - name: Import data to new keyspace
    command: bash -c "cd /data/cassandra/clone && cqlsh --ssl -e \"use {{ KEYSPACE_DESTINATION_NAME }}; describe tables;\" |  tr \" \" \"\\n\" | sort | sed -r '/^\s*$/d' | xargs -I % cqlsh --ssl -e \"COPY {{ KEYSPACE_DESTINATION_NAME }}.% FROM '%.dat'\""
    register: import_table_data
  - debug: msg="{{ import_table_data.stdout_lines }}"  
  - name: Repair newly made clone keyspace for good measure
    command: bash -c "nodetool repair -full {{ KEYSPACE_DESTINATION_NAME }}"
  - name: Clean out temp keyspace files and data on remote C* destination
    command: rm -rf /data/cassandra/clone
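
For anyone adapting this, a hypothetical invocation (the playbook filename and inventory path are made up; the host group names and variables come from the playbook above):

ansible-playbook -i inventory.ini clone_keyspace.yml \
  -e "KEYSPACE_SOURCE_NAME=kong_prod KEYSPACE_DESTINATION_NAME=kong_prod2"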

I will be resuming the new bungle PR testing here shortly tomorrow and into the weekend/Monday!

Much faster execution than my first go shell script hack too.

jeremyjpj0916 commented 4 years ago

First go at #6439 PR against my cloned prod 2.0.5 keyspace and its data. Made sure I had some oauth2 tokens present and valid in the db this time too.

kong migrations list --vv

2020/10/13 05:12:30 [debug] loading subsystems migrations...
2020/10/13 05:12:30 [verbose] retrieving keyspace schema state...
2020/10/13 05:12:30 [verbose] schema state retrieved
2020/10/13 05:12:30 [info] Executed migrations:
         core: 000_base, 003_100_to_110, 004_110_to_120, 005_120_to_130, 006_130_to_140, 007_140_to_150, 008_150_to_200
          jwt: 000_base_jwt, 002_130_to_140
          acl: 000_base_acl, 002_130_to_140
rate-limiting: 000_base_rate_limiting, 003_10_to_112
       oauth2: 000_base_oauth2, 003_130_to_140
2020/10/13 05:12:30 [info]
New migrations available:
         core: 009_200_to_210, 010_210_to_211, 011_212_to_213
          jwt: 003_200_to_210
          acl: 003_200_to_210, 004_212_to_213
rate-limiting: 004_200_to_210
       oauth2: 004_200_to_210, 005_210_to_211
2020/10/13 05:12:30 [info]
Run 'kong migrations up' to proceed

kong migrations up --db-timeout 20 --vv

2020/10/13 05:13:55 [debug] loading subsystems migrations...
2020/10/13 05:13:55 [verbose] retrieving keyspace schema state...
2020/10/13 05:13:55 [verbose] schema state retrieved
2020/10/13 05:13:56 [debug] loading subsystems migrations...
2020/10/13 05:13:56 [verbose] retrieving keyspace schema state...
2020/10/13 05:13:56 [verbose] schema state retrieved
2020/10/13 05:13:56 [debug] migrations to run:
         core: 009_200_to_210, 010_210_to_211, 011_212_to_213
          jwt: 003_200_to_210
          acl: 003_200_to_210, 004_212_to_213
rate-limiting: 004_200_to_210
       oauth2: 004_200_to_210, 005_210_to_211
2020/10/13 05:13:56 [info] migrating core on keyspace 'kong_prod2'...
2020/10/13 05:13:56 [debug] running migration: 009_200_to_210
2020/10/13 05:14:02 [info] core migrated up to: 009_200_to_210 (pending)
2020/10/13 05:14:02 [debug] running migration: 010_210_to_211
2020/10/13 05:14:02 [info] core migrated up to: 010_210_to_211 (pending)
2020/10/13 05:14:02 [debug] running migration: 011_212_to_213
2020/10/13 05:14:02 [info] core migrated up to: 011_212_to_213 (executed)
2020/10/13 05:14:02 [info] migrating jwt on keyspace 'kong_prod2'...
2020/10/13 05:14:02 [debug] running migration: 003_200_to_210
2020/10/13 05:14:03 [info] jwt migrated up to: 003_200_to_210 (pending)
2020/10/13 05:14:03 [info] migrating acl on keyspace 'kong_prod2'...
2020/10/13 05:14:03 [debug] running migration: 003_200_to_210
2020/10/13 05:14:03 [info] acl migrated up to: 003_200_to_210 (pending)
2020/10/13 05:14:03 [debug] running migration: 004_212_to_213
2020/10/13 05:14:03 [info] acl migrated up to: 004_212_to_213 (pending)
2020/10/13 05:14:03 [info] migrating rate-limiting on keyspace 'kong_prod2'...
2020/10/13 05:14:03 [debug] running migration: 004_200_to_210
2020/10/13 05:14:03 [info] rate-limiting migrated up to: 004_200_to_210 (executed)
2020/10/13 05:14:03 [info] migrating oauth2 on keyspace 'kong_prod2'...
2020/10/13 05:14:03 [debug] running migration: 004_200_to_210
2020/10/13 05:14:05 [info] oauth2 migrated up to: 004_200_to_210 (pending)
2020/10/13 05:14:05 [debug] running migration: 005_210_to_211
2020/10/13 05:14:05 [info] oauth2 migrated up to: 005_210_to_211 (pending)
2020/10/13 05:14:05 [verbose] waiting for Cassandra schema consensus (20000ms timeout)...
2020/10/13 05:14:07 [verbose] Cassandra schema consensus: reached in 2537ms
2020/10/13 05:14:07 [info] 9 migrations processed
2020/10/13 05:14:07 [info] 2 executed
2020/10/13 05:14:07 [info] 7 pending
2020/10/13 05:14:07 [debug] loading subsystems migrations...
2020/10/13 05:14:07 [verbose] retrieving keyspace schema state...
2020/10/13 05:14:08 [verbose] schema state retrieved
2020/10/13 05:14:08 [info]
Database has pending migrations; run 'kong migrations finish' when ready
/ $

First run of finish: kong migrations finish --db-timeout 20 --vv

2020/10/13 05:15:11 [debug] loading subsystems migrations...
2020/10/13 05:15:11 [verbose] retrieving keyspace schema state...
2020/10/13 05:15:11 [verbose] schema state retrieved
2020/10/13 05:15:11 [debug] loading subsystems migrations...
2020/10/13 05:15:11 [verbose] retrieving keyspace schema state...
2020/10/13 05:15:11 [verbose] schema state retrieved
2020/10/13 05:15:11 [debug] pending migrations to finish:
  core: 009_200_to_210, 010_210_to_211
   jwt: 003_200_to_210
   acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/10/13 05:15:12 [info] migrating core on keyspace 'kong_prod2'...
2020/10/13 05:15:12 [debug] running migration: 009_200_to_210
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to run migration '009_200_to_210' teardown: unable to find a default workspace
stack traceback:
        [C]: in function 'assert'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
        [C]: in function 'pcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
        [C]: in function 'error'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>

Second run of finish: kong migrations finish --db-timeout 20 --vv

2020/10/13 05:15:53 [debug] loading subsystems migrations...
2020/10/13 05:15:53 [verbose] retrieving keyspace schema state...
2020/10/13 05:15:53 [verbose] schema state retrieved
2020/10/13 05:15:53 [debug] loading subsystems migrations...
2020/10/13 05:15:53 [verbose] retrieving keyspace schema state...
2020/10/13 05:15:54 [verbose] schema state retrieved
2020/10/13 05:15:54 [debug] pending migrations to finish:
  core: 009_200_to_210, 010_210_to_211
   jwt: 003_200_to_210
   acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/10/13 05:15:54 [info] migrating core on keyspace 'kong_prod2'...
2020/10/13 05:15:54 [debug] running migration: 009_200_to_210
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to run migration '009_200_to_210' teardown: unable to find a default workspace
stack traceback:
        [C]: in function 'assert'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
        [C]: in function 'pcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
        [C]: in function 'error'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>

Third run of finish: kong migrations finish --db-timeout 20 --vv

2020/10/13 05:16:29 [debug] loading subsystems migrations...
2020/10/13 05:16:29 [verbose] retrieving keyspace schema state...
2020/10/13 05:16:30 [verbose] schema state retrieved
2020/10/13 05:16:30 [debug] loading subsystems migrations...
2020/10/13 05:16:30 [verbose] retrieving keyspace schema state...
2020/10/13 05:16:30 [verbose] schema state retrieved
2020/10/13 05:16:30 [debug] pending migrations to finish:
  core: 009_200_to_210, 010_210_to_211
   jwt: 003_200_to_210
   acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/10/13 05:16:31 [info] migrating core on keyspace 'kong_prod2'...
2020/10/13 05:16:31 [debug] running migration: 009_200_to_210
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to run migration '009_200_to_210' teardown: unable to find a default workspace
stack traceback:
        [C]: in function 'assert'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
        [C]: in function 'pcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
        [C]: in function 'error'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>

Fourth run of finish: kong migrations finish --db-timeout 20 --vv

2020/10/13 05:17:11 [debug] loading subsystems migrations...
2020/10/13 05:17:11 [verbose] retrieving keyspace schema state...
2020/10/13 05:17:12 [verbose] schema state retrieved
2020/10/13 05:17:12 [debug] loading subsystems migrations...
2020/10/13 05:17:12 [verbose] retrieving keyspace schema state...
2020/10/13 05:17:12 [verbose] schema state retrieved
2020/10/13 05:17:12 [debug] pending migrations to finish:
  core: 009_200_to_210, 010_210_to_211
   jwt: 003_200_to_210
   acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/10/13 05:17:12 [info] migrating core on keyspace 'kong_prod2'...
2020/10/13 05:17:12 [debug] running migration: 009_200_to_210
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to run migration '009_200_to_210' teardown: unable to find a default workspace
stack traceback:
        [C]: in function 'assert'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
        [C]: in function 'pcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
        [C]: in function 'error'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>

Umm @bungle, it's now erroring every time on trying to find a default workspace. Well, let's check the keyspace:

# cqlsh --ssl
Connected to Stargate at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.8 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
*****@cqlsh> use kong_prod2;
*****@cqlsh:kong_prod2> select * from workspaces;

 id                                   | comment | config | created_at                      | meta | name
--------------------------------------+---------+--------+---------------------------------+------+---------
 926762fb-7c62-4ddb-8535-5577af0cefac |    null |   null | 2020-10-13 05:15:11.000000+0000 | null | default

(1 rows)
*****@cqlsh:kong_prod2>

The default ws was injected at 05:15:11 by the first migrations finish, I imagine; no idea why the rest of the processing and the subsequent finish commands are not seeing it.

Did a diff on the 009_200_to_210 core migration file just to make sure your patched file was indeed in there and identical, and it was. Seems whatever logic was dropped in this smaller PR made it less effective or introduced a new bug that prevents finding the injected workspace.

Will hold off on the other pending branch you have until you tell me to test it, in case you feel #6439 needs more attention. If you want me to go ahead with your other experimental branch of changes I can do so as well; let me know how you prefer to proceed.

Edit - Looks like we land here: https://github.com/Kong/kong/blob/0d64a67d34d40e0facd66b015c46f8a6ba994f71/kong/db/migrations/core/009_200_to_210.lua#L251 . I wonder why the prior local default_ws, err = operations.cassandra_ensure_default_ws(coordinator) isn't returning an err; it just gets default_ws as nil, I suppose... when I am able to query it and see it just fine in the C* db itself on all nodes in that datacenter it could possibly be using.

bungle commented 4 years ago

@jeremyjpj0916, the query is here: https://github.com/Kong/kong/blob/0d64a67d34d40e0facd66b015c46f8a6ba994f71/kong/db/migrations/operations/200_to_210.lua#L24

I like that we got a new problem, as that might help us find the real issue. It seems like the query did not return rows. It also seems like that query does not return rows even when you start the migration again.

Is this query wrong:

SELECT id FROM workspaces WHERE name='default'

Can you execute it on cql?

jeremyjpj0916 commented 4 years ago

@bungle, yep, works like a charm, so I guess the question is why the CQL statement does not work in the context of Kong for me, even across multiple runs of migrations finish. And what is different about your first big PR that got scratched, which did get migrations to seemingly work and get past this step, vs what we have now in PR #6439 that gets stuck on it:

***@cqlsh:kong_prod2> select * from workspaces;

 id                                   | comment | config | created_at                      | meta | name
--------------------------------------+---------+--------+---------------------------------+------+---------
 926762fb-7c62-4ddb-8535-5577af0cefac |    null |   null | 2020-10-13 05:15:11.000000+0000 | null | default

(1 rows)
***@cqlsh:kong_prod2> SELECT id FROM workspaces WHERE name='default';

 id
--------------------------------------
 926762fb-7c62-4ddb-8535-5577af0cefac

(1 rows)