Kong / kong

🦍 The Cloud-Native API Gateway and AI Gateway.
https://konghq.com/install/
Apache License 2.0
39.4k stars 4.83k forks

Brief Kong Startup Error logs #3417

Closed jeremyjpj0916 closed 6 years ago

jeremyjpj0916 commented 6 years ago

Summary

Seeing some errors briefly, for a second, on Kong startup. They do not seem too concerning, but I brought it up in the forum and Tieske wanted full logs, so here you have them:

2018/04/25 18:45:22 [verbose] Kong: 0.13.1
2018/04/25 18:45:22 [debug] ngx_lua: 10011
2018/04/25 18:45:22 [debug] nginx: 1013006
2018/04/25 18:45:22 [debug] Lua: LuaJIT 2.1.0-beta3
2018/04/25 18:45:22 [verbose] no config file found at /etc/kong/kong.conf
2018/04/25 18:45:22 [verbose] no config file found at /etc/kong.conf
2018/04/25 18:45:22 [verbose] no config file, skipping loading
2018/04/25 18:45:22 [debug] KONG_ERROR_DEFAULT_TYPE ENV found with "text/plain"
2018/04/25 18:45:22 [debug] KONG_ADMIN_LISTEN ENV found with "0.0.0.0:8001"
2018/04/25 18:45:22 [debug] KONG_PROXY_ACCESS_LOG ENV found with "off"
2018/04/25 18:45:22 [debug] KONG_PREFIX ENV found with "/usr/local/kong"
2018/04/25 18:45:22 [debug] KONG_CASSANDRA_USERNAME ENV found with "myUser"
2018/04/25 18:45:22 [debug] KONG_SSL_CERT_KEY ENV found with "/usr/local/kong/ssl/kongprivatekey.key"
2018/04/25 18:45:22 [debug] KONG_MEM_CACHE_SIZE ENV found with "1024m"
2018/04/25 18:45:22 [debug] KONG_SERVER_TOKENS ENV found with "off"
2018/04/25 18:45:22 [debug] KONG_CASSANDRA_PASSWORD ENV found with "******"
2018/04/25 18:45:22 [debug] KONG_CUSTOM_PLUGINS ENV found with "kong-upstream-jwt,stdout-log,optum-kong-http-log-plugin,optum-kong-oidc-plugin"
2018/04/25 18:45:22 [debug] KONG_PROXY_LISTEN ENV found with "0.0.0.0:8000, 0.0.0.0:8443 ssl http2"
2018/04/25 18:45:22 [debug] KONG_CASSANDRA_SSL ENV found with "on"
2018/04/25 18:45:22 [debug] KONG_DB_UPDATE_FREQUENCY ENV found with "5"
2018/04/25 18:45:22 [debug] KONG_DB_UPDATE_PROPAGATION ENV found with "5"
2018/04/25 18:45:22 [debug] KONG_CASSANDRA_PORT ENV found with "9042"
2018/04/25 18:45:22 [debug] KONG_DNS_ORDER ENV found with "LAST,SRV,A,CNAME"
2018/04/25 18:45:22 [debug] KONG_DNS_ERROR_TTL ENV found with "1"
2018/04/25 18:45:22 [debug] KONG_DNS_STALE_TTL ENV found with "4"
2018/04/25 18:45:22 [debug] KONG_DATABASE ENV found with "cassandra"
2018/04/25 18:45:22 [debug] KONG_UPSTREAM_KEEPALIVE ENV found with "120"
2018/04/25 18:45:22 [debug] KONG_ADMIN_ACCESS_LOG ENV found with "off"
2018/04/25 18:45:22 [debug] KONG_LUA_SOCKET_POOL_SIZE ENV found with "30"
2018/04/25 18:45:22 [debug] KONG_CASSANDRA_LOCAL_DATACENTER ENV found with "ELR"
2018/04/25 18:45:22 [debug] KONG_CASSANDRA_CONSISTENCY ENV found with "ONE"
2018/04/25 18:45:22 [debug] KONG_DB_CACHE_TTL ENV found with "10800"
2018/04/25 18:45:22 [debug] KONG_ADMIN_ERROR_LOG ENV found with "/dev/stderr"
2018/04/25 18:45:22 [debug] KONG_DNS_NOT_FOUND_TTL ENV found with "30"
2018/04/25 18:45:22 [debug] KONG_PG_SSL ENV found with "off"
2018/04/25 18:45:22 [debug] KONG_LATENCY_TOKENS ENV found with "on"
2018/04/25 18:45:22 [debug] KONG_LUA_SSL_VERIFY_DEPTH ENV found with "3"
2018/04/25 18:45:22 [debug] KONG_LUA_SSL_TRUSTED_CERTIFICATE ENV found with "/usr/local/kong/ssl/kongcert.pem"
2018/04/25 18:45:22 [debug] KONG_DNS_HOSTSFILE ENV found with "/etc/hosts"
2018/04/25 18:45:22 [debug] KONG_DNS_NO_SYNC ENV found with "off"
2018/04/25 18:45:22 [debug] KONG_SSL_CERT ENV found with "/usr/local/kong/ssl/kongcert.crt"
2018/04/25 18:45:22 [debug] KONG_CASSANDRA_TIMEOUT ENV found with "5000"
2018/04/25 18:45:22 [debug] KONG_CASSANDRA_SSL_VERIFY ENV found with "on"
2018/04/25 18:45:22 [debug] KONG_LOG_LEVEL ENV found with "notice"
2018/04/25 18:45:22 [debug] KONG_REAL_IP_RECURSIVE ENV found with "off"
2018/04/25 18:45:22 [debug] KONG_NGINX_DAEMON ENV found with "off"
2018/04/25 18:45:22 [debug] KONG_ANONYMOUS_REPORTS ENV found with "off"
2018/04/25 18:45:22 [debug] KONG_PROXY_ERROR_LOG ENV found with "/dev/stderr"
2018/04/25 18:45:22 [debug] KONG_CLIENT_SSL ENV found with "off"
2018/04/25 18:45:22 [debug] KONG_CLIENT_BODY_BUFFER_SIZE ENV found with "20m"
2018/04/25 18:45:22 [debug] KONG_CASSANDRA_CONTACT_POINTS ENV found with "server1,server2,server3,server4,server5,server6"
2018/04/25 18:45:22 [debug] KONG_REAL_IP_HEADER ENV found with "X-Real-IP"
2018/04/25 18:45:22 [debug] KONG_CASSANDRA_KEYSPACE ENV found with "kong_dev"
2018/04/25 18:45:22 [debug] KONG_CLIENT_MAX_BODY_SIZE ENV found with "1000m"
2018/04/25 18:45:22 [debug] KONG_CASSANDRA_LB_POLICY ENV found with "DCAwareRoundRobin"
2018/04/25 18:45:22 [debug] admin_access_log = "off"
2018/04/25 18:45:22 [debug] admin_error_log = "/dev/stderr"
2018/04/25 18:45:22 [debug] admin_listen = {"0.0.0.0:8001"}
2018/04/25 18:45:22 [debug] anonymous_reports = false
2018/04/25 18:45:22 [debug] cassandra_consistency = "ONE"
2018/04/25 18:45:22 [debug] cassandra_contact_points = {"server1","server2","server3","server4","server5","server6"}
2018/04/25 18:45:22 [debug] cassandra_data_centers = {"dc1:2","dc2:3"}
2018/04/25 18:45:22 [debug] cassandra_keyspace = "kong_dev"
2018/04/25 18:45:22 [debug] cassandra_lb_policy = "DCAwareRoundRobin"
2018/04/25 18:45:22 [debug] cassandra_local_datacenter = "ELR"
2018/04/25 18:45:22 [debug] cassandra_password = "******"
2018/04/25 18:45:22 [debug] cassandra_port = 9042
2018/04/25 18:45:22 [debug] cassandra_repl_factor = 1
2018/04/25 18:45:22 [debug] cassandra_repl_strategy = "SimpleStrategy"
2018/04/25 18:45:22 [debug] cassandra_schema_consensus_timeout = 10000
2018/04/25 18:45:22 [debug] cassandra_ssl = true
2018/04/25 18:45:22 [debug] cassandra_ssl_verify = true
2018/04/25 18:45:22 [debug] cassandra_timeout = 5000
2018/04/25 18:45:22 [debug] cassandra_username = "myUser"
2018/04/25 18:45:22 [debug] client_body_buffer_size = "20m"
2018/04/25 18:45:22 [debug] client_max_body_size = "1000m"
2018/04/25 18:45:22 [debug] client_ssl = false
2018/04/25 18:45:22 [debug] custom_plugins = {"kong-upstream-jwt","stdout-log","optum-kong-http-log-plugin","optum-kong-oidc-plugin"}
2018/04/25 18:45:22 [debug] database = "cassandra"
2018/04/25 18:45:22 [debug] db_cache_ttl = 10800
2018/04/25 18:45:22 [debug] db_update_frequency = 5
2018/04/25 18:45:22 [debug] db_update_propagation = 5
2018/04/25 18:45:22 [debug] dns_error_ttl = 1
2018/04/25 18:45:22 [debug] dns_hostsfile = "/etc/hosts"
2018/04/25 18:45:22 [debug] dns_no_sync = false
2018/04/25 18:45:22 [debug] dns_not_found_ttl = 30
2018/04/25 18:45:22 [debug] dns_order = {"LAST","SRV","A","CNAME"}
2018/04/25 18:45:22 [debug] dns_resolver = {}
2018/04/25 18:45:22 [debug] dns_stale_ttl = 4
2018/04/25 18:45:22 [debug] error_default_type = "text/plain"
2018/04/25 18:45:22 [debug] latency_tokens = true
2018/04/25 18:45:22 [debug] log_level = "notice"
2018/04/25 18:45:22 [debug] lua_package_cpath = ""
2018/04/25 18:45:22 [debug] lua_package_path = "./?.lua;./?/init.lua;"
2018/04/25 18:45:22 [debug] lua_socket_pool_size = 30
2018/04/25 18:45:22 [debug] lua_ssl_trusted_certificate = "/usr/local/kong/ssl/kongcert.pem"
2018/04/25 18:45:22 [debug] lua_ssl_verify_depth = 3
2018/04/25 18:45:22 [debug] mem_cache_size = "1024m"
2018/04/25 18:45:22 [debug] nginx_daemon = "off"
2018/04/25 18:45:22 [debug] nginx_optimizations = true
2018/04/25 18:45:22 [debug] nginx_user = "nobody nobody"
2018/04/25 18:45:22 [debug] nginx_worker_processes = "auto"
2018/04/25 18:45:22 [debug] pg_database = "kong"
2018/04/25 18:45:22 [debug] pg_host = "127.0.0.1"
2018/04/25 18:45:22 [debug] pg_port = 5432
2018/04/25 18:45:22 [debug] pg_ssl = false
2018/04/25 18:45:22 [debug] pg_ssl_verify = false
2018/04/25 18:45:22 [debug] pg_user = "kong"
2018/04/25 18:45:22 [debug] prefix = "/usr/local/kong"
2018/04/25 18:45:22 [debug] proxy_access_log = "off"
2018/04/25 18:45:22 [debug] proxy_error_log = "/dev/stderr"
2018/04/25 18:45:22 [debug] proxy_listen = {"0.0.0.0:8000","0.0.0.0:8443 ssl http2"}
2018/04/25 18:45:22 [debug] real_ip_header = "X-Real-IP"
2018/04/25 18:45:22 [debug] real_ip_recursive = "off"
2018/04/25 18:45:22 [debug] server_tokens = false
2018/04/25 18:45:22 [debug] ssl_cert = "/usr/local/kong/ssl/kongcert.crt"
2018/04/25 18:45:22 [debug] ssl_cert_key = "/usr/local/kong/ssl/kongprivatekey.key"
2018/04/25 18:45:22 [debug] ssl_cipher_suite = "modern"
2018/04/25 18:45:22 [debug] ssl_ciphers = "ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256"
2018/04/25 18:45:22 [debug] trusted_ips = {}
2018/04/25 18:45:22 [debug] upstream_keepalive = 120
2018/04/25 18:45:22 [verbose] prefix in use: /usr/local/kong
2018/04/25 18:45:22 [verbose] preparing nginx prefix directory at /usr/local/kong
Info: Using DT_HOME: /var/lib/dynatrace/oneagent
2018/04/25 18:45:53 [notice] 14#0: using the "epoll" event method
2018/04/25 18:45:53 [notice] 14#0: openresty/1.13.6.1
2018/04/25 18:45:53 [notice] 14#0: built by gcc 6.3.0 (Alpine 6.3.0) 
2018/04/25 18:45:53 [notice] 14#0: OS: Linux 3.10.0-693.11.1.el7.x86_64
2018/04/25 18:45:53 [notice] 14#0: getrlimit(RLIMIT_NOFILE): 1048576:1048576
2018/04/25 18:45:53 [notice] 14#0: start worker processes
2018/04/25 18:45:53 [notice] 14#0: start worker process 36
2018/04/25 18:45:53 [notice] 14#0: start worker process 37
2018/04/25 18:45:53 [notice] 14#0: start worker process 38
2018/04/25 18:45:53 [notice] 14#0: start worker process 39
2018/04/25 18:45:53 [notice] 14#0: start worker process 40
2018/04/25 18:45:53 [notice] 14#0: start worker process 41
2018/04/25 18:45:53 [notice] 14#0: start worker process 42
2018/04/25 18:45:53 [notice] 14#0: start worker process 43
2018/04/25 18:45:58 [error] 39#0: *21 [lua] events.lua:124: errlog(): worker-events: dropping event; waiting for event data timed out, id: 4, context: ngx.timer
2018/04/25 18:45:58 [error] 39#0: *21 [lua] events.lua:124: errlog(): worker-events: dropping event; waiting for event data timed out, id: 5, context: ngx.timer
2018/04/25 18:45:58 [error] 39#0: *21 [lua] events.lua:124: errlog(): worker-events: dropping event; waiting for event data timed out, id: 6, context: ngx.timer
2018/04/25 18:45:58 [error] 39#0: *21 [lua] events.lua:124: errlog(): worker-events: dropping event; waiting for event data timed out, id: 7, context: ngx.timer
2018/04/25 18:45:58 [error] 37#0: *24 [lua] events.lua:124: errlog(): worker-events: dropping event; waiting for event data timed out, id: 5, context: ngx.timer
2018/04/25 18:45:58 [error] 37#0: *24 [lua] events.lua:124: errlog(): worker-events: dropping event; waiting for event data timed out, id: 6, context: ngx.timer
2018/04/25 18:45:58 [error] 37#0: *24 [lua] events.lua:124: errlog(): worker-events: dropping event; waiting for event data timed out, id: 7, context: ngx.timer
2018/04/25 18:45:58 [error] 37#0: *24 [lua] events.lua:124: errlog(): worker-events: dropping event; waiting for event data timed out, id: 8, context: ngx.timer
2018/04/25 18:45:58 [error] 42#0: *27 [lua] events.lua:124: errlog(): worker-events: dropping event; waiting for event data timed out, id: 7, context: ngx.timer
2018/04/25 18:45:58 [error] 42#0: *27 [lua] events.lua:124: errlog(): worker-events: dropping event; waiting for event data timed out, id: 8, context: ngx.timer
2018/04/25 18:45:58 [error] 41#0: *30 [lua] events.lua:124: errlog(): worker-events: dropping event; waiting for event data timed out, id: 8, context: ngx.timer
2018/04/25 18:45:59 [error] 39#0: *21 [lua] events.lua:124: errlog(): worker-events: dropping event; waiting for event data timed out, id: 8, context: ngx.timer
2018/04/25 18:45:59 [error] 36#0: *36 [lua] events.lua:124: errlog(): worker-events: dropping event; waiting for event data timed out, id: 8, context: ngx.timer
2018/04/25 18:45:59 [error] 38#0: *37 [lua] events.lua:124: errlog(): worker-events: dropping event; waiting for event data timed out, id: 8, context: ngx.timer
2018/04/25 18:45:59 [error] 40#0: *38 [lua] events.lua:124: errlog(): worker-events: dropping event; waiting for event data timed out, id: 8, context: ngx.timer

We run Kong in 2 DCs with a Cassandra cluster. Kong runs fine after those prints occur, no issue, but I thought to bring it up in case there may be something Kong needs to optimize or tweak in the startup phases of the app.

Original forum post: https://discuss.konghq.com/t/any-cause-for-concerns-here/879

@Tieske Let me know if you would like any further info!

Tieske commented 6 years ago

When Kong starts, it initializes the lua-resty-worker-events library for cross-worker events. This is done in the init_worker phase. That library also posts an event by itself, a started event. In this case there are 8 workers, hence 8 started events, which usually get IDs 1 through 8.

As we can see in the logs, those are the ones that get lost. Since Kong itself does not use those, this ends up being harmless.

The question is: why does this happen?

The events library is set to poll every 1 second and retain events for 5 seconds; see https://github.com/Kong/kong/blob/0.13.1/kong/init.lua#L212-L213. Polling for events is done in a timer. The problem occurs because the timer, while scheduled in the init_worker phase, does not actually run there: the init_worker phase cannot yield, so timers are only executed once the phase has completed.

So the initialization takes more than 5 seconds, and some of the events have already expired by the time the timer finally polls for them. Hence the dropping event errors we see.
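The timing can be sketched with a toy model (Python, purely illustrative; this is not the actual Lua library code). Workers started at 18:45:53 and the errors appear at 18:45:58, exactly the 5-second retention window later, so by the time the first poll runs the started events have expired. This is simplified: in the real logs each worker dropped only a subset of the eight events.

```python
EVENT_TTL = 5.0  # worker-events retention window, in seconds

def poll(events, now, ttl=EVENT_TTL):
    """Return (delivered, dropped) given events as (id, posted_at) pairs.
    Events older than the TTL at poll time are dropped, mirroring the
    'dropping event' errors in the logs above."""
    delivered = [e for e in events if now - e[1] <= ttl]
    dropped = [e for e in events if now - e[1] > ttl]
    return delivered, dropped

# Eight "started" events posted at t=0 (worker start), ids 1..8.
events = [(i, 0.0) for i in range(1, 9)]

# The poll timer is scheduled at worker start, but init_worker blocks,
# so the first poll only happens after the retention window has elapsed:
delivered, dropped = poll(events, now=5.5)
print(len(delivered), len(dropped))  # 0 8 -- every started event expired
```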

So far I've only seen it with Cassandra as a datastore, so my guess is that the initialization of the DAO is the culprit behind the slow init_worker phase.

Possible solutions:

  1. rearrange the startup order: initialize the DAO first, and only then initialize worker-events
  2. add a setting to worker-events allowing a longer (deviating) TTL during the init_worker phase
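Option 2 can be illustrated with a similar toy model (Python, hypothetical; the real worker-events option name and semantics may differ): if events posted while init_worker is still running are retained with a longer TTL, the started events survive until the first poll.

```python
def poll_with_init_ttl(events, now, init_done_at, ttl=5.0, init_ttl=60.0):
    """Drop events only once they exceed their effective TTL.
    Events posted before init finished get the longer init_ttl
    (a hypothetical knob, not an actual worker-events setting)."""
    delivered, dropped = [], []
    for event_id, posted_at in events:
        effective_ttl = init_ttl if posted_at < init_done_at else ttl
        (delivered if now - posted_at <= effective_ttl else dropped).append(event_id)
    return delivered, dropped

events = [(i, 0.0) for i in range(1, 9)]  # 8 "started" events at worker start
# First poll at t=5.5, init finished at t=5.0: with the plain 5s TTL these
# would all be dropped; with the init-phase TTL they are all delivered.
delivered, dropped = poll_with_init_ttl(events, now=5.5, init_done_at=5.0)
print(delivered, dropped)  # [1, 2, 3, 4, 5, 6, 7, 8] []
```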

@thibaultcha any other ideas?

thibaultcha commented 6 years ago

@Tieske I think the explanation is on point and makes sense, but I do not see how init_worker could take more than 5s to execute without yielding...

There is no initialization for Cassandra in init_worker; that hook is used by the PostgreSQL strategy for its TTL implementation.

This is going to be hard to debug without a reproducible case... Luckily it is not harmful.

thibaultcha commented 6 years ago

Oh, it just struck me that init_worker does not yield anyway. It is interesting that the last two workers do not observe this error, either.

jeremyjpj0916 commented 6 years ago

If either of you would like me to drop a modified core file into my local build, I can tell you whether anything "fixes" the problem in our dev environment. Just show me a diff 😄. Then again, if it's truly harmless I don't want you to waste your time on it; I just figured that if it happens for others, getting these statements to stop would save people from asking the same question again later.

thibaultcha commented 6 years ago

A culprit I have in mind so far would be some blocking I/O somewhere (e.g. using LuaSocket) preventing the nginx event loop from updating its cached time, so the timer never fires until the init_worker phase is over. But that is just a theory.

thibaultcha commented 6 years ago

@jeremyjpj0916 Do you have any plugin executing code in their init_worker phase?

jeremyjpj0916 commented 6 years ago

@thibaultcha Sorry to be dense on the matter, but can you give me an example of a plugin Kong currently ships that does so? What am I looking for exactly? I run a global variant of the http-log plugin Kong offers; I just format the data to go into Splunk, so no real difference there. I also use the unmodified statsd plugin Kong has, globally, and I have noticed lines of code in the statsd plugin that had me scratching my head, like so:

https://github.com/Kong/kong/blob/master/kong/plugins/statsd/handler.lua

local function log(premature, conf, message)
  if premature then
    return
  end

Is this code related to ensuring the callback does not execute until a certain phase has been reached?

Note our Kong nodes do have load balancers in front of them with health checks calling into Kong, so maybe one of those global log plugins getting triggered too early causes this? My health-check endpoint is just a dummy proxy with the request-termination plugin enabled on it to return 200s.

Tieske commented 6 years ago

@jeremyjpj0916 the code you are referring to is a callback for a timer. The premature flag indicates whether the timer ran prematurely, which only happens when the Nginx worker is exiting (basically it says "I'm not executing this timer, but cancelling it because of a shutdown/reload").
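That contract can be sketched with a small analog (Python here, not Kong code; in OpenResty the callback is handed to ngx.timer.at): the runtime invokes the callback with premature set to true when the worker is exiting, and the callback bails out instead of doing its work.

```python
def make_log_callback(sink):
    """Build a timer callback analogous to the statsd handler's log().
    The first argument says whether the timer fired 'prematurely',
    i.e. the worker is shutting down and the timer is being cancelled."""
    def log(premature, conf, message):
        if premature:
            return  # worker exiting: cancel, do no work
        sink.append((conf, message))
    return log

sink = []
cb = make_log_callback(sink)
cb(False, {"host": "statsd"}, "request logged")  # normal timer fire
cb(True, {"host": "statsd"}, "ignored")          # shutdown: dropped
print(sink)  # [({'host': 'statsd'}, 'request logged')]
```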

thibaultcha commented 6 years ago

@jeremyjpj0916

@Tieske and I had a chat and we talked about this issue. @Tieske will make changes to the worker-events library that should make those errors disappear. Blocking I/O within init_worker is considered fine by us (the core and/or plugins should be able to do it, and many already do), so we believe the fix has to be in the library instead.

thibaultcha commented 6 years ago

Sorry to be dense on the matter, but can you give me an example plugin Kong currently has that does so?

I am not sure if any of the open source plugins do so already, but if you have a custom plugin that implements the init_worker phase, and that plugin does a request, or accesses the database via the DAO, then that is what I am referring to. Do you? If not, then it might be I/O done by the core. But regardless, we very much intend to preserve that possibility.

jeremyjpj0916 commented 6 years ago

None of the plugins we built use the DB, nor am I familiar with how to write a plugin that implements the init_worker phase. Generally, when I write a plugin I pick one of the Kong open source plugins with a somewhat similar codebase as a starting point and modify from there, so it's doubtful; as I learn the application better, understanding which phases plugins run in and how to utilize them will probably come in handy. So I am guessing it may be I/O done by the core. I also have Kubernetes readiness probes with exec health checks calling the kong health CLI command, and I generally notice the health check fail twice before Kong is officially up with kong health returning. The command is considered successful if its exit code is 0:

      readinessProbe:
        exec:
          command:
            - kong
            - health
        failureThreshold: 2
        initialDelaySeconds: 20
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 3

Maybe it's the readiness probe calls doing it via CLI commands? I tried studying https://github.com/Kong/kong/blob/master/kong/cmd/health.lua but I don't see anything specifically doing what you mention would trigger it, @thibaultcha. It would be easy to test and reproduce locally, though, if Kong runs a Kubernetes cloud testing env and sets up a readiness probe that hits the kong health CLI command before Kong is up and healthy. One thing to note, since I am sure you are thinking there is no way Kong takes 20 seconds to boot: I had to push that initial delay out because our Dynatrace agent also takes some extra time to fire up; otherwise my readiness probes kept cutting my pods into a constant redeploy loop, thinking the pod was in an error state.

hishamhm commented 6 years ago

Tagging this as "bug" for bug-tracker organizational purposes, even though it is pretty harmless and will be fixed in the worker-events library. This will be tracked in kong/lua-resty-worker-events#9; in the meantime, leaving this open until the dependency bump with the fix lands in Kong next.

jeremyjpj0916 commented 6 years ago

Commented on the PR, but I fully believe this will be resolved once merged 👍.

thibaultcha commented 6 years ago

Fixed with #3443, thanks for reporting @jeremyjpj0916!