StarwingRC closed this issue 2 years ago.
Are the other examples also close to each other (time-wise)?
Hello Tieske, yes, pretty close:
(ok) f8d5ef6c-3ae5-910b-aee1-ebf38eb85273 (first occurrence in log at 17/08/2021 17:23:31.807) - no occurrence in kong log but in upstream systems; at some point this reqid gets an HTTP 400 error at 17/08/2021 17:23:34.588
(nok) f8d5ef6c-3ae5-410b-aee1-ebf38eb85273 (17/08/2021 17:23:33.306) - single occurrence in kong, but not in upstream; returned http 400 error at 17/08/2021 17:23:33.306
<<<<<<<<<>>>>>>>>>>>
(ok) 74b8fb52-ade7-9d0d-bcb6-dfc1e2adf592 (first occurrence in log at 17/08/2021 23:00:48.358) - no occurrence in kong log but in upstream systems; first HTTP 500 error for this is at 17/08/2021 23:00:49.198
(nok) 74b8fb52-ade7-4d0d-bcb6-dfc1e2adf592 (17/08/2021 23:00:48.557) - single occurrence in kong, but not in upstream; returned http 500 error at 17/08/2021 23:00:48.557
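A quick way to see that each ok/nok pair above differs only in the UUID version nibble (the 13th hex digit, first character of the third dash-separated group). This is a sketch in Python; the helper name `version_nibble` is mine, not from Kong:

```python
def version_nibble(uuid_str):
    # The version nibble is the first hex digit of the third dash-separated group.
    return uuid_str.split("-")[2][0]

pairs = [
    ("f8d5ef6c-3ae5-910b-aee1-ebf38eb85273", "f8d5ef6c-3ae5-410b-aee1-ebf38eb85273"),
    ("74b8fb52-ade7-9d0d-bcb6-dfc1e2adf592", "74b8fb52-ade7-4d0d-bcb6-dfc1e2adf592"),
]

for upstream_id, kong_id in pairs:
    # Positions where the two strings disagree.
    diff = [i for i, (a, b) in enumerate(zip(upstream_id, kong_id)) if a != b]
    print(diff, version_nibble(upstream_id), version_nibble(kong_id))
    # → [14] 9 4 for both pairs: only the version nibble changed.
```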
@CrystianRC, can you also share the correlation plugin configuration?
@CrystianRC is it possible that this comes from the client? It looks like the plugin only sets it if there is no header already? https://github.com/Kong/kong/blob/master/kong/plugins/correlation-id/handler.lua#L57
I am not sure if there needs to be an option to always override an existing header?
Hello and thank you for your replies.
Below is the plugin's configuration:
```yaml
plugins:
```
I investigated the issue further and it is not tied to a specific call.
> is it possible that this comes from the client? It looks like the plugin only sets it if there is no header already?
I doubt this is being set by the client. What puzzles me is that both UUIDs are 99% identical, differing in only one digit. I don't know, but this looks impossible to me.
If the random state is the same, this should produce the same result: https://github.com/thibaultcha/lua-resty-jit-uuid/blob/master/lib/resty/jit-uuid.lua#L161
But could the result sometimes be different? If the random state is the same, then that is strange. What environment are you running this in? The random seed might have something to do with it, but then I would expect more than just a single character to change. So it feels strange.
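The point about random state can be illustrated with a small sketch. This mirrors roughly what `generate_v4` in lua-resty-jit-uuid does (random bytes, then force the version and variant bits), rewritten in Python for illustration; the function name `sketch_v4` is mine:

```python
import random

def sketch_v4(rng):
    """Build a v4-style UUID from an RNG: 16 random bytes, then force
    the version nibble to 4 and the variant bits to 10xx."""
    b = bytearray(rng.getrandbits(8) for _ in range(16))
    b[6] = (b[6] & 0x0F) | 0x40   # version nibble -> always 4
    b[8] = (b[8] & 0x3F) | 0x80   # variant -> 8, 9, a or b
    h = b.hex()
    return "-".join((h[:8], h[8:12], h[12:16], h[16:20], h[20:]))

# Same seed (same random state) -> identical UUID, as noted above.
a = sketch_v4(random.Random(42))
b = sketch_v4(random.Random(42))
assert a == b
print(a[14])  # the version nibble: always '4' by construction
```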
The bytes:

```
96 -> 46
91 -> 41
9d -> 4d
```

So in all your inputs the `9` has turned into a `4`. I guess the `0x40` here means version 4: https://github.com/thibaultcha/lua-resty-jit-uuid/blob/master/lib/resty/jit-uuid.lua#L161
So I wrote this:

```lua
local bit = require 'bit'
local tohex = bit.tohex
local band = bit.band
local bor = bit.bor

-- For every possible byte, keep the low nibble and force the high
-- nibble to 4 -- the same masking jit-uuid applies to the version byte.
for i = 0, 255 do
  print(tohex(bor(band(i, 0x0F), 0x40), 2))
end
```
From that I think Kong should always produce `4x` for that.
When you say:
> We can't find this correlation-id in kong logs, but we find it in the upstream systems
How do you write that to Kong logs, and how can you see it in upstream system?
Hello, @bungle , our kong is deployed in kubernetes and we "log" the correlation id using the following directive that is part of a configMap:
```yaml
KONG_NGINX_HTTP_LOG_FORMAT: compression '[$time_iso8601] ip=[$http_X_Forwarded_For] scheme="$scheme host=$host path=$request_uri" status=$status resp_size=$body_bytes_sent user_agent="$http_user_agent" rt=$request_time requestId=$http_X_Request_Id'
```
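To cross-check a correlation id between Kong and the upstream systems, the `requestId=` field can be pulled out of each access-log line. A minimal Python sketch; the log line below is made up, only its shape follows the `log_format` directive above:

```python
import re

# Hypothetical access-log line in the format produced by the directive above.
line = ('[2021-08-17T17:23:33+00:00] ip=[203.0.113.7] scheme="https host=example.com '
        'path=/api/v1/orders" status=400 resp_size=123 user_agent="curl/7.79.1" '
        'rt=0.012 requestId=f8d5ef6c-3ae5-410b-aee1-ebf38eb85273')

# Pull out the correlation id so it can be matched against upstream logs.
m = re.search(r'requestId=([0-9a-f-]{36})', line)
print(m.group(1))  # → f8d5ef6c-3ae5-410b-aee1-ebf38eb85273
```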
> how can you see it in upstream system
This X_Request_Id header is passed along with the call and gets logged in the other systems' logs.
For every request made, we have a new requestId generated and logged.
This is how we track a call throughout all the systems.
@CrystianRC It's kind of a long shot, but it's at least theoretically possible that the physical servers running your instances on Amazon have some sort of memory corruption problem. How feasible would it be for you to check if this happens on a different AWS instance? Perhaps you have a staging server that you can move to a different region, reproducing production traffic?
So far:

- the `4` in the uuid version has turned into a `9`
- the `9`-version uuid you don't find in kong logs
- the uuid is corrupted only some of the time

So where does the corruption happen?

Hello @CrystianRC,
When Kong starts up, do you see any log lines that look like `could not seed from OpenSSL RAND_bytes`? (from https://github.com/Kong/kong/blob/master/kong/globalpatches.lua#L292-L294)
Also, could you confirm if the similar UUID came from the same Nginx worker (worker with the same PID) or different workers?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
You saved me. So there are three ways we can operate Kong logs.
Summary
- Kong version: 2.4.1
- Mode: db-less
- Deployment platform: Kubernetes (EKS) 1.17
- Number of pod instances: 3
- Operating system: Amazon Linux 2
Summary: Normal operations, nothing fancy, but, from time to time, we get some strange entries in the logs which trigger some false positive alerts, involving correlation-ids:
Let's take this case: (ok) a91ba066-5efb-9677-b377-6a6b1a1c8233 and (nok) a91ba066-5efb-4677-b377-6a6b1a1c8233
We can't explain what is happening. This is not a general event. It just happens randomly, or at least that is what we think so far.
Some other examples of correlation-ids:
Have you encountered this behavior before? How could we troubleshoot this further? More importantly :) how could we fix it?
Thank you in advance for your comments and help!