StarwingRC closed this issue 2 years ago.
Are the other examples also close to each other (time-wise)?
Hello Tieske, yes, pretty close:
(ok) f8d5ef6c-3ae5-910b-aee1-ebf38eb85273 (first occurrence in log at 17/08/2021 17:23:31.807) - no occurrence in kong log but in upstream systems; at some point this reqid gets an HTTP 400 error at 17/08/2021 17:23:34.588
(nok) f8d5ef6c-3ae5-410b-aee1-ebf38eb85273 (17/08/2021 17:23:33.306) - single occurrence in kong, but not in upstream; returned http 400 error at 17/08/2021 17:23:33.306
<<<<<<<<<>>>>>>>>>>>
(ok) 74b8fb52-ade7-9d0d-bcb6-dfc1e2adf592 (first occurrence in log at 17/08/2021 23:00:48.358) - no occurrence in kong log but in upstream systems; first HTTP 500 error for this is at 17/08/2021 23:00:49.198
(nok) 74b8fb52-ade7-4d0d-bcb6-dfc1e2adf592 (17/08/2021 23:00:48.557) - single occurrence in kong, but not in upstream; returned http 500 error at 17/08/2021 23:00:48.557
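A quick way to see that each ok/nok pair above differs only in the UUID version nibble (the 13th hex digit, first character of the third dash-separated group). This is a sketch in Python; the helper name `version_nibble` is mine, not from Kong:

```python
def version_nibble(uuid_str):
    # The version nibble is the first hex digit of the third dash-separated group.
    return uuid_str.split("-")[2][0]

pairs = [
    ("f8d5ef6c-3ae5-910b-aee1-ebf38eb85273", "f8d5ef6c-3ae5-410b-aee1-ebf38eb85273"),
    ("74b8fb52-ade7-9d0d-bcb6-dfc1e2adf592", "74b8fb52-ade7-4d0d-bcb6-dfc1e2adf592"),
]

for upstream_id, kong_id in pairs:
    # Positions where the two strings disagree.
    diff = [i for i, (a, b) in enumerate(zip(upstream_id, kong_id)) if a != b]
    print(diff, version_nibble(upstream_id), version_nibble(kong_id))
    # → [14] 9 4 for both pairs: only the version nibble changed.
```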
@CrystianRC, can you also share the correlation plugin configuration?
@CrystianRC is it possible that this comes from the client? It looks like the plugin only sets it if there is no header already? https://github.com/Kong/kong/blob/master/kong/plugins/correlation-id/handler.lua#L57
I am not sure if there needs to be an option to always override an existing header?
Hello and thank you for your replies.
Below is the plugin's configuration:
```yaml
plugins:
```
I investigated the issue further and it is not tied to a specific call.
> is it possible that this comes from the client? It looks like the plugin only sets it if there is no header already?
I doubt this is being set by the client. What puzzles me is that both UUIDs are 99% identical, differing in only one digit. I don't know, but this looks impossible to me.
If the random state is the same, this should produce the same result: https://github.com/thibaultcha/lua-resty-jit-uuid/blob/master/lib/resty/jit-uuid.lua#L161
But could the result sometimes be different? If the random state is the same, then that is strange. What environment are you running this in? The random seed might have something to do with it, but then I would expect more than just a single character to change. So it feels strange.
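The point about random state can be illustrated with a small sketch. This mirrors roughly what `generate_v4` in lua-resty-jit-uuid does (random bytes, then force the version and variant bits), rewritten in Python for illustration; the function name `sketch_v4` is mine:

```python
import random

def sketch_v4(rng):
    """Build a v4-style UUID from an RNG: 16 random bytes, then force
    the version nibble to 4 and the variant bits to 10xx."""
    b = bytearray(rng.getrandbits(8) for _ in range(16))
    b[6] = (b[6] & 0x0F) | 0x40   # version nibble -> always 4
    b[8] = (b[8] & 0x3F) | 0x80   # variant -> 8, 9, a or b
    h = b.hex()
    return "-".join((h[:8], h[8:12], h[12:16], h[16:20], h[20:]))

# Same seed (same random state) -> identical UUID, as noted above.
a = sketch_v4(random.Random(42))
b = sketch_v4(random.Random(42))
assert a == b
print(a[14])  # the version nibble: always '4' by construction
```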
The bytes:

```
96 -> 46
91 -> 41
9d -> 4d
```

So in all your inputs the `9` has turned into a `4`. I guess the `0x40` here means version 4: https://github.com/thibaultcha/lua-resty-jit-uuid/blob/master/lib/resty/jit-uuid.lua#L161
So I wrote this:

```lua
local bit = require 'bit'
local tohex = bit.tohex
local band = bit.band
local bor = bit.bor

-- For every possible byte, keep the low nibble and force the high
-- nibble to 4 -- the same masking jit-uuid applies to the version byte.
for i = 0, 255 do
  print(tohex(bor(band(i, 0x0F), 0x40), 2))
end
```
From that I think Kong should always produce `4x` for that.
When you say:
> We can't find this correlation-id in kong logs, but we find it in the upstream systems
How do you write that to Kong logs, and how can you see it in upstream system?
Hello, @bungle , our kong is deployed in kubernetes and we "log" the correlation id using the following directive that is part of a configMap:
```yaml
KONG_NGINX_HTTP_LOG_FORMAT: compression '[$time_iso8601] ip=[$http_X_Forwarded_For] scheme="$scheme host=$host path=$request_uri" status=$status resp_size=$body_bytes_sent user_agent="$http_user_agent" rt=$request_time requestId=$http_X_Request_Id'
```
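To cross-check a correlation id between Kong and the upstream systems, the `requestId=` field can be pulled out of each access-log line. A minimal Python sketch; the log line below is made up, only its shape follows the `log_format` directive above:

```python
import re

# Hypothetical access-log line in the format produced by the directive above.
line = ('[2021-08-17T17:23:33+00:00] ip=[203.0.113.7] scheme="https host=example.com '
        'path=/api/v1/orders" status=400 resp_size=123 user_agent="curl/7.79.1" '
        'rt=0.012 requestId=f8d5ef6c-3ae5-410b-aee1-ebf38eb85273')

# Pull out the correlation id so it can be matched against upstream logs.
m = re.search(r'requestId=([0-9a-f-]{36})', line)
print(m.group(1))  # → f8d5ef6c-3ae5-410b-aee1-ebf38eb85273
```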
> how can you see it in upstream system
This X_Request_Id header is passed along with the call and gets logged in the other systems' logs.
For every request made, we have a new requestId generated and logged.
This is how we track a call throughout all the systems.
@CrystianRC It's kind of a long shot, but it's at least theoretically possible that the physical servers running your instances on Amazon have some sort of memory corruption problem. How feasible would it be for you to check if this happens on a different AWS instance? Perhaps you have a staging server that you can move to a different region, reproducing production traffic?
So far:

- the `4` in the uuid version has turned into a `9`
- the `9`-version uuid you don't find in kong logs
- the uuid is corrupted only some of the time

So where does the corruption happen?

Hello @CrystianRC,
When Kong starts up, do you see any log lines that look like `could not seed from OpenSSL RAND_bytes`? (from https://github.com/Kong/kong/blob/master/kong/globalpatches.lua#L292-L294)
Also, could you confirm if the similar UUID came from the same Nginx worker (worker with the same PID) or different workers?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
You saved me. So there are three ways we can operate Kong logs.
Summary
- Kong version: 2.4.1
- Mode: db-less
- Deployment platform: Kubernetes (EKS) 1.17
- Number of pod instances: 3
- Operating system: Amazon Linux 2
Summary: Normal operations, nothing fancy, but, from time to time, we get some strange entries in the logs which trigger some false positive alerts, involving correlation-ids:
Let's take this case: (ok) a91ba066-5efb-9677-b377-6a6b1a1c8233 and (nok) a91ba066-5efb-4677-b377-6a6b1a1c8233
We can't explain what is happening. This is not a general event. It just happens randomly, or at least that is what we think so far.
Some other examples of correlation-ids:
Have you encountered this behavior before? How could we troubleshoot this further? More importantly :) how could we fix it?
Thank you in advance for your comments and help!