Open kinseii opened 1 month ago
would you please provide a sample log file that we can use to reproduce the problem ? it seems to me the first JSON example provided is something composed from the output side, we need the input
@edsiper I apologize for the possibly noob questions. How can I compose a record from the INPUT side?
If I look at the k8s pod logs, the JSON there is exactly the same as in the OUTPUT, except for the absence of the timestamp and stdout metadata:
{"@t":"2024-07-15T20:20:16.1376705Z","@m":"An unhandled exception has occurred while executing the request.",**********************************}}
The log field contains about 6.5 thousand characters, while our buffer limit is 5 megabytes (Buffer_Max_Size 5MB), so the limit shouldn't affect it. Besides, the records themselves do appear in OpenSearch; if they had been cut off because of the Skip_Long_Lines On parameter, they wouldn't exist at all, I suppose.
Could someone please tell me where the _p field is added? It might narrow down the problem. Am I correct in assuming that the kubernetes filter adds this field while removing the timestamp and stdout metadata from the log field?
Also, I have a hunch. Since each affected record has a correct duplicate (with the _p field added and the timestamp and stdout removed from the log field), it is possible that under some conditions the kubernetes filter does not delete the original record, and fluent-bit sends both records to the OUTPUT.
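To confirm that two such records really describe the same event, a quick diagnostic can check whether the unprocessed log value is the processed one with a CRI prefix (`<timestamp> <stream> <F|P> `) prepended. This is just a sketch in Python, not part of fluent-bit, and the `CRI_PREFIX` regex is my own approximation:

```python
import re

# Sketch: the CRI prefix has the shape "<RFC3339 timestamp> <stdout|stderr> <F|P> "
CRI_PREFIX = re.compile(r'^\S+ (stdout|stderr) [FP] ')

def is_same_event(unprocessed_log: str, processed_log: str) -> bool:
    """True if unprocessed_log equals processed_log with a CRI prefix prepended."""
    stripped = CRI_PREFIX.sub('', unprocessed_log, count=1)
    # require that a prefix was actually removed and the remainder matches
    return stripped != unprocessed_log and stripped == processed_log

raw = '2024-07-15T09:14:05.031661946Z stdout F {"@t":"2024-07-15T09:14:05.0307804Z"}'
clean = '{"@t":"2024-07-15T09:14:05.0307804Z"}'
print(is_same_event(raw, clean))  # True
```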
For comparison, here are both entries.
Correct record (for security reasons I removed most of the fields with the log_ prefix that the lua script parses and inserts, leaving a few for clarity):
{
  "_index": "fluent-bit-*******************-2024.07.14-000120",
  "_id": "RX2rtZAB_McTC4bCrsTf",
  "_score": 1,
  "_source": {
    "@timestamp": "2024-07-15T09:14:05.031Z",
    "time": "2024-07-15T09:14:05.031661946Z",
    "stream": "stdout",
    "_p": "F",
    "log": "{\"@t\":\"2024-07-15T09:14:05.0307804Z\",\"@m\":\"An unhandled exception has occurred while executing the request.\",\"*******************\",}",
    "log_@m": "An unhandled exception has occurred while executing the request.",
    "log_@l": "Error",
    "log_@t": "2024-07-15T09:14:05.0307804Z",
    "log_EventId_Name": "UnhandledException",
    "kubernetes": {
      "namespace_name": "*******************",
      "annotations": {
        "*******************": "*****"
      },
      "labels": {
        "*******************": "*****"
      },
      "container_name": "*******************",
      "pod_id": "*******************",
      "container_hash": "*******************",
      "pod_name": "*******************",
      "container_image": "*******************",
      "docker_id": "*******************",
      "host": "*******************"
    }
  },
  "fields": {
    "@timestamp": [
      "2024-07-15T09:14:05.031Z"
    ],
    "log_@t": [
      "2024-07-15T09:14:05.030Z"
    ],
    "time": [
      "2024-07-15T09:14:05.031Z"
    ]
  }
}
Unprocessed record. Some fields are missing, and the CRI metadata (timestamp and stdout) has not been removed. Since the log field value is a plain string rather than JSON, the lua script skips the record, so there are no fields with the log_ prefix. However, the record is otherwise rich in kubernetes data (annotations, labels, etc.). Also please note that the _id fields differ only in the second character:
{
  "_index": "fluent-bit--*******************-2024.07.14-000120",
  "_id": "Rn2rtZAB_McTC4bCrsTf",
  "_score": 1,
  "_source": {
    "@timestamp": "2024-07-15T09:14:05.031Z",
    "log": "2024-07-15T09:14:05.031661946Z stdout F {\"@t\":\"2024-07-15T09:14:05.0307804Z\",\"@m\":\"An unhandled exception has occurred while executing the request.\",\"*******************\"}",
    "kubernetes": {
      "namespace_name": "*******************",
      "annotations": {
        "*******************": "*****"
      },
      "labels": {
        "*******************": "*****"
      },
      "container_name": "*******************",
      "pod_id": "*******************",
      "container_hash": "*******************",
      "pod_name": "*******************",
      "container_image": "*******************",
      "docker_id": "*******************",
      "host": "*******************"
    }
  },
  "fields": {
    "@timestamp": [
      "2024-07-15T09:14:05.031Z"
    ]
  }
}
I'm trying to take a look at it. What is the main function of this Lua script?
/fluent-bit/scripts/common.lua
I tried to set up a CRI-O-based k8s environment, and I found the following:
The cri parser should handle CRI-O-formatted logs, which look like this:
2023-10-06T00:17:09.669794202Z stdout F Your log message here
2023-10-06T00:17:09.669794202Z stdout P Another log pt 1
2023-10-06T00:17:09.669794202Z stdout P Another log pt 2
2023-10-06T00:17:10.113242941Z stderr F Another log final
This log format is composed as follows:
<timestamp> <stream (stdout or stderr)> <F (full) | P (partial)> <the main part of the log>
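The split described above can be sketched with a regex roughly like the one below (a simplified approximation, not the exact parser that ships with fluent-bit). The key point for this issue: when the regex matches, the line is split into time, stream, the partial/full flag (which ends up as _p), and the message; when it does not match, the whole line stays in log untouched, which is exactly what the unprocessed record shows.

```python
import re

# Simplified sketch of a CRI-O log line parser (not the exact built-in regex)
CRI = re.compile(
    r'^(?P<time>[^ ]+) (?P<stream>stdout|stderr) (?P<logtag>[FP]) (?P<log>.*)$'
)

def parse_cri(line: str) -> dict:
    m = CRI.match(line)
    if m:
        return m.groupdict()   # time/stream/logtag/log split out; logtag becomes _p
    return {"log": line}       # no match: the whole line stays in "log"

ok = parse_cri('2023-10-06T00:17:09.669794202Z stdout F Your log message here')
print(ok["stream"], ok["logtag"], ok["log"])  # stdout F Your log message here

bad = parse_cri('not a CRI line at all')
print(bad)  # {'log': 'not a CRI line at all'}
```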
Also, it seems that the example log is ingested into Elasticsearch or OpenSearch, right?
Elasticsearch uses the @timestamp field to maintain the timestamp metadata.
So this key is not related to the fluent-bit issue; it just comes from Elasticsearch/OpenSearch responses to the search query.
The incorrect record seems not to have been parsed correctly by the cri parser. The stream key inside the _source key is only created when parsing with the cri parser succeeds.
Plus, the _id field is also attached by Elasticsearch/OpenSearch.
Also please note that the _id fields differ only in the second character:
This key guarantees the uniqueness of the ID and is simply added at ingest time.
Is this type of CRI-O format parsing error frequently observed in your environment? I couldn't reproduce this parsing error in my dev box.
Does the AKS version indicate the actual K8s version? I wonder if there is a specific change or "feature" in that version, so we may need to test with it too.
moving it to the next milestone
I'm trying to take a look at it. What is the main function of this Lua script?
/fluent-bit/scripts/common.lua
Hi, this script parses the log field, if it contains a JSON string, using the dkjson library.
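The behavior described can be illustrated with a short sketch (in Python here for brevity; the actual script is Lua and uses dkjson). The relevant point is that a record whose log field still carries the CRI prefix is not valid JSON, so the expansion is skipped entirely:

```python
import json

def expand_log(record: dict) -> dict:
    """Sketch of the described logic: if record['log'] parses as a JSON
    object, lift its keys into the record under a 'log_' prefix;
    otherwise leave the record untouched."""
    try:
        parsed = json.loads(record.get("log", ""))
    except (ValueError, TypeError):
        return record   # not valid JSON, e.g. the CRI prefix is still attached
    if isinstance(parsed, dict):
        for key, value in parsed.items():
            record["log_" + key] = value
    return record

clean = expand_log({"log": '{"@m": "boom", "@l": "Error"}'})
print(clean["log_@m"])  # boom

# a record where the CRI prefix was not stripped is not valid JSON, so it is skipped:
raw = expand_log({"log": '2024-07-15T09:14:05Z stdout F {"@m": "boom"}'})
print("log_@m" in raw)  # False
```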
Also, it seems that the example log is ingested in Elasticsearch or OpenSearch, right?
Yes, that's right. Fluent-bit sends data to OpenSearch.
Is this type of CRI-O format parsing error frequently observed in your environment? I couldn't reproduce this parsing error in my dev box.
It wasn't very frequent, rare enough. After I removed everything from multiline.parser except cri, it never happened again.
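For reference, the workaround boils down to a tail input like the following sketch (the path and tag are assumptions for illustration, not the actual config from this cluster; Buffer_Max_Size and Skip_Long_Lines are the values mentioned earlier in the thread):

```ini
[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    Tag               kube.*
    # only the cri multiline parser; listing additional parsers here
    # alongside cri is what coincided with the duplicated records
    multiline.parser  cri
    Buffer_Max_Size   5MB
    Skip_Long_Lines   On
```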
I'm sure it's the same thing: https://github.com/fluent/fluent-bit/issues/6515
Maybe we could also try tagging the author and the others from that issue who have the same problem.
I'm trying to take a look at it. What is the main function of this Lua script?
/fluent-bit/scripts/common.lua
I uploaded this script here: https://github.com/kinseii/fluent-bit-lua-scripts
Bug Report
Some records have CRI metadata in the log field, such as the timestamp and stdout. Note that the _p field is missing.
Fluent-bit settings:
Most of the time this metadata is deleted, but sometimes it is not. Besides, if we look at the surrounding records, we will find a similar log, differing by microseconds, which is processed properly, i.e. with the metadata removed and the _p field added. The lua script further parses the log field if it is a JSON string, but since the metadata is prepended to it, such records are not processed.
AKS v1.28.10, OpenSearch v2.12.0, Fluent-bit v3.0.3
There is a similar problem here, but we wouldn't want to lose multiline parsing for everything other than cri: https://github.com/fluent/fluent-bit/issues/6515
So, can anyone tell me why this is happening and where to look?