fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

CRI metadata is not removed in the log field #9088

Open kinseii opened 1 month ago

kinseii commented 1 month ago

Bug Report

Some records still contain CRI metadata in the log field, such as the timestamp and the stream (stdout). Note that the _p field is also missing:

{
  "_index": "fluent-bit-*****-*******-2024.07.14-000120",
  "_id": "Rn2rtZA*******************",
  "_score": 1,
  "_source": {
    "@timestamp": "2024-07-15T09:14:05.031Z",
    "log": "2024-07-15T09:14:05.031661946Z stdout F {\"@t\":\"2024-07-15T09:14:05.0307804Z\",\"@m\":\"An unhandled exception has occurred while executing the request.*********************\"}}\n",
    "kubernetes": {
      "namespace_name": "*****",
      "annotations": {
        "*****************************************": "*******"
      },
      "labels": {
        "*****************************************": "*******"
      },
      "container_name": "***********",
      "pod_id": "*****************************************",
      "container_hash": "*******************************************************",
      "pod_name": "************************************",
      "container_image": "********************************************************",
      "docker_id": "********************************************************",
      "host": "********************************************************"
    }
  },
  "fields": {
    "@timestamp": [
      "2024-07-15T09:14:05.031Z"
    ]
  }
}

Fluent-bit settings:

    [SERVICE]
        Daemon Off
        Flush 1
        Log_Level info
        HTTP_Server On
        HTTP_Listen 0.0.0.0
        HTTP_Port 2020
        Health_Check On
        storage.path /tmp/fluent-bit
        storage.sync full
        storage.checksum off
        storage.max_chunks_up 128
        storage.backlog.mem_limit 5M

    [INPUT]
        Name tail
        Path /var/log/containers/*.log
        multiline.parser docker, cri, python, go, java
        Tag kube.*
        Skip_Long_Lines On
        Buffer_Max_Size 5MB
        Mem_Buf_Limit 500MB
        Storage.Type filesystem

    [INPUT]
        Name systemd
        Tag host.*
        Systemd_Filter _SYSTEMD_UNIT=kubelet.service
        Read_From_Tail On

    [FILTER]
        Name kubernetes
        Match kube.*
        Buffer_Size 2MB
        Merge_Log Off

    [FILTER]
        Name lua
        Match kube.*
        script /fluent-bit/scripts/common.lua
        call common

Most of the time this metadata is removed, but sometimes it is not. Moreover, among the surrounding records there is a similar log entry, differing only by microseconds, that is processed properly, i.e. the metadata is removed and the _p field is added. The lua script further parses the log field if it contains a JSON string, but since the metadata is prepended to it, the field is not processed.

Environment: AKS v1.28.10, OpenSearch v2.12.0, Fluent-bit v3.0.3

A similar problem is described here, but we wouldn't want to lose multiline parsing for everything except cri: https://github.com/fluent/fluent-bit/issues/6515

So, can anyone tell me why this is happening and where to look?

edsiper commented 1 month ago

Would you please provide a sample log file that we can use to reproduce the problem? It seems to me the first JSON example provided is something composed on the output side; we need the input.

kinseii commented 1 month ago

Would you please provide a sample log file that we can use to reproduce the problem? It seems to me the first JSON example provided is something composed on the output side; we need the input.

@edsiper I apologize for the possibly noob questions. How can I compose a record from the INPUT side?

If I look at the k8s pod logs, the JSON there is exactly the same as in the OUTPUT, except for the absence of the timestamp and stdout metadata:

{"@t":"2024-07-15T20:20:16.1376705Z","@m":"An unhandled exception has occurred while executing the request.",**********************************}}

The log field contains about 6.5 thousand characters, but our limit is Buffer_Max_Size 5MB, so that shouldn't affect it. Besides, the records themselves do appear in OpenSearch; if they had been cut off because of the Skip_Long_Lines On parameter, they wouldn't exist at all, I suppose.

kinseii commented 1 month ago

Could someone please tell me where the _p field is added? It might narrow down the problem. Am I correct in assuming that the kubernetes filter adds this field while removing the time and stdout metadata from the log field?

Also, I have a hunch. Since the record has a correct duplicate (with the _p field added and the timestamp and stdout removed from the log field), it is possible that under some conditions the kubernetes filter does not delete the original record, and fluent-bit sends both records to the OUTPUT.

kinseii commented 1 month ago

For comparison, here are both entries.

Correct record (for security reasons I removed most of the fields with the log_ prefix that the lua script parses and inserts, leaving a few for clarity):

{
  "_index": "fluent-bit-*******************-2024.07.14-000120",
  "_id": "RX2rtZAB_McTC4bCrsTf",
  "_score": 1,
  "_source": {
    "@timestamp": "2024-07-15T09:14:05.031Z",
    "time": "2024-07-15T09:14:05.031661946Z",
    "stream": "stdout",
    "_p": "F",
    "log": "{\"@t\":\"2024-07-15T09:14:05.0307804Z\",\"@m\":\"An unhandled exception has occurred while executing the request.\",\"*******************\",}",
    "log_@m": "An unhandled exception has occurred while executing the request.",
    "log_@l": "Error",
    "log_@t": "2024-07-15T09:14:05.0307804Z",
    "log_EventId_Name": "UnhandledException",    
    "kubernetes": {
      "namespace_name": "*******************",
      "annotations": {
        "*******************": "*****",
      },
      "labels": {
        "*******************": "*****",
      },
      "container_name": "*******************",
      "pod_id": "*******************",
      "container_hash": "*******************",
      "pod_name": "*******************",
      "container_image": "*******************",
      "docker_id": "*******************",
      "host": "*******************",
    }
  },
  "fields": {
    "@timestamp": [
      "2024-07-15T09:14:05.031Z"
    ],
    "log_@t": [
      "2024-07-15T09:14:05.030Z"
    ],
    "time": [
      "2024-07-15T09:14:05.031Z"
    ]
  }
}

Unprocessed record. Some fields are missing and the CRI metadata (timestamp and stdout) has not been removed. Since the log field value is not a JSON string (the CRI prefix is still at the beginning), the lua script skips the record, so there are no fields with the log_ prefix. However, the record is rich in kubernetes data (annotations, labels, etc.). Also please note that the _id fields differ only in the second character:

{
  "_index": "fluent-bit--*******************-2024.07.14-000120",
  "_id": "Rn2rtZAB_McTC4bCrsTf",
  "_score": 1,
  "_source": {
    "@timestamp": "2024-07-15T09:14:05.031Z",
    "log": "2024-07-15T09:14:05.031661946Z stdout F {\"@t\":\"2024-07-15T09:14:05.0307804Z\",\"@m\":\"An unhandled exception has occurred while executing the request.\",\"*******************\"}",
    "kubernetes": {
      "namespace_name": "*******************",
      "annotations": {
        "*******************": "*****",
      },
      "labels": {
        "*******************": "*****",
      },
      "container_name": "*******************",
      "pod_id": "*******************",
      "container_hash": "*******************",
      "pod_name": "*******************",
      "container_image": "*******************",
      "docker_id": "*******************",
      "host": "*******************",
    }
  },
  "fields": {
    "@timestamp": [
      "2024-07-15T09:14:05.031Z"
    ]
  }
}

cosmo0920 commented 1 month ago

I'm trying to take a look at it. What is the main function in this Lua script?

/fluent-bit/scripts/common.lua

cosmo0920 commented 4 weeks ago

I tried to set up a CRI-O based k8s environment and I found the following:

The CRI-O parser should parse CRI-O formatted logs, which consist of the following parts:

2023-10-06T00:17:09.669794202Z stdout F Your log message here
2023-10-06T00:17:09.669794202Z stdout P Another log pt 1
2023-10-06T00:17:09.669794202Z stdout P Another log pt 2
2023-10-06T00:17:10.113242941Z stderr F Another log final

This log format is composed of the following parts:

Timestamp  stream (stdout or stderr)  F (full) | P (partial)  The main part of the log
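
For example, the first sample line above, once parsed by the cri parser, should turn into a record roughly like this (the time, stream, _p and log keys are the same ones visible in the "correct record" earlier in this thread):

{
  "time": "2023-10-06T00:17:09.669794202Z",
  "stream": "stdout",
  "_p": "F",
  "log": "Your log message here"
}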

Also, it seems that the example log is ingested into Elasticsearch or OpenSearch, right?

Elasticsearch has an @timestamp field to maintain the timestamp metadata, so this key is not related to a fluent-bit issue; it just comes from the Elasticsearch/OpenSearch search query response.

The incorrect record does not seem to be parsed correctly with the cri parser. The stream key inside _source is only created if parsing with the cri parser succeeds.

Also, the _id field is attached by Elasticsearch/OpenSearch.

Also please note that the _id fields differ only in the second character:

This key guarantees the uniqueness of the document and is just added at ingestion time.

Is this type of CRI-O format parsing error frequently observed in your environment? I couldn't reproduce this parsing error on my dev box.

patrick-stephens commented 4 weeks ago

Does the AKS version indicate the actual K8S version? I wonder if there is a specific change or "feature" in that version so we may need to test with it too.

edsiper commented 3 weeks ago

Moving this to the next milestone.

kinseii commented 3 weeks ago

I'm trying to take a look at it. What is the main function in this Lua script?

/fluent-bit/scripts/common.lua

Hi, this script parses the log field when it contains a JSON string, using the dkjson library.
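
Roughly, the idea is the following (a simplified sketch, not the actual script; the real one handles more cases):

-- Simplified sketch of the idea behind common.lua: decode the log field as
-- JSON and flatten its keys into log_-prefixed fields.
local dkjson = require("dkjson")

function common(tag, timestamp, record)
    local log = record["log"]
    if type(log) ~= "string" then
        return 0, timestamp, record   -- keep the record unchanged
    end

    -- If the CRI prefix (timestamp, stream, F/P flag) is still at the
    -- beginning of the field, JSON decoding fails and the record is
    -- passed through untouched.
    local obj, _, err = dkjson.decode(log)
    if err or type(obj) ~= "table" then
        return 0, timestamp, record
    end

    -- Copy the decoded keys into log_-prefixed fields,
    -- e.g. log_@t, log_@m, as in the "correct record" above.
    for k, v in pairs(obj) do
        record["log_" .. k] = v
    end
    return 1, timestamp, record       -- record was modified
end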

kinseii commented 3 weeks ago

Also, it seems that the example log is ingested in Elasticsearch or OpenSearch, right?

Yes, that's right. Fluent-bit sends data to OpenSearch.

Is this type of CRI-O format parsing error frequently observed in your environment? I couldn't reproduce this parsing error on my dev box.

It didn't happen very often, it was fairly rare. After I removed everything from multiline.parser except cri, it has never happened again. I'm sure it's the same issue as https://github.com/fluent/fluent-bit/issues/6515. Maybe we could also try tagging the author and others from that issue who have the same problem.
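
For reference, the tail input now looks roughly like this (only the multiline.parser line changed compared to the config above):

    [INPUT]
        Name tail
        Path /var/log/containers/*.log
        multiline.parser cri
        Tag kube.*
        Skip_Long_Lines On
        Buffer_Max_Size 5MB
        Mem_Buf_Limit 500MB
        Storage.Type filesystem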

kinseii commented 3 weeks ago

I'm trying to take a look at it. What is the main function in this Lua script?

/fluent-bit/scripts/common.lua

I uploaded this script here: https://github.com/kinseii/fluent-bit-lua-scripts