fluent / fluent-operator

Operate Fluent Bit and Fluentd in the Kubernetes way - Previously known as FluentBit Operator
Apache License 2.0

bug: OpenSearchErrorHandler - 400 #702

Closed kaiohenricunha closed 1 year ago

kaiohenricunha commented 1 year ago

Describe the issue

I deployed a Fluent Bit + Fluentd pipeline using the fluent-operator's chart. Most of the logs are being properly indexed in OpenSearch and enriched with Kubernetes metadata.

But many logs are being rejected with the following warning:

2018-11-13 11:16:38 +0000 [warn]: #0 dump an error event: error_class=Fluent::Plugin::OpenSearchErrorHandler::OpenSearchError error="400 - Rejected by OpenSearch" location=nil tag="fluent.warn" time=2018-11-13 11:16:27.978851140 +0000 record={"error"=>"#<Fluent::Plugin::OpenSearchErrorHandler::OpenSearchError: 400 - Rejected by OpenSearch>", "location"=>nil, "tag"=>"app.fluentcat", "time"=>2018-11-13 11:16:17.492985640 +0000, "record"=>{"message"=>"\xFF\xAD"}, "message"=>"dump an error event: error_class=Fluent::Plugin::OpenSearchErrorHandler::OpenSearchError error=\"400 - Rejected by OpenSearch\" location=nil tag=\"app.fluentcat\" time=2018-11-13 11:16:17.492985640 +0000 record={\"message\"=>\"\\xFF\\xAD\"}"}

Worse, these logs are not even discarded: Fluentd keeps retrying to send them, which causes a flood of rejected events.
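As an aside, the "dump an error event" lines are Fluentd's fallback when no @ERROR label is defined. Below is a minimal sketch that catches the rejected records and drops them instead of letting them recycle, assuming fluent-plugin-opensearch re-emits unrecoverable records as error events the way its Elasticsearch ancestor does:

<label @ERROR>
  # Records re-emitted via router.emit_error_event land here;
  # out_null discards them (swap in @type file to keep a copy for inspection).
  <match **>
    @type null
  </match>
</label>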

I set log_os_400_reason to true in my ClusterOutput definition, and it gives the exact reason for the error:

apiVersion: fluentd.fluent.io/v1alpha1
kind: ClusterOutput
metadata:
  name: cluster-output-opensearch
  labels:
    output.fluentd.fluent.io/enabled: "true"
    output.fluentd.fluent.io/tenant: "core"
spec:
  outputs:
    - customPlugin:
        config: |
          <match **>
            @type opensearch
            host "${FLUENT_OPENSEARCH_HOST}"
            port 443
            logstash_format  true
            logstash_prefix logs
            scheme https
            log_os_400_reason true
            @log_level ${FLUENTD_OUTPUT_LOGLEVEL:=debug}
            <endpoint>
              url "https://${FLUENT_OPENSEARCH_HOST}"
              region "${FLUENT_OPENSEARCH_REGION}"
              assume_role_arn "#{ENV['AWS_ROLE_ARN']}"
              assume_role_web_identity_token_file "#{ENV['AWS_WEB_IDENTITY_TOKEN_FILE']}"
            </endpoint>
          </match>

The warning now prints:

2023-04-19 15:10:36 +0000 [warn]: #0 dump an error event: error_class=Fluent::Plugin::OpenSearchErrorHandler::OpenSearchError error="400 - Rejected by OpenSearch [error type]: mapper_parsing_exception [reason]: 'object mapping for [kubernetes.labels.app] tried to parse field [app] as object, but found a concrete value'" location=nil tag="kube.var.log.containers.XXXXXXX.log" time=2023-04-19 15:10:34.715435870 +0000 record={"log"=>"XXXXXXXXXXX", "kubernetes"=>{"pod_name"=>"XXXXX", "namespace_name"=>"XXXX", "labels"=>{"app"=>"XXXX", "pod-template-hash"=>"XXXX", "release"=>"XXXX"}, "annotations"=>{"checksum/config"=>"XXXX", "checksum/secrets"=>"XXXX", "kubernetes.io/psp"=>"eks.privileged"}, "container_name"=>"XXXX", "docker_id"=>"XXXX", "container_image"=>"docker.io/XXXX/XXXXX"}}
2023-04-19 15:10:36 +0000 [debug]: #0 Indexed (op = index), 656 bad_argument

So the reason appears to be a mapping collision:

"[error type]: mapper_parsing_exception [reason]: 'object mapping for [kubernetes.labels.app] tried to parse field [app] as object, but found a concrete value"

My logs also carry a kubernetes.labels.app.kubernetes.io/foo label, which conflicts with kubernetes.labels.app. This has been discussed here.
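To make the collision concrete: OpenSearch splits field names on dots and expands them into nested objects, so the two label shapes below (values are made up) demand incompatible mappings for kubernetes.labels.app in the same index.

A pod labeled app=my-service indexes app as a concrete string:

{"kubernetes": {"labels": {"app": "my-service"}}}

A pod labeled app.kubernetes.io/foo=bar expands app into an object:

{"kubernetes": {"labels": {"app": {"kubernetes": {"io/foo": "bar"}}}}}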

The operator does offer the suggested replaceDots feature on the Fluent Bit output CRD, but only when Fluent Bit sends logs directly to OpenSearch.

I'm using forward as my Fluent Bit ClusterOutput to send logs to Fluentd and then on to OpenSearch, and the forward output has no replaceDots parameter.
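For context, the Fluent Bit side of the pipeline looks roughly like this; a sketch assuming the operator's fluentbit.fluent.io/v1alpha2 API and a placeholder Fluentd service address. Note there is no replaceDots field on the forward plugin:

apiVersion: fluentbit.fluent.io/v1alpha2
kind: ClusterOutput
metadata:
  name: forward-to-fluentd
  labels:
    fluentbit.fluent.io/enabled: "true"
spec:
  match: "*"
  forward:
    host: fluentd.fluent.svc   # placeholder service address
    port: 24224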

I tried using dynamic index templates, as suggested by the fluent-plugin-opensearch documentation for this error: https://github.com/fluent/fluent-plugin-opensearch/blob/main/README.Troubleshooting.md#random-400---rejected-by-opensearch-is-occured-why

That didn't work either.

I then tried a different index template, explicitly mapping the troublesome label:

PUT _index_template/logs
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "index": {
        "number_of_shards": "3",
        "number_of_replicas": "1",
        "refresh_interval": "30s"
      }
    },
    "mappings": {
      "_source": {
        "enabled": true
      },
      "properties": {
        "created_at": {
          "format": "EEE MMM dd HH:mm:ss Z YYYY",
          "type": "date"
        },
        "host_name": {
          "type": "keyword"
        },
        "kubernetes": {
          "properties": {
            "labels": {
              "properties": {
                "app": {
                  "type": "keyword"
                }
              }
            }
          }
        }
      }
    }
  }
}

It solved the error for kubernetes.labels.app, but OpenSearch then started rejecting documents over other labels, such as:

mapper_parsing_exception [reason]: 'Could not dynamically add mapping for field [app.kubernetes.io/component]. Existing mapping for [kubernetes.labels.app] must be of type object but found [text].'

What would be the equivalent replaceDots configuration for a Fluentd ClusterOutput? Is there any other viable solution?
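One index-side alternative worth noting: OpenSearch 2.7+ ships a flat_object field type that stores dotted keys verbatim instead of expanding them into nested objects, which sidesteps the collision entirely. A sketch, assuming OpenSearch 2.7 or later:

PUT _index_template/logs
{
  "index_patterns": ["logs-*"],
  "template": {
    "mappings": {
      "properties": {
        "kubernetes": {
          "properties": {
            "labels": {
              "type": "flat_object"
            }
          }
        }
      }
    }
  }
}

The trade-off is that flat_object subfields are not full-fledged keyword fields, so (at least in early releases) they cannot be aggregated or sorted on the way dedicated mappings can.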

To Reproduce

  1. Install the operator via the Helm chart with both Fluent Bit and Fluentd enabled.
  2. Fluent Bit should use forward as its ClusterOutput, sending logs to the Fluentd service.
  3. Fluent Bit should have kubernetes.annotations.enabled and kubernetes.labels.enabled set to true.
  4. Run some pods with a kubernetes.labels.app label and some with a kubernetes.labels.app.kubernetes.io/foo label (see the sketch below).
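A minimal sketch of step 4; pod names, the image, and label values are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: pod-a
  labels:
    app: my-service              # indexes kubernetes.labels.app as a string
spec:
  containers:
    - name: main
      image: busybox
      command: ["sleep", "infinity"]
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-b
  labels:
    app.kubernetes.io/foo: bar   # expands kubernetes.labels.app into an object
spec:
  containers:
    - name: main
      image: busybox
      command: ["sleep", "infinity"]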

Expected behavior

Logs shipped to OpenSearch regardless of their labels, with no OpenSearchErrorHandler - 400 errors.

Your Environment

- Fluent Operator version: 2.1.0
- Container Runtime: docker
- Operating system: Linux
- Kernel version: 5.4.0-135-generic

How did you install fluent operator?

Helm chart.

Additional context

No response

leonardolacerdaatlantico commented 1 year ago

I'm also experiencing this issue! Any ideas? Thanks

ArthurSudbrackIbarra commented 1 year ago

Any ideas on what might be happening here?

kaiohenricunha commented 1 year ago

Issue resolved by adding a custom dedot Fluentd ClusterFilter:

apiVersion: fluentd.fluent.io/v1alpha1
kind: ClusterFilter
metadata:
  labels:
    filter.fluentd.fluent.io/enabled: "true"
    filter.fluentd.fluent.io/name: "de-dot"
  name: de-dot
spec:
  filters:
    - customPlugin:
        config: |
          <filter **>
            @type dedot
            de_dot_separator _
            de_dot_nested ${FLUENTD_DEDOT_NESTED:=true}
          </filter>
---
apiVersion: fluentd.fluent.io/v1alpha1
kind: ClusterFluentdConfig
metadata:
  labels:
    config.fluentd.fluent.io/enabled: "true"
  name: cluster-fluentd-config
spec:
  clusterFilterSelector:
    matchLabels:
      filter.fluentd.fluent.io/enabled: "true"
      filter.fluentd.fluent.io/name: "de-dot"
  clusterOutputSelector:
    matchLabels:
      output.fluentd.fluent.io/enabled: "true"
      output.fluentd.fluent.io/tenant: "core"
  watchedNamespaces: [] # watches all namespaces when empty
---

This works because field names containing dots are ambiguous: OpenSearch expands them into nested objects. For example, logs with the label kubernetes.labels.statefulset.kubernetes.io/pod-name collide with the generated keyword subfield kubernetes.labels.statefulset.kubernetes.io/pod-name.keyword.

In this case, OpenSearch assumes the JSON of this log should have the following nested format, and tries to build it twice, once for each colliding field:

{
  "kubernetes": {
    "labels": {
      "statefulset": {
        "kubernetes": {
          "io": {
            "pod-name": "some_value"
          }
        }
      }
    }
  }
}

This misinterpretation can cause unexpected behavior during indexing and querying, and might result in data loss or errors. By replacing dots with underscores using the de_dot filter, we can avoid such ambiguity and ensure that the field name is correctly interpreted.

After applying the dedot filter, it becomes:

{
  "kubernetes": {
    "labels": {
      "statefulset_kubernetes_io/pod-name": "some_pod_name",
      "statefulset_kubernetes_io/pod-name.keyword": "some_pod_name_keyword"
    }
  }
}

Dedot filter parameters would be a nice feature to have in the fluent-operator.
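Purely as a hypothetical illustration (these field names do not exist in the operator today), a first-class dedot filter on the Fluentd ClusterFilter CRD might look like:

apiVersion: fluentd.fluent.io/v1alpha1
kind: ClusterFilter
metadata:
  name: de-dot
spec:
  filters:
    - dedot:                # hypothetical built-in plugin block
        deDotSeparator: "_" # hypothetical field
        deDotNested: true   # hypothetical field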

kaiohenricunha commented 1 year ago

@stephweiss did you get it solved?

stephweiss commented 1 year ago

> @stephweiss did you get it solved?

Yes, thank you for asking. My config was wrong; it worked exactly as described here.