fabric8io / fluent-plugin-kubernetes_metadata_filter

Enrich your fluentd events with Kubernetes metadata
Apache License 2.0

Labels missing in log record #142

Closed MartinAltmayer closed 5 years ago

MartinAltmayer commented 6 years ago

For some of my pods, no Kubernetes labels are reported in the log records (record.kubernetes.labels does not exist) even though the pods are labelled. This seems to happen mostly with short-running pods.

Unfortunately, my fluentd config processes logs based on labels (I am using kube-fluentd-operator), which fails if the labels are absent.

The test-pod I use to (sometimes) reproduce the issue is the following:

apiVersion: batch/v1
kind: Job
metadata:
  name: oneshot
  labels:
    component: logging-test
spec:
  parallelism: 5  # Not necessary
  completions: 5  # but might help to reproduce the issue
  template:
    metadata:
      labels:
        component: logging-test
    spec:
      restartPolicy: Never
      containers:
      - name: test
        image: <some image>
        imagePullPolicy: Always
        command: ['bash']
        args:
          - '-c'
          - |
            set -eu
            echo "Hallo Delta!"
            sleep 10s
            echo "Ciao Delta!"

Is there any debug/logging output I can provide to help you understand this issue?

I've also seen https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/issues/123, but since I am using v2.1.2 that should no longer be a problem.

richm commented 6 years ago

Turning on fluentd log_level debug or trace should show additional log information.
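A minimal sketch of raising fluentd's log level globally via its standard <system> section (the value is illustrative):

<system>
  # raise fluentd's own log verbosity; use "trace" for even more detail
  log_level debug
</system>

Starting fluentd with -v (debug) or -vv (trace) has the same effect.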

jcantrill commented 6 years ago

I would expect this to be the case for short-lived pods unless watch [1] is enabled. When watch is disabled, the plugin has no way to get the metadata: by the time it queries the API server the pod is already gone, so there is nothing to be found. One could also see this condition when the cache rolls over due to size or age, for example under high message throughput with many pods.

[1] https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/blob/master/lib/fluent/plugin/filter_kubernetes_metadata.rb#L43
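For reference, a minimal sketch of a filter section with the watch enabled explicitly (the plugin defaults to watch true, so this mostly documents the intent):

<filter kubernetes.**>
  @type kubernetes_metadata
  # keep a watch open against the API server so pod metadata is captured
  # while the pod is still alive, rather than only on cache misses
  watch true
</filter>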

jvassev commented 6 years ago

The metadata filter is applied like this:

<filter kubernetes.**>
  @type kubernetes_metadata
  merge_json_log true
  preserve_json_log true
  include_namespace_metadata false
</filter>

This would have watch enabled by default. So, it must be the cache then?
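If the cache is the suspect, the plugin exposes cache tuning parameters; a hedged sketch with illustrative (not recommended) values:

<filter kubernetes.**>
  @type kubernetes_metadata
  # number of metadata entries kept in the LRU cache (default 1000)
  cache_size 10000
  # how long cached entries are kept, in seconds (default 3600)
  cache_ttl 7200
</filter>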

caiohasouza commented 5 years ago

I have the same problem, but in my case it happens only with short-running pods, and I get this error:

backend/fluentd-cloudwatch-hbhdv[fluentd-cloudwatch]: 2019-03-19 15:10:42 +0000 [warn]: #0 dump an error event: error_class=RuntimeError error="failed to expand %Q[k8s.domain.com.br.\#{record[\"kubernetes\"][\"labels\"][\"release\"]}.\#{record[\"kubernetes\"][\"container_name\"]}] : error = undefined method `[]' for nil:NilClass" location="/fluentd/vendor/bundle/ruby/2.3.0/gems/fluentd-1.3.3/lib/fluent/plugin/filter_record_transformer.rb:310:in `rescue in expand'" tag="kubernetes.var.log.containers.monitoring-backup-duplicity-1553008200-5g6kb_monitoring_duplicity-cleanup-a7e1f1be4164672dee0f1ccd13431d67aab7dc32373d883b017c5537f44ad726.log" time=2019-03-19 15:10:11.548567705 +0000 record={"log"=>"Which can't be deleted because newer sets depend on them.\n", "stream"=>"stdout", "docker"=>{"container_id"=>"a7e1f1be4164672dee0f1ccd13431d67aab7dc32373d883b017c5537f44ad726"}, "kubernetes"=>{"container_name"=>"duplicity-cleanup", "namespace_name"=>"monitoring", "pod_name"=>"monitoring-backup-duplicity-1553008200-5g6kb", "pod_id"=>"a7e1f1be4164672dee0f1ccd13431d67aab7dc32373d883b017c5537f44ad726", "namespace_id"=>"78a930d9-20cb-11e9-a77f-023ff1239c2a"}}

I have already added the "K8S_NODE_NAME" environment variable but the problem persists. If I remove this record setting, everything works fine:

kubernetes_app_name k8s.domain.com.br.${record["kubernetes"]["labels"]["release"]}.${record["kubernetes"]["container_name"]}
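A possible workaround is to guard against the missing labels key instead of indexing into it directly. This is only a sketch, assuming the field is built with fluentd's record_transformer filter with enable_ruby turned on (as the stack trace above suggests) and using a hypothetical "unknown" fallback value:

<filter kubernetes.**>
  @type record_transformer
  enable_ruby true
  <record>
    # record.dig returns nil instead of raising when "labels" is absent,
    # so fall back to a placeholder rather than failing the whole event
    kubernetes_app_name k8s.domain.com.br.${record.dig("kubernetes", "labels", "release") || "unknown"}.${record.dig("kubernetes", "container_name") || "unknown"}
  </record>
</filter>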

Regards

caiohasouza commented 5 years ago

Someone?

prabhu-natarajan commented 5 years ago

In the case of short-lived pods, log aggregators may not get a chance to collect the metadata and the log data, either completely or partially, if the pod is started and disposed of between two consecutive polling attempts by the aggregator. When this happens, either the entire logs or just the labels may go missing. Introducing a small delay (sleep 65) at the end of the container CMD resolved the issue in my situation. I set the delay slightly higher than the log aggregator's polling interval; for example, fluentd has a default refresh_interval of 60 seconds.

See also https://github.com/DataDog/datadog-agent/issues/2749
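The opposite approach is to shorten the aggregator's polling interval rather than delaying the pod. Below is a sketch of an in_tail source with a reduced refresh_interval; the path, pos_file and parser are conventional values assumed for illustration, not taken from this thread:

<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  # default is 60s; a shorter refresh picks up new container log files sooner,
  # which matters for short-lived pods
  refresh_interval 15
  <parse>
    @type json
  </parse>
</source>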

jcantrill commented 5 years ago

Closing this issue, as we are only able to retrieve labels if the pod metadata is still available to the plugin, i.e. the pod still exists in the API server, an active watch captured it while it was alive, or its entry is still in the cache (see the earlier comment about watch).

You may need to tweak your fluentd configuration and the settings specific to this plugin to ensure labels can be applied.