fluent / fluentd-kubernetes-daemonset

Fluentd daemonset for Kubernetes and its Docker image
Apache License 2.0

`free(): invalid pointer` with latest fluent/fluentd-kubernetes-daemonset:v1-debian-forward-arm64 image #1478

Open smparekh opened 5 months ago

smparekh commented 5 months ago

Describe the bug

Using the latest v1-debian-forward-arm64 image, the container throws `free(): invalid pointer` and restarts constantly, which eventually leads to a node eviction.

To Reproduce

I have provided a redacted config to reproduce the issue (see "Your Configuration").

Expected behavior

The worker should come up and stay up.

Your Environment

- Tag in use: fluentd-kubernetes-daemonset:v1-debian-forward-arm64

Your Configuration

    @include "#{ENV['FLUENTD_SYSTEMD_CONF'] || 'systemd'}.conf"
    @include "#{ENV['FLUENTD_PROMETHEUS_CONF'] || 'prometheus'}.conf"
    @include conf.d/*.

    <label @FLUENT_LOG>
      <match fluent.**>
        @type null
        @id ignore_fluent_logs
      </match>
    </label>

    <match kubelet>
      @type null
    </match>

    <filter kubernetes.**>
      @type kubernetes_metadata
      @id filter_kube_metadata
      kubernetes_url "#{ENV['FLUENT_FILTER_KUBERNETES_URL'] || 'https://' + ENV.fetch('KUBERNETES_SERVICE_HOST') + ':' + ENV.fetch('KUBERNETES_SERVICE_PORT') + '/api'}"
      verify_ssl "#{ENV['KUBERNETES_VERIFY_SSL'] || true}"
      ca_file "#{ENV['KUBERNETES_CA_FILE']}"
      skip_labels "#{ENV['FLUENT_KUBERNETES_METADATA_SKIP_LABELS'] || 'false'}"
      skip_container_metadata "#{ENV['FLUENT_KUBERNETES_METADATA_SKIP_CONTAINER_METADATA'] || 'false'}"
      skip_master_url "#{ENV['FLUENT_KUBERNETES_METADATA_SKIP_MASTER_URL'] || 'false'}"
      skip_namespace_metadata "#{ENV['FLUENT_KUBERNETES_METADATA_SKIP_NAMESPACE_METADATA'] || 'false'}"
      watch "#{ENV['FLUENT_KUBERNETES_WATCH'] || 'true'}"
    </filter>

    <source>
      @type tail
      @id in_tail_container_logs
      path "#{ENV['FLUENT_CONTAINER_TAIL_PATH'] || '/var/log/containers/*.log'}"
      pos_file "#{File.join('/var/log/', ENV.fetch('FLUENT_POS_EXTRA_DIR', ''), 'fluentd-containers.log.pos')}"
      tag "#{ENV['FLUENT_CONTAINER_TAIL_TAG'] || 'kubernetes.*'}"
      exclude_path "#{ENV['FLUENT_CONTAINER_TAIL_EXCLUDE_PATH'] || use_default}"
      read_from_head true
      <parse>
        @type "#{ENV['FLUENT_CONTAINER_TAIL_PARSER_TYPE'] || 'json'}"
        time_format "#{ENV['FLUENT_CONTAINER_TAIL_PARSER_TIME_FORMAT'] || '%Y-%m-%dT%H:%M:%S.%NZ'}"
      </parse>
    </source>

    <filter qfunctions.**>
      @type record_transformer
      enable_ruby true
      <record>
        message ${record["message"].gsub(/^.*std(out|err):\s/, '')}
      </record>
    </filter>

    <filter qfunctions.**>
      @type parser
      format json
      key_name message
      emit_invalid_record_to_error false
    </filter>

    <match qfunctions.**>
      @type rewrite_tag_filter
      <rule>
        key tenant_id
        pattern /^abc1234$/
        tag abc1234
      </rule>
      <rule>
        key tenant_id
        pattern /.+/
        tag clear
      </rule>
    </match>

    <match abc1234.**>
      @type http
      @id out_abc1234
      @log_level info

      endpoint "#{ENV['ENDPOINT']}"
      http_method post
      content_type application/json
      json_array true
      <format>
        @type json
      </format>
      headers {"X-P-Stream": "functions", "X-P-Meta-Org-Id": "abc1234"}
      <auth>
        method basic
        username "#{ENV['USERNAME']}"
        password "#{ENV['PASSWORD']}"
      </auth>
    </match>

    <match clear>
      @type null
    </match>
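
For readers following the routing, here is a minimal Python sketch of what the three qfunctions stages in this config appear to do to a single message: strip the stdout/stderr prefix, parse the remainder as JSON, and retag by tenant_id. It is not Fluentd code. It assumes that the rules are tried top to bottom with the first match winning, that unparseable or unmatched records are dropped, and that the Ruby patterns translate directly to Python's `re` module. The function name `route` is hypothetical.

```python
import json
import re

# Assumed Python equivalents of the Ruby patterns in the config.
# re.MULTILINE mirrors Ruby's line-based ^ and $ anchors.
PREFIX = re.compile(r"^.*std(out|err):\s", re.MULTILINE)
RULES = [
    (re.compile(r"^abc1234$", re.MULTILINE), "abc1234"),
    (re.compile(r".+"), "clear"),
]


def route(message):
    """Return (new_tag, record) for one raw message, or None if it is dropped."""
    # Stage 1 (record_transformer): remove everything up to "stdout: " / "stderr: ".
    body = PREFIX.sub("", message)

    # Stage 2 (parser): parse the remainder as JSON.
    # Assumption: invalid records are silently dropped.
    try:
        record = json.loads(body)
    except ValueError:
        return None
    if not isinstance(record, dict):
        return None

    # Stage 3 (rewrite_tag_filter): first matching rule decides the new tag.
    # Assumption: a missing tenant_id is treated as an empty string.
    value = str(record.get("tenant_id", ""))
    for pattern, tag in RULES:
        if pattern.search(value):
            return tag, record
    return None
```

Under these assumptions, only records whose tenant_id is exactly abc1234 reach the http output; every other non-empty tenant_id is retagged to clear and discarded by the null output.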

Your Error Log

2024-01-17 15:48:14 +0000 [error]: Worker 0 exited unexpectedly with signal SIGABRT
2024-01-17 15:48:15 +0000 [info]: #0 init worker0 logger path=nil rotate_age=nil rotate_size=nil
2024-01-17 15:48:15 +0000 [info]: adding match in @FLUENT_LOG pattern="fluent.**" type="null"
2024-01-17 15:48:15 +0000 [info]: adding match pattern="kubelet" type="null"
2024-01-17 15:48:15 +0000 [info]: adding filter pattern="kubernetes.**" type="kubernetes_metadata"
2024-01-17 15:48:15 +0000 [info]: adding filter pattern="qfunctions.**" type="record_transformer"
2024-01-17 15:48:15 +0000 [info]: adding filter pattern="qfunctions.**" type="parser"
2024-01-17 15:48:15 +0000 [info]: adding match pattern="qfunctions.**" type="rewrite_tag_filter"
2024-01-17 15:48:15 +0000 [info]: #0 adding rewrite_tag_filter rule: tenant_id [#<Fluent::PluginHelper::RecordAccessor::Accessor:0x0000ffff7b7b91b8 @keys="tenant_id">, /^abc1234$/, "", "abc1234", nil]
2024-01-17 15:48:15 +0000 [info]: #0 adding rewrite_tag_filter rule: tenant_id [#<Fluent::PluginHelper::RecordAccessor::Accessor:0x0000ffff7b7b8790 @keys="tenant_id">, /.+/, "", "clear", nil]
2024-01-17 15:48:15 +0000 [info]: adding match pattern="abc1234.**" type="http"
2024-01-17 15:48:15 +0000 [warn]: #0 [out_abc1234] Status code 503 is going to be removed from default `retryable_response_codes` from fluentd v2. Please add it by yourself if you wish
2024-01-17 15:48:15 +0000 [info]: adding match pattern="clear" type="null"
2024-01-17 15:48:15 +0000 [info]: adding source type="systemd"
2024-01-17 15:48:15 +0000 [info]: adding source type="systemd"
2024-01-17 15:48:15 +0000 [info]: adding source type="systemd"
2024-01-17 15:48:15 +0000 [info]: adding source type="prometheus"
2024-01-17 15:48:15 +0000 [info]: adding source type="prometheus_output_monitor"
2024-01-17 15:48:15 +0000 [info]: adding source type="tail"
2024-01-17 15:48:15 +0000 [info]: #0 starting fluentd worker pid=361 ppid=6 worker=0
2024-01-17 15:48:15 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/contact-task-runtime-5cbd49696c-fmqkz_openfaas-fn_contact-task-runtime-90840620b3e6f1d26b85a666402b31aa3a5d5f9faf8f2388c919c87c5ce082a1.log
2024-01-17 15:48:15 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/ground-task-runtime-65446d7bcc-527dl_openfaas-fn_ground-task-runtime-b12db1d88da3a582965a7ff372367d9676e9e640f505694022c6f5da97649e46.log
2024-01-17 15:48:15 +0000 [info]: #0 fluentd worker is now running worker=0
free(): invalid pointer
2024-01-17 15:48:17 +0000 [error]: Worker 0 exited unexpectedly with signal SIGABRT
2024-01-17 15:48:18 +0000 [info]: #0 init worker0 logger path=nil rotate_age=nil rotate_size=nil
2024-01-17 15:48:18 +0000 [info]: adding match in @FLUENT_LOG pattern="fluent.**" type="null"
2024-01-17 15:48:18 +0000 [info]: adding match pattern="kubelet" type="null"
2024-01-17 15:48:18 +0000 [info]: adding filter pattern="kubernetes.**" type="kubernetes_metadata"
2024-01-17 15:48:18 +0000 [info]: adding filter pattern="qfunctions.**" type="record_transformer"
2024-01-17 15:48:18 +0000 [info]: adding filter pattern="qfunctions.**" type="parser"
2024-01-17 15:48:18 +0000 [info]: adding match pattern="qfunctions.**" type="rewrite_tag_filter"
2024-01-17 15:48:18 +0000 [info]: #0 adding rewrite_tag_filter rule: tenant_id [#<Fluent::PluginHelper::RecordAccessor::Accessor:0x0000ffff8cd245b0 @keys="tenant_id">, /^org_2Jf4UxF6FEwCMecX$/, "", "abc1234", nil]
2024-01-17 15:48:18 +0000 [info]: #0 adding rewrite_tag_filter rule: tenant_id [#<Fluent::PluginHelper::RecordAccessor::Accessor:0x0000ffff8cd23f98 @keys="tenant_id">, /.+/, "", "clear", nil]
2024-01-17 15:48:18 +0000 [info]: adding match pattern="abc1234.**" type="http"
2024-01-17 15:48:18 +0000 [warn]: #0 [out_abc1234] Status code 503 is going to be removed from default `retryable_response_codes` from fluentd v2. Please add it by yourself if you wish
2024-01-17 15:48:18 +0000 [info]: adding match pattern="clear" type="null"
2024-01-17 15:48:18 +0000 [info]: adding source type="systemd"
2024-01-17 15:48:18 +0000 [info]: adding source type="systemd"
2024-01-17 15:48:18 +0000 [info]: adding source type="systemd"
2024-01-17 15:48:18 +0000 [info]: adding source type="prometheus"
2024-01-17 15:48:18 +0000 [info]: adding source type="prometheus_output_monitor"
2024-01-17 15:48:18 +0000 [info]: adding source type="tail"
2024-01-17 15:48:18 +0000 [info]: #0 starting fluentd worker pid=376 ppid=6 worker=0
2024-01-17 15:48:18 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/contact-task-runtime-5cbd49696c-fmqkz_openfaas-fn_contact-task-runtime-90840620b3e6f1d26b85a666402b31aa3a5d5f9faf8f2388c919c87c5ce082a1.log
2024-01-17 15:48:18 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/ground-task-runtime-65446d7bcc-527dl_openfaas-fn_ground-task-runtime-b12db1d88da3a582965a7ff372367d9676e9e640f505694022c6f5da97649e46.log
2024-01-17 15:48:18 +0000 [info]: #0 fluentd worker is now running worker=0
free(): invalid pointer

Additional context

We have a daemonset in a cluster that has been running for about 22 days, and it does not show the invalid pointer issue.

smparekh commented 5 months ago

The SHA-256 digest of the image we are having the issue with: 59886dc179d52a43dfdf061c764e9856dafc67c41dd78e9d868872000d9e660a

smparekh commented 5 months ago

Reverting to this sha: f0c0d41aba562c5f4ce13f2b00ae50c381925063cfcc7ec7a9f2a4f622ee9535 does not throw the invalid pointer error.
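
For anyone applying the same workaround, an image can be referenced by digest instead of by a moving tag. The following is a minimal Python sketch that builds such a reference using the standard `repository@sha256:<hex>` form. It assumes the value quoted in this comment is the registry manifest digest (not a local image ID); the helper name `pinned_image` is hypothetical.

```python
import re

REPO = "fluent/fluentd-kubernetes-daemonset"
_HEX64 = re.compile(r"[0-9a-f]{64}")


def pinned_image(repo, digest):
    """Build a digest-pinned image reference, e.g. for a DaemonSet image field."""
    # Accept the digest with or without its "sha256:" prefix.
    if digest.startswith("sha256:"):
        digest = digest[len("sha256:"):]
    # A SHA-256 digest is exactly 64 lowercase hex characters.
    if not _HEX64.fullmatch(digest):
        raise ValueError("expected a 64-character lowercase hex SHA-256 digest")
    return f"{repo}@sha256:{digest}"


# Digest reported in this comment as not crashing.
KNOWN_GOOD = "f0c0d41aba562c5f4ce13f2b00ae50c381925063cfcc7ec7a9f2a4f622ee9535"
print(pinned_image(REPO, KNOWN_GOOD))
```

Pinning by digest keeps the daemonset on the known-good build even when the v1-debian-forward-arm64 tag is moved to a newer image.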

StevenChangNoodoe commented 5 months ago

I have the same issue with fluent/fluentd-kubernetes-daemonset:v1-debian-cloudwatch. After reverting to this sha: b7185b3483d2ca5c3e923e33641dd3814865321b34da05c46eda96576da905a0, the error no longer occurs. v1-debian-cloudwatch.log

CAR6807 commented 3 months ago

Also seeing this with the fluent/fluentd-kubernetes-daemonset:v1.16.5-debian-forward-1.0 image.

Logging fails:

2024-04-03 20:27:34 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/node-problem-detector-kwwk8_kube-system_node-problem-detector-4e2796e4c3ca14953fda355aca52c0200a0f53b7b0596d7e94ec89169c782f8a.log
2024-04-03 20:27:34 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/unbound-exporter-llm48_unbound_unbound-exporter-bd636614623be73dc03069f9a0fefffb779c47d2c034e796d3364fb49fb2e6fe.log
2024-04-03 20:27:34 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/unbound-exporter-llm48_unbound_unbound-exporter-init-1b88c92fa871c07c66d558a84a656879a1b13dfa12c6b533b37ec9ae74fc555f.log
2024-04-03 20:27:34 +0000 [info]: #0 fluentd worker is now running worker=0
free(): invalid pointer
2024-04-03 20:27:37 +0000 [error]: Worker 0 exited unexpectedly with signal SIGABRT
2024-04-03 20:27:37 +0000 [info]: #0 init worker0 logger path=nil rotate_age=nil rotate_size=nil

github-actions[bot] commented 4 days ago

This issue has been automatically marked as stale because it has been open 90 days with no activity. Remove stale label or comment or this issue will be closed in 30 days