elastic / logstash

Logstash - transport and process your logs, events, or other data
https://www.elastic.co/products/logstash

Intermittent logstash pod pipeline stops with [DeadLetterQueueWriter] unable to finalize segment #15078

Closed ukho-cfreeman closed 1 year ago

ukho-cfreeman commented 1 year ago

Logstash information:

Please include the following information:

  1. Logstash version: logstash 8.7.1
  2. Logstash installation source: official docker image
  3. How is Logstash being run: kubernetes

Plugins installed:

logstash-codec-cef (6.2.6)
logstash-codec-collectd (3.1.0)
logstash-codec-dots (3.0.6)
logstash-codec-edn (3.1.0)
logstash-codec-edn_lines (3.1.0)
logstash-codec-es_bulk (3.1.0)
logstash-codec-fluent (3.4.2)
logstash-codec-graphite (3.0.6)
logstash-codec-json (3.1.1)
logstash-codec-json_lines (3.1.0)
logstash-codec-line (3.1.1)
logstash-codec-msgpack (3.1.0)
logstash-codec-multiline (3.1.1)
logstash-codec-netflow (4.3.0)
logstash-codec-plain (3.1.0)
logstash-codec-rubydebug (3.1.0)
logstash-filter-aggregate (2.10.0)
logstash-filter-anonymize (3.0.6)
logstash-filter-cidr (3.1.3)
logstash-filter-clone (4.2.0)
logstash-filter-csv (3.1.1)
logstash-filter-date (3.1.15)
logstash-filter-de_dot (1.0.4)
logstash-filter-dissect (1.2.5)
logstash-filter-dns (3.2.0)
logstash-filter-drop (3.0.5)
logstash-filter-elasticsearch (3.13.0)
logstash-filter-fingerprint (3.4.2)
logstash-filter-geoip (7.2.13)
logstash-filter-grok (4.4.3)
logstash-filter-http (1.4.3)
logstash-filter-json (3.2.0)
logstash-filter-json_encode (3.0.3)
logstash-filter-kv (4.7.0)
logstash-filter-memcached (1.1.0)
logstash-filter-metrics (4.0.7)
logstash-filter-mutate (3.5.6)
logstash-filter-prune (3.0.4)
logstash-filter-ruby (3.1.8)
logstash-filter-sleep (3.0.7)
logstash-filter-split (3.1.8)
logstash-filter-syslog_pri (3.2.0)
logstash-filter-throttle (4.0.4)
logstash-filter-translate (3.4.0)
logstash-filter-truncate (1.0.5)
logstash-filter-urldecode (3.0.6)
logstash-filter-useragent (3.3.4)
logstash-filter-uuid (3.0.5)
logstash-filter-xml (4.2.0)
logstash-input-azure_event_hubs (1.4.4)
logstash-input-beats (6.5.0)
└── logstash-input-elastic_agent (alias)
logstash-input-couchdb_changes (3.1.6)
logstash-input-dead_letter_queue (2.0.0)
logstash-input-elasticsearch (4.16.0)
logstash-input-exec (3.6.0)
logstash-input-file (4.4.4)
logstash-input-ganglia (3.1.4)
logstash-input-gelf (3.3.2)
logstash-input-generator (3.1.0)
logstash-input-graphite (3.0.6)
logstash-input-heartbeat (3.1.1)
logstash-input-http (3.6.1)
logstash-input-http_poller (5.4.0)
logstash-input-imap (3.2.0)
logstash-input-jms (3.2.2)
logstash-input-kinesis (2.2.1)
logstash-input-pipe (3.1.0)
logstash-input-redis (3.7.0)
logstash-input-snmp (1.3.1)
logstash-input-snmptrap (3.1.0)
logstash-input-stdin (3.4.0)
logstash-input-syslog (3.6.0)
logstash-input-tcp (6.3.2)
logstash-input-twitter (4.1.0)
logstash-input-udp (3.5.0)
logstash-input-unix (3.1.2)
logstash-integration-aws (7.1.1)
 ├── logstash-codec-cloudfront
 ├── logstash-codec-cloudtrail
 ├── logstash-input-cloudwatch
 ├── logstash-input-s3
 ├── logstash-input-sqs
 ├── logstash-output-cloudwatch
 ├── logstash-output-s3
 ├── logstash-output-sns
 └── logstash-output-sqs
logstash-integration-elastic_enterprise_search (2.2.1)
 ├── logstash-output-elastic_app_search
 └──  logstash-output-elastic_workplace_search
logstash-integration-jdbc (5.4.1)
 ├── logstash-input-jdbc
 ├── logstash-filter-jdbc_streaming
 └── logstash-filter-jdbc_static
logstash-integration-kafka (10.12.0)
 ├── logstash-input-kafka
 └── logstash-output-kafka
logstash-integration-rabbitmq (7.3.1)
 ├── logstash-input-rabbitmq
 └── logstash-output-rabbitmq
logstash-output-csv (3.0.8)
logstash-output-elasticsearch (11.13.1)
logstash-output-email (4.1.1)
logstash-output-file (4.3.0)
logstash-output-graphite (3.1.6)
logstash-output-http (5.5.0)
logstash-output-lumberjack (3.1.9)
logstash-output-nagios (3.0.6)
logstash-output-null (3.0.5)
logstash-output-opensearch (2.0.0)
logstash-output-pipe (3.0.6)
logstash-output-redis (5.0.0)
logstash-output-stdout (3.1.4)
logstash-output-tcp (6.1.1)
logstash-output-udp (3.2.0)
logstash-output-webhdfs (3.0.6)
logstash-patterns-core (4.3.4)

JVM (e.g. java -version): no separate JVM installed; using the JDK bundled with the official Docker image

OS version: Linux 9c7bb12feea2 5.10.47-linuxkit #1 SMP Sat Jul 3 21:51:47 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

Intermittently, a Logstash pod's pipeline stops and the DeadLetterQueueWriter repeatedly logs "unable to finalize segment" warnings (see logs below).

Steps to reproduce:

Environment variables:

    - name: LOG_LEVEL
      value: info
    - name: LS_JAVA_OPTS
      value: -Xmx1g -Xms1g -Dnetworkaddress.cache.ttl=0 -Dlog4j2.formatMsgNoLookups=true
    - name: QUEUE_TYPE
      value: persisted
    - name: PATH_QUEUE
      value: /logstash-data/persisted_queue
    - name: PIPELINE_BATCH_SIZE
      value: "500"
    - name: XPACK_MONITORING_ENABLED
      value: "false"
    - name: DEAD_LETTER_QUEUE_ENABLE
      value: "true"
    - name: PATH_DEAD_LETTER_QUEUE
      value: /usr/share/logstash/data/dead_letter_queue
    - name: DEAD_LETTER_QUEUE_MAX_BYTES
      value: "209715200"
Pipeline configuration (filters redacted):

input {
  kinesis {
    kinesis_stream_name => "kinesis-stream"
    initial_position_in_stream => "LATEST"
    application_name => "application_name"
    region => "eu-west-2"
    codec => json { ecs_compatibility => "disabled" }
    additional_settings => {"initial_lease_table_read_capacity" => 10 "initial_lease_table_write_capacity" => 50}
  }
}

input {
  dead_letter_queue {
    id => "kubernetes_dlq"
    path => "/usr/share/logstash/data/dead_letter_queue"
    sincedb_path => "/usr/share/logstash/data/sincedb_dlq"
    commit_offsets => true
    clean_consumed => true
    tags => ["dlq"]
  }
}

output {
    opensearch {
      id => "kubernetes_es"
      index => "%{[@metadata][index_field]}"
      hosts => [ "https://endpoint:443" ]
      manage_template => false
      ssl => true
      timeout => 200
      retry_initial_interval => 100
      retry_max_interval => 900
      user => "${LS_ELASTICSEARCH_USER}"
      password => "${LS_ELASTICSEARCH_PASSWORD}"
      validate_after_inactivity => 60
    }
}

I have redacted the filters, etc.

Provide logs (if relevant): the following log lines are repeated several times a second:

[2023-06-01T14:35:33,894][WARN ][org.logstash.common.io.DeadLetterQueueWriter] unable to finalize segment
java.nio.file.NoSuchFileException: /usr/share/logstash/data/dead_letter_queue/main/10030.log.tmp -> /usr/share/logstash/data/dead_letter_queue/main/10030.log
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) ~[?:?]
        at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:416) ~[?:?]
        at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:266) ~[?:?]
        at java.nio.file.Files.move(Files.java:1432) ~[?:?]
        at org.logstash.common.io.DeadLetterQueueWriter.sealSegment(DeadLetterQueueWriter.java:500) ~[logstash-core.jar:?]
        at org.logstash.common.io.DeadLetterQueueWriter.finalizeSegment(DeadLetterQueueWriter.java:486) ~[logstash-core.jar:?]
        at org.logstash.common.io.DeadLetterQueueWriter.flushCheck(DeadLetterQueueWriter.java:462) ~[logstash-core.jar:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) [?:?]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) [?:?]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) [?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
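
For context on the failure in the trace above: the flush path in sealSegment renames the in-progress segment file <N>.log.tmp to its final name <N>.log via Files.move, and the NoSuchFileException means the .tmp source was already gone when the rename ran. The stand-alone Java sketch below (not Logstash code; the file names and the ATOMIC_MOVE option are assumptions for illustration) reproduces the same exception shape:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Minimal sketch of the rename that fails in DeadLetterQueueWriter#sealSegment:
// if the .tmp segment has vanished before the flush runs, the move throws
// NoSuchFileException with the "source -> target" message seen in the log.
public class SealSegmentSketch {
    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("dlq-demo");
        Path tmp = dir.resolve("10030.log.tmp");
        Path sealed = dir.resolve("10030.log");

        Files.createFile(tmp);
        Files.delete(tmp); // simulate the segment disappearing before the flush fires

        try {
            Files.move(tmp, sealed, StandardCopyOption.ATOMIC_MOVE);
        } catch (NoSuchFileException e) {
            System.out.println("unable to finalize segment: " + e.getMessage());
        }
    }
}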
yaauie commented 1 year ago

Is this an artificial reproduction? Are we expecting the events that are rejected by your output to get re-injected into the same pipeline?

ukho-cfreeman commented 1 year ago

Hi @yaauie

My apologies, I didn't include the full configuration, which hasn't helped. When events are rejected they get re-ingested via the DLQ input plugin, but they are modified to include the DLQ reason and converted to a string.

To confirm: this is happening in all live environments.

input {
  kinesis {
    kinesis_stream_name => "stream_name"
    initial_position_in_stream => "LATEST"
    application_name => "application_name"
    region => "eu-west-2"
    codec => json { ecs_compatibility => "disabled" }
    additional_settings => {"initial_lease_table_read_capacity" => 20 "initial_lease_table_write_capacity" => 100}
  }
}

input {
  dead_letter_queue {
    id => "kubernetes_dlq"
    path => "/usr/share/logstash/data/dead_letter_queue"
    sincedb_path => "/usr/share/logstash/data/sincedb_dlq"
    commit_offsets => true
    clean_consumed => true
    tags => ["dlq"]
  }
}

filter {
  if "dlq" in [tags] {
    json_encode {
      source => "message_json"
      target => "message"
      remove_field => [ "message_json" ]
      remove_tag => ["kubernetes", "docker", "kubernetes_filtered"]
    }

    mutate {
      add_field => { "dlq_reason" => "%{[@metadata][dead_letter_queue][reason]}" }
    }
  }
}

output {
    opensearch {
      id => "kubernetes_es"
      index => "%{[@metadata][index_field]}"
      hosts => [ "endpoint" ]
      manage_template => false
      ssl => true
      timeout => 200
      retry_initial_interval => 100
      retry_max_interval => 900
      user => "${LS_ELASTICSEARCH_USER}"
      password => "${LS_ELASTICSEARCH_PASSWORD}"
      validate_after_inactivity => 60
    }
}
vosmax commented 1 year ago

Having the same issue with version 8.6.2.

[WARN ] 2023-06-16 20:34:32.341 [dlq-flush-check] DeadLetterQueueWriter - unable to finalize segment
java.nio.file.NoSuchFileException: /usr/share/logstash/data/dead_letter_queue/app-logs-streams/5379.log.tmp -> /usr/share/logstash/data/dead_letter_queue/app-logs-streams/5379.log
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) ~[?:?]
    at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:416) ~[?:?]
    at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:266) ~[?:?]
    at java.nio.file.Files.move(Files.java:1432) ~[?:?]
    at org.logstash.common.io.DeadLetterQueueWriter.finalizeSegment(DeadLetterQueueWriter.java:469) ~[logstash-core.jar:?]
    at org.logstash.common.io.DeadLetterQueueWriter.flushCheck(DeadLetterQueueWriter.java:446) ~[logstash-core.jar:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) [?:?]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) [?:?]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) [?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
    at java.lang.Thread.run(Thread.java:833) [?:?]
$ ls -la /usr/share/logstash/data/dead_letter_queue/app-logs-streams/
total 128
drwxr-sr-x 2 logstash nogroup     58 Jun 16 17:45 .
drwxr-sr-x 7 logstash nogroup    112 Jun 15 12:55 ..
-rw-r--r-- 1 logstash nogroup 130973 Jun 16 17:45 5379.log
-rw-r--r-- 1 logstash nogroup      0 Jun 15 12:55 dlq_reader.lock
-rw-r--r-- 1 logstash nogroup      0 Jun 15 12:55 .lock

In another pod, I see a different situation:

[WARN ] 2023-06-16 20:46:10.278 [dlq-flush-check] DeadLetterQueueWriter - unable to finalize segment
java.nio.file.NoSuchFileException: /usr/share/logstash/data/dead_letter_queue/app-logs-streams/3259.log.tmp -> /usr/share/logstash/data/dead_letter_queue/app-logs-streams/3259.log
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) ~[?:?]
    at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:416) ~[?:?]
    at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:266) ~[?:?]
    at java.nio.file.Files.move(Files.java:1432) ~[?:?]
    at org.logstash.common.io.DeadLetterQueueWriter.finalizeSegment(DeadLetterQueueWriter.java:469) ~[logstash-core.jar:?]
    at org.logstash.common.io.DeadLetterQueueWriter.flushCheck(DeadLetterQueueWriter.java:446) ~[logstash-core.jar:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) [?:?]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) [?:?]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) [?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
    at java.lang.Thread.run(Thread.java:833) [?:?]
$ ls -la /usr/share/logstash/data/dead_letter_queue/app-logs-streams/
total 0
drwxr-sr-x 2 logstash nogroup  42 Jun 16 10:15 .
drwxr-sr-x 7 logstash nogroup 112 Jun 15 12:55 ..
-rw-r--r-- 1 logstash nogroup   0 Jun 15 12:55 dlq_reader.lock
-rw-r--r-- 1 logstash nogroup   0 Jun 15 12:55 .lock
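
One possible reading of the listings above, offered as a sketch rather than a diagnosis: the warning is emitted by the scheduled "dlq-flush-check" task visible in the stack trace, so once the .tmp segment it expects has disappeared (for example after being consumed and cleaned from the same directory), every subsequent flush attempt fails the same way and the message keeps repeating. The hypothetical stand-alone Java sketch below (not Logstash code; class name, file names, and the interval are invented for illustration) mimics that loop:

import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DlqFlushCheckSketch {
    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("dlq-flush-demo");
        Path tmp = dir.resolve("5379.log.tmp");
        Files.createFile(tmp);

        // Periodic flush task, analogous to the "dlq-flush-check" thread in the trace:
        // it keeps trying to seal the current segment by renaming .log.tmp to .log.
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                Files.move(tmp, dir.resolve("5379.log"), StandardCopyOption.ATOMIC_MOVE);
            } catch (NoSuchFileException e) {
                // One warning per flush interval, matching the repeating WARN lines.
                System.out.println("unable to finalize segment: " + e.getMessage());
            } catch (Exception e) {
                e.printStackTrace();
            }
        }, 500, 500, TimeUnit.MILLISECONDS);

        // Another actor removes the in-progress segment before the first flush fires,
        // e.g. a consumer cleaning the same dead_letter_queue directory.
        Thread.sleep(200);
        Files.delete(tmp);

        Thread.sleep(2000);
        scheduler.shutdownNow();
    }
}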
ukho-cfreeman commented 1 year ago

Hi @vosmax, is it possible to share your Logstash config/arguments/env variables so we can check whether there is any correlation?

vosmax commented 1 year ago

Hi @ukho-cfreeman, sure. logstash.yml:

        log.level: info
        http.host: "0.0.0.0"
        xpack.monitoring.enabled: false
        xpack.monitoring.collection.pipeline.details.enabled: false
        dead_letter_queue.enable: true
        dead_letter_queue.retain.age: 4h
        path.dead_letter_queue: "/usr/share/logstash/data/dead_letter_queue"

dlq.yml:

        input {
          dead_letter_queue {
            path => "/usr/share/logstash/data/dead_letter_queue"
            commit_offsets => true
            clean_consumed => true
            pipeline_id => "app-logs-streams"
          }
        }
        filter {
          mutate {
            add_field => {
              "error_reason" => "%%{[@metadata][dead_letter_queue][reason]}"
              "error_type" => "%%{[@metadata][dead_letter_queue][type]}"
            }
          }
          prune {
            whitelist_names => ["@timestamp","error","kubernetes","message"]
          }
        }
        output {
          elasticsearch {
            hosts => [ <host> ]
            ssl => true
            ssl_certificate_verification => false
            user => "<user>"
            password => "<password>"
            index => "<index_name>"
            manage_template => false
            ilm_enabled => false
            template_name => "<tpl_name>"
            action => "create"
          }
        }

In the env vars there is only the ES configuration.

ukho-cfreeman commented 1 year ago

Thanks @vosmax, there doesn't seem to be any indication that this is caused by our configuration. Would it be possible to see your Dockerfile, to check whether there is any correlation with mine?

FROM docker.elastic.co/logstash/logstash:8.7.1

RUN logstash-plugin install --version 3.0.3 logstash-filter-json_encode
RUN logstash-plugin install --version 2.2.1 logstash-input-kinesis
RUN logstash-plugin install --version 2.0.0 logstash-output-opensearch

USER root
ENV TINI_VERSION v0.19.0
ADD https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini /tini
RUN chmod +x /tini
ENTRYPOINT ["/tini", "--"]

USER logstash
CMD ["/usr/local/bin/docker-entrypoint", "--config.reload.automatic", "-f", "/logstash/conf.d/config/logstash.conf"]
turbotankist commented 1 year ago

The same problem in 8.8.2. Support (the worst I've dealt with) gave me the link to this issue and couldn't tell me anything more.