apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

Kafka Index Tasks Created -> Duplicated -> Pending -> Failed #15163

Closed: CoinCoderBuffalo closed this issue 2 weeks ago

CoinCoderBuffalo commented 10 months ago

Affected Version

apache/druid:25.0.0

Description

I'm running Druid in Kubernetes with the Druid Operator. I create 7 Kafka supervisors, which in turn create 7 Kafka indexing tasks. These all work fine for about an hour; then, for some reason, the 7 Kafka tasks get duplicated in a "pending" status, and a little later the original 7 tasks change to "failed".

Original 7 running fine for an hour: [screenshot]

After 1 hour, 7 duplicates have somehow been created, all in a "pending" status: [screenshot]

About 10 minutes later, all are in a "failed" state: [screenshot]

I've checked the coordinator and middlemanager logs and don't see any problems reported.
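
For reference, the one-hour mark lines up with the supervisor's taskDuration, which defaults to PT1H: when it elapses, the supervisor signals the current tasks to stop reading and publish their segments, and starts a replacement set of tasks. As a minimal sketch (not the actual specs used here; the topic, broker address, and all values are placeholders, and dataSchema/tuningConfig are omitted), the relevant part of a Kafka supervisor spec looks like:

    {
      "type": "kafka",
      "spec": {
        "ioConfig": {
          "type": "kafka",
          "topic": "example-topic",
          "consumerProperties": { "bootstrap.servers": "kafka-broker:9092" },
          "taskCount": 1,
          "replicas": 1,
          "taskDuration": "PT1H"
        }
      }
    }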

Coordinator log sample:

TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.druid.java.util.common.logger.Logger
TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.kafka.common.config.AbstractConfig
TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.kafka.common.utils.LogContext
TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.curator.RetryLoopImpl
(the LogContext and RetryLoopImpl lines repeat many times; nothing else appears in the sample)

MiddleManager log sample:

TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.curator.RetryLoopImpl
(this line repeats many times; nothing else appears in the sample)

middleManagers config:

middlemanagers:
      nodeType: "middleManager"
      nodeConfigMountPath: "/opt/druid/conf/druid/cluster/data/middleManager"
      druid.port: 8088
      services:
        - spec:
            type: ClusterIP
            clusterIP: None
      replicas: 1
      extra.jvm.options: |-
        -Xmx512m
        -Xms512m
      runtime.properties: |
        druid.service=druid/middleManager
        druid.worker.capacity=10
        druid.indexer.runner.javaOpts=-server -Xms2g -Xmx2g -XX:MaxDirectMemorySize=6g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/druid/data/tmp -XX:+ExitOnOutOfMemoryError -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
        druid.indexer.task.baseTaskDir=/druid/data/baseTaskDir
        druid.server.http.numThreads=10
        druid.indexer.fork.property.druid.processing.buffer.sizeBytes=1
        druid.indexer.fork.property.druid.processing.numMergeBuffers=1
        druid.indexer.fork.property.druid.processing.numThreads=1
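
A couple of hedged observations on this config (not a confirmed diagnosis): with druid.worker.capacity=10 on a single MiddleManager, 7 supervisors occupy 7 task slots in steady state, but during a rollover the 7 publishing tasks and their 7 replacements can briefly need up to 14 slots, which would leave some replacements pending. Also, each peon's direct-memory requirement is roughly sizeBytes * (numMergeBuffers + numThreads + 1), so a 1-byte processing buffer is unusually small; purely as an illustration (example values, not tuned for this cluster), a more typical sizing would look like:

        # illustrative only: keep sizeBytes * (numMergeBuffers + numThreads + 1) within -XX:MaxDirectMemorySize
        druid.indexer.fork.property.druid.processing.buffer.sizeBytes=134217728
        druid.indexer.fork.property.druid.processing.numMergeBuffers=2
        druid.indexer.fork.property.druid.processing.numThreads=1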

CoinCoderBuffalo commented 10 months ago

I made one configuration change, to druid.indexer.runner.type, and the behavior is different now:

        # Configure this Coordinator to also run as Overlord
        druid.coordinator.asOverlord.enabled=true
        druid.coordinator.asOverlord.overlordService=druid/overlord
        druid.indexer.queue.startDelay=PT30S
        druid.indexer.runner.type=remote
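
As a side note (based on the general Overlord configuration docs, not verified against this cluster): druid.indexer.runner.type defaults to "local", which runs tasks inside the Overlord/Coordinator process itself, whereas "remote" dispatches them to the MiddleManagers, e.g.:

        # "local" (default) runs tasks inside the Overlord process,
        # "remote" dispatches them to MiddleManagers via ZooKeeper,
        # "httpRemote" dispatches them over HTTP
        druid.indexer.runner.type=remote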

The Kafka tasks still respawn every hour, but they succeed this time. Is it normal behavior for Kafka tasks to respawn like this? It didn't behave this way in previous versions of Druid.

[screenshot]

github-actions[bot] commented 1 month ago

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

github-actions[bot] commented 2 weeks ago

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.