druid-io / druid-operator

Druid Kubernetes Operator
Other
205 stars 92 forks source link

MiddleManagers keep disappearing #317

Closed winsmith closed 8 months ago

winsmith commented 1 year ago

Hi. I'm currently evaluating running Druid on a Kubernetes Cluster instead of bare metal. Besides druid-operator I use druid-kubernetes-extensions instead of Zookeeper.

I currently have 1 coordinator and router, and 2 brokers, 2 historicals and 2 middlemanagers set up. I'm pretty sure I'm going to scale this up later – right now my objective is to load my data into this cluster and see where it needs more performance. I ingest data using Kafka.

The Problem

Every time I start an ingestion task, right when the task finishes, the MiddleManager that runs the task instead vanishes from the "Services" list. It's just gone. The MiddleManager is no longer available for the overlord to run tasks on. The fact that the task has completed never reaches the overlord, and instead the overlord kills the task after a while as failed (which, curiously enough, reaches the MiddleManager). However, the logs on the middiemanager show no errors and indicate that the task has been completed.

How do I start debugging this? I didn't find any error messages to google, and the fact that there are no errors in the middlemanagers makes it hard to find a specific clue.

MiddleManager Configuration


    middlemanagers:
      druid.port: 8088
      nodeType: "middleManager"
      nodeConfigMountPath: /opt/druid/conf/druid/cluster/data/middleManager
      podDisruptionBudgetSpec:
        maxUnavailable: 1
      replicas: 1
      resources:
        limits:
          cpu: "2"
          memory: 15Gi
        requests:
          cpu: "2"
          memory: 15Gi
      hpAutoscaler:
        maxReplicas: 5
        minReplicas: 1
        scaleTargetRef:
          apiVersion: apps/v1
          kind: StatefulSet
          name: druid-druid-cluster-middlemanagers
        metrics:
          - type: Resource
            resource:
              name: cpu
              target:
                type: Utilization
                averageUtilization: 60
          - type: Resource
            resource:
              name: memory
              target:
                type: Utilization
                averageUtilization: 60
      livenessProbe:
        initialDelaySeconds: 60
        periodSeconds: 5
        failureThreshold: 3
        httpGet:
          path: /status/health
          port: 8088
      readinessProbe:
        initialDelaySeconds: 60
        periodSeconds: 5
        failureThreshold: 3
        httpGet:
          path: /status/health
          port: 8088
      runtime.properties: |
        druid.service=druid/middleManager
        druid.worker.capacity=4
        druid.indexer.runner.javaOpts=-server -XX:MaxDirectMemorySize=10240g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/druid/data/tmp -Dlog4j.debug -XX:+UnlockDiagnosticVMOptions -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=50 -XX:GCLogFileSize=10m -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:+UseG1GC -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager -XX:HeapDumpPath=/druid/data/logs/peon.%t.%p.hprof -Xms10G -Xmx10G
        druid.indexer.task.baseTaskDir=var/druid/task
        # HTTP server threads
        druid.server.http.numThreads=10
        druid.indexer.fork.property.druid.processing.buffer.sizeBytes=1
        druid.indexer.fork.property.druid.processing.numMergeBuffers=1
        druid.indexer.fork.property.druid.processing.numThreads=1
        # Processing threads and buffers on Peons
        druid.indexer.fork.property.druid.processing.numMergeBuffers=2
        druid.indexer.fork.property.druid.processing.buffer.sizeBytes=100000000
        druid.indexer.fork.property.druid.processing.numThreads=1
      extra.jvm.options: |-
        -Xmx4G
        -Xms4G

MiddleManager Logs

...
2022-10-31T07:13:50,561 INFO [forking-task-runner-0] org.apache.druid.indexing.overlord.ForkingTaskRunner - Running command: java -cp /tmp/conf/druid/cluster/_common:/tmp/conf/druid/cluster/data/middleManager:lib/aws-java-sdk-sts-1.12.264.jar:lib/commons-codec-1.13.jar:lib/shims-0.9.0.jar:lib/stax-ex-1.8.jar:lib/javax.activation-api-1.2.0.jar:lib/guava-16.0.1.jar:lib/guice-4.1.0.jar:lib/vavr-0.10.2.jar:lib/airline-2.8.4.jar:lib/asm-tree-9.3.jar:lib/maven-aether-provider-3.1.1.jar:lib/commons-text-1.3.jar:lib/maven-model-builder-3.1.1.jar:lib/netty-codec-dns-4.1.68.Final.jar:lib/javax.activation-1.2.0.jar:lib/druid-core-24.0.0.jar:lib/jetty-server-9.4.48.v20220622.jar:lib/istack-commons-runtime-3.0.7.jar:lib/caffeine-2.8.0.jar:lib/jcl-over-slf4j-1.7.36.jar:lib/protobuf-java-3.11.0.jar:lib/spymemcached-2.12.3.jar:lib/commons-compiler-3.0.11.jar:lib/aether-api-0.9.0.M2.jar:lib/jetty-client-9.4.48.v20220622.jar:lib/aopalliance-1.0.jar:lib/jetty-continuation-9.4.48.v20220622.jar:lib/antlr4-runtime-4.5.1.jar:lib/commons-pool2-2.2.jar:lib/plexus-utils-3.0.24.jar:lib/reactive-streams-1.0.2.jar:lib/netty-handler-4.1.68.Final.jar:lib/jersey-guice-1.19.4.jar:lib/fastutil-8.5.4.jar:lib/ipaddress-5.3.4.jar:lib/derbynet-10.14.2.0.jar:lib/netty-handler-proxy-4.1.68.Final.jar:lib/resilience4j-core-1.3.1.jar:lib/aws-java-sdk-core-1.12.264.jar:lib/httpcore-4.4.11.jar:lib/netty-resolver-4.1.68.Final.jar:lib/curator-client-4.3.0.jar:lib/rhino-1.7.11.jar:lib/RoaringBitmap-0.9.0.jar:lib/druid-console-24.0.0.jar:lib/google-api-client-1.26.0.jar:lib/validation-api-1.1.0.Final.jar:lib/calcite-linq4j-1.21.0.jar:lib/commons-io-2.11.0.jar:lib/FastInfoset-1.2.15.jar:lib/aggdesigner-algorithm-6.0.jar:lib/netty-buffer-4.1.68.Final.jar:lib/netty-transport-4.1.68.Final.jar:lib/aws-java-sdk-s3-1.12.264.jar:lib/google-http-client-jackson2-1.26.0.jar:lib/tesla-aether-0.0.5.jar:lib/jsr305-2.0.1.jar:lib/netty-common-4.1.68.Final.jar:lib/async-http-client-2.5.3.jar:lib/commons-beanutils-1.9.4.jar:lib/json-path-2.3.0.jar:lib/lz4-java-1.8.0.jar:lib/jetty-http-9.4.48.v20220622.jar:lib/avatica-metrics-1.17.0.jar:lib/druid-services-24.0.0.jar:lib/ion-java-1.0.2.jar:lib/jackson-jaxrs-base-2.10.5.jar:lib/jackson-module-guice-2.10.5.jar:lib/checker-qual-2.5.7.jar:lib/druid-hll-24.0.0.jar:lib/javax.servlet-api-3.1.0.jar:lib/druid-sql-24.0.0.jar:lib/joda-time-2.10.5.jar:lib/slf4j-api-1.7.36.jar:lib/commons-math3-3.6.1.jar:lib/netty-codec-4.1.68.Final.jar:lib/netty-codec-socks-4.1.68.Final.jar:lib/jboss-logging-3.2.1.Final.jar:lib/compress-lzf-1.0.4.jar:lib/jackson-jaxrs-smile-provider-2.10.5.jar:lib/fastutil-extra-8.5.4.jar:lib/log4j-1.2-api-2.18.0.jar:lib/jsr311-api-1.1.1.jar:lib/datasketches-memory-2.0.0.jar:lib/fastutil-core-8.5.4.jar:lib/hibernate-validator-5.2.5.Final.jar:lib/opencsv-4.6.jar:lib/error_prone_annotations-2.11.0.jar:lib/maven-settings-builder-3.1.1.jar:lib/jackson-databind-2.10.5.1.jar:lib/jackson-dataformat-cbor-2.10.5.jar:lib/guice-assistedinject-4.1.0.jar:lib/druid-indexing-service-24.0.0.jar:lib/log4j-jul-2.18.0.jar:lib/classmate-1.1.0.jar:lib/commons-dbcp2-2.0.1.jar:lib/maven-artifact-3.6.0.jar:lib/aws-java-sdk-kms-1.12.264.jar:lib/aether-spi-0.9.0.M2.jar:lib/wagon-provider-api-2.4.jar:lib/asm-analysis-9.3.jar:lib/asm-commons-9.3.jar:lib/config-magic-0.9.jar:lib/maven-repository-metadata-3.1.1.jar:lib/guice-servlet-4.1.0.jar:lib/jaxb-api-2.3.1.jar:lib/google-oauth-client-1.26.0.jar:lib/aether-util-0.9.0.M2.jar:lib/avatica-server-1.17.0.jar:lib/httpclient-4.5.13.jar:lib/jackson-jq-0.0.10.jar:lib/jersey-core-1.19.4.jar:lib/druid-processing-24.0.0.jar:lib/jackson-dataformat-smile-2.10.5.jar:lib/druid-gcp-common-24.0.0.jar:lib/j2objc-annotations-1.1.jar:lib/jetty-util-ajax-9.4.48.v20220622.jar:lib/calcite-core-1.21.0.jar:lib/commons-lang-2.6.jar:lib/log4j-api-2.18.0.jar:lib/json-smart-2.3.jar:lib/txw2-2.3.1.jar:lib/zookeeper-jute-3.5.9.jar:lib/aether-connector-okhttp-0.0.9.jar:lib/netty-transport-native-epoll-4.1.68.Final-linux-x86_64.jar:lib/jakarta.activation-api-1.2.1.jar:lib/datasketches-java-3.2.0.jar:lib/jackson-annotations-2.10.5.jar:lib/jetty-rewrite-9.4.48.v20220622.jar:lib/jetty-util-9.4.48.v20220622.jar:lib/google-http-client-1.26.0.jar:lib/extendedset-24.0.0.jar:lib/netty-3.10.6.Final.jar:lib/jackson-jaxrs-json-provider-2.10.5.jar:lib/plexus-interpolation-1.19.jar:lib/jaxb-runtime-2.3.1.jar:lib/commons-collections-3.2.2.jar:lib/resilience4j-bulkhead-1.3.1.jar:lib/maven-model-3.1.1.jar:lib/jakarta.inject-api-1.0.3.jar:lib/curator-framework-4.3.0.jar:lib/jcodings-1.0.43.jar:lib/aws-java-sdk-ec2-1.12.264.jar:lib/cron-scheduler-0.1.jar:lib/jvm-attach-api-1.5.jar:lib/derby-10.14.2.0.jar:lib/accessors-smart-1.2.jar:lib/async-http-client-netty-utils-2.5.3.jar:lib/jetty-proxy-9.4.48.v20220622.jar:lib/xz-1.8.jar:lib/icu4j-55.1.jar:lib/maven-settings-3.1.1.jar:lib/jdbi-2.63.1.jar:lib/disruptor-3.3.6.jar:lib/druid-server-24.0.0.jar:lib/vavr-match-0.10.2.jar:lib/netty-resolver-dns-4.1.68.Final.jar:lib/jetty-security-9.4.48.v20220622.jar:lib/javax.el-api-3.0.0.jar:lib/zstd-jni-1.5.2-3.jar:lib/audience-annotations-0.5.0.jar:lib/javax.el-3.0.0.jar:lib/jackson-core-2.10.5.jar:lib/metrics-core-4.0.0.jar:lib/commons-lang3-3.8.1.jar:lib/jetty-io-9.4.48.v20220622.jar:lib/curator-recipes-4.3.0.jar:lib/esri-geometry-api-2.2.0.jar:lib/javax.inject-1.jar:lib/aether-impl-0.9.0.M2.jar:lib/jackson-datatype-guava-2.10.5.jar:lib/jna-4.5.1.jar:lib/jmespath-java-1.12.264.jar:lib/commons-collections4-4.2.jar:lib/jersey-server-1.19.4.jar:lib/jackson-module-jaxb-annotations-2.10.5.jar:lib/commons-compress-1.21.jar:lib/log4j-slf4j-impl-2.18.0.jar:lib/commons-logging-1.1.1.jar:lib/avatica-core-1.17.0.jar:lib/curator-x-discovery-4.3.0.jar:lib/zookeeper-3.5.9.jar:lib/jetty-servlets-9.4.48.v20220622.jar:lib/netty-reactive-streams-2.0.0.jar:lib/guice-multibindings-4.1.0.jar:lib/jetty-servlet-9.4.48.v20220622.jar:lib/netty-codec-http-4.1.68.Final.jar:lib/druid-aws-common-24.0.0.jar:lib/jakarta.xml.bind-api-2.3.2.jar:lib/okhttp-1.0.2.jar:lib/jackson-datatype-joda-2.10.5.jar:lib/log4j-core-2.18.0.jar:lib/sigar-1.6.5.132.jar:lib/netty-transport-native-unix-common-4.1.68.Final.jar:lib/jersey-servlet-1.19.4.jar:lib/druid-indexing-hadoop-24.0.0.jar:lib/asm-9.3.jar:lib/janino-3.0.11.jar:lib/joni-2.1.27.jar:lib/derbyclient-10.14.2.0.jar:lib/airline-io-2.8.4.jar:lib/aether-connector-file-0.9.0.M2.jar: -XX:ActiveProcessorCount=4 -server -XX:MaxDirectMemorySize=10240g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/druid/data/tmp -Dlog4j.debug -XX:+UnlockDiagnosticVMOptions -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=50 -XX:GCLogFileSize=10m -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:+UseG1GC -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager -XX:HeapDumpPath=/druid/data/logs/peon.%t.%p.hprof -Xms1G -Xmx1G -Ddruid.indexer.task.baseTaskDir=var/druid/task -Ddruid.host=10.244.23.36 -Ddruid.serverview.type=http -Ddruid.metadata.storage.connector.password=<masked> -Ddruid.indexer.fork.property.druid.processing.numThreads=1 -Ddruid.azure.key=YKzeQlpW76zXIgBmLTMO/iFWsplQTvf+MYff4s4UxpPGYswH8msv2fU56VNkwCtVjXmTpfipjknf+AStLuA5Ng== -Ddruid.indexer.fork.property.druid.processing.buffer.sizeBytes=32000000 -Ddruid.discovery.k8s.clusterIdentifier=td-cube-cluster -Duser.timezone=UTC -Dfile.encoding.pkg=sun.io -Ddruid.selectors.coordinator.serviceName=druid/coordinator -Ddruid.selectors.indexing.serviceName=druid/overlord -Ddruid.lookup.enableLookupSyncOnStartup=false -Ddruid.port=8088 -Ddruid.server.http.numThreads=1 -Ddruid.worker.capacity=2 -Ddruid.azure.container=druid -Ddruid.azure.account=telemetrydeckcoldstorage -Ddruid.zk.service.enabled=false -Ddruid.service=druid/middleManager -Ddruid.metadata.storage.connector.user=telemetryadmin -Ddruid.indexer.runner.type=httpRemote -Ddruid.metadata.storage.type=postgresql -Ddruid.metadata.storage.connector.connectURI=jdbc:postgresql://playground-postgres-metadata-public.postgres.database.azure.com:5432/druid -Ddruid.coordinator.loadqueuepeon.type=http -Djava.io.tmpdir=/druid/data -Ddruid.extensions.loadList=["druid-kubernetes-extensions", "druid-kafka-indexing-service", "druid-azure-extensions", "postgresql-metadata-storage", "druid-datasketches"] -Ddruid.discovery.type=k8s -Ddruid.indexer.logs.directory=/druid/data/indexing-logs -Dfile.encoding=UTF-8 -Ddruid.storage.type=azure -Ddruid.indexer.fork.property.druid.processing.numMergeBuffers=2 -Ddruid.indexer.logs.type=file -Ddruid.processing.numThreads=1 -Ddruid.processing.buffer.sizeBytes=32000000 -Ddruid.processing.numMergeBuffers=2 -Ddruid.metrics.emitter.dimension.dataSource=telemetry-signals -Ddruid.metrics.emitter.dimension.taskId=index_kafka_telemetry-signals_acf5149bd6e81ae_jjipfpmd -Ddruid.metrics.emitter.dimension.taskType=index_kafka -Ddruid.host=10.244.23.36 -Ddruid.plaintextPort=8100 -Ddruid.tlsPort=-1 -Ddruid.task.executor.service=druid/middleManager -Ddruid.task.executor.host=10.244.23.36 -Ddruid.task.executor.plaintextPort=8088 -Ddruid.task.executor.enablePlaintextPort=true -Ddruid.task.executor.tlsPort=-1 -Ddruid.task.executor.enableTlsPort=false -Dlog4j2.configurationFactory=org.apache.druid.indexing.common.tasklogs.ConsoleLoggingEnforcementConfigurationFactory org.apache.druid.cli.Main internal peon var/druid/task/index_kafka_telemetry-signals_acf5149bd6e81ae_jjipfpmd/task.json var/druid/task/index_kafka_telemetry-signals_acf5149bd6e81ae_jjipfpmd/93fb2af6-0838-4249-afa5-b27e1fd46818/status.json var/druid/task/index_kafka_telemetry-signals_acf5149bd6e81ae_jjipfpmd/93fb2af6-0838-4249-afa5-b27e1fd46818/report.json --loadBroadcastSegments true
TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.druid.java.util.common.logger.Logger
2022-10-31T07:13:50,567 INFO [forking-task-runner-0] org.apache.druid.indexing.overlord.ForkingTaskRunner - Logging task index_kafka_telemetry-signals_acf5149bd6e81ae_jjipfpmd output to: var/druid/task/index_kafka_telemetry-signals_acf5149bd6e81ae_jjipfpmd/log
TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.druid.java.util.common.logger.Logger
TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.druid.java.util.common.logger.Logger
2022-10-31T07:13:50,637 INFO [WorkerTaskManager-NoticeHandler] org.apache.druid.indexing.worker.WorkerTaskManager - Task[coordinator-issued_compact_telemetry-signals_gmohdmie_2022-10-30T20:00:32.744Z] started.
2022-10-31T07:13:50,640 INFO [forking-task-runner-1] org.apache.druid.indexing.overlord.ForkingTaskRunner - Running command: java -cp /tmp/conf/druid/cluster/_common:/tmp/conf/druid/cluster/data/middleManager:lib/aws-java-sdk-sts-1.12.264.jar:lib/commons-codec-1.13.jar:lib/shims-0.9.0.jar:lib/stax-ex-1.8.jar:lib/javax.activation-api-1.2.0.jar:lib/guava-16.0.1.jar:lib/guice-4.1.0.jar:lib/vavr-0.10.2.jar:lib/airline-2.8.4.jar:lib/asm-tree-9.3.jar:lib/maven-aether-provider-3.1.1.jar:lib/commons-text-1.3.jar:lib/maven-model-builder-3.1.1.jar:lib/netty-codec-dns-4.1.68.Final.jar:lib/javax.activation-1.2.0.jar:lib/druid-core-24.0.0.jar:lib/jetty-server-9.4.48.v20220622.jar:lib/istack-commons-runtime-3.0.7.jar:lib/caffeine-2.8.0.jar:lib/jcl-over-slf4j-1.7.36.jar:lib/protobuf-java-3.11.0.jar:lib/spymemcached-2.12.3.jar:lib/commons-compiler-3.0.11.jar:lib/aether-api-0.9.0.M2.jar:lib/jetty-client-9.4.48.v20220622.jar:lib/aopalliance-1.0.jar:lib/jetty-continuation-9.4.48.v20220622.jar:lib/antlr4-runtime-4.5.1.jar:lib/commons-pool2-2.2.jar:lib/plexus-utils-3.0.24.jar:lib/reactive-streams-1.0.2.jar:lib/netty-handler-4.1.68.Final.jar:lib/jersey-guice-1.19.4.jar:lib/fastutil-8.5.4.jar:lib/ipaddress-5.3.4.jar:lib/derbynet-10.14.2.0.jar:lib/netty-handler-proxy-4.1.68.Final.jar:lib/resilience4j-core-1.3.1.jar:lib/aws-java-sdk-core-1.12.264.jar:lib/httpcore-4.4.11.jar:lib/netty-resolver-4.1.68.Final.jar:lib/curator-client-4.3.0.jar:lib/rhino-1.7.11.jar:lib/RoaringBitmap-0.9.0.jar:lib/druid-console-24.0.0.jar:lib/google-api-client-1.26.0.jar:lib/validation-api-1.1.0.Final.jar:lib/calcite-linq4j-1.21.0.jar:lib/commons-io-2.11.0.jar:lib/FastInfoset-1.2.15.jar:lib/aggdesigner-algorithm-6.0.jar:lib/netty-buffer-4.1.68.Final.jar:lib/netty-transport-4.1.68.Final.jar:lib/aws-java-sdk-s3-1.12.264.jar:lib/google-http-client-jackson2-1.26.0.jar:lib/tesla-aether-0.0.5.jar:lib/jsr305-2.0.1.jar:lib/netty-common-4.1.68.Final.jar:lib/async-http-client-2.5.3.jar:lib/commons-beanutils-1.9.4.jar:lib/json-path-2.3.0.jar:lib/lz4-java-1.8.0.jar:lib/jetty-http-9.4.48.v20220622.jar:lib/avatica-metrics-1.17.0.jar:lib/druid-services-24.0.0.jar:lib/ion-java-1.0.2.jar:lib/jackson-jaxrs-base-2.10.5.jar:lib/jackson-module-guice-2.10.5.jar:lib/checker-qual-2.5.7.jar:lib/druid-hll-24.0.0.jar:lib/javax.servlet-api-3.1.0.jar:lib/druid-sql-24.0.0.jar:lib/joda-time-2.10.5.jar:lib/slf4j-api-1.7.36.jar:lib/commons-math3-3.6.1.jar:lib/netty-codec-4.1.68.Final.jar:lib/netty-codec-socks-4.1.68.Final.jar:lib/jboss-logging-3.2.1.Final.jar:lib/compress-lzf-1.0.4.jar:lib/jackson-jaxrs-smile-provider-2.10.5.jar:lib/fastutil-extra-8.5.4.jar:lib/log4j-1.2-api-2.18.0.jar:lib/jsr311-api-1.1.1.jar:lib/datasketches-memory-2.0.0.jar:lib/fastutil-core-8.5.4.jar:lib/hibernate-validator-5.2.5.Final.jar:lib/opencsv-4.6.jar:lib/error_prone_annotations-2.11.0.jar:lib/maven-settings-builder-3.1.1.jar:lib/jackson-databind-2.10.5.1.jar:lib/jackson-dataformat-cbor-2.10.5.jar:lib/guice-assistedinject-4.1.0.jar:lib/druid-indexing-service-24.0.0.jar:lib/log4j-jul-2.18.0.jar:lib/classmate-1.1.0.jar:lib/commons-dbcp2-2.0.1.jar:lib/maven-artifact-3.6.0.jar:lib/aws-java-sdk-kms-1.12.264.jar:lib/aether-spi-0.9.0.M2.jar:lib/wagon-provider-api-2.4.jar:lib/asm-analysis-9.3.jar:lib/asm-commons-9.3.jar:lib/config-magic-0.9.jar:lib/maven-repository-metadata-3.1.1.jar:lib/guice-servlet-4.1.0.jar:lib/jaxb-api-2.3.1.jar:lib/google-oauth-client-1.26.0.jar:lib/aether-util-0.9.0.M2.jar:lib/avatica-server-1.17.0.jar:lib/httpclient-4.5.13.jar:lib/jackson-jq-0.0.10.jar:lib/jersey-core-1.19.4.jar:lib/druid-processing-24.0.0.jar:lib/jackson-dataformat-smile-2.10.5.jar:lib/druid-gcp-common-24.0.0.jar:lib/j2objc-annotations-1.1.jar:lib/jetty-util-ajax-9.4.48.v20220622.jar:lib/calcite-core-1.21.0.jar:lib/commons-lang-2.6.jar:lib/log4j-api-2.18.0.jar:lib/json-smart-2.3.jar:lib/txw2-2.3.1.jar:lib/zookeeper-jute-3.5.9.jar:lib/aether-connector-okhttp-0.0.9.jar:lib/netty-transport-native-epoll-4.1.68.Final-linux-x86_64.jar:lib/jakarta.activation-api-1.2.1.jar:lib/datasketches-java-3.2.0.jar:lib/jackson-annotations-2.10.5.jar:lib/jetty-rewrite-9.4.48.v20220622.jar:lib/jetty-util-9.4.48.v20220622.jar:lib/google-http-client-1.26.0.jar:lib/extendedset-24.0.0.jar:lib/netty-3.10.6.Final.jar:lib/jackson-jaxrs-json-provider-2.10.5.jar:lib/plexus-interpolation-1.19.jar:lib/jaxb-runtime-2.3.1.jar:lib/commons-collections-3.2.2.jar:lib/resilience4j-bulkhead-1.3.1.jar:lib/maven-model-3.1.1.jar:lib/jakarta.inject-api-1.0.3.jar:lib/curator-framework-4.3.0.jar:lib/jcodings-1.0.43.jar:lib/aws-java-sdk-ec2-1.12.264.jar:lib/cron-scheduler-0.1.jar:lib/jvm-attach-api-1.5.jar:lib/derby-10.14.2.0.jar:lib/accessors-smart-1.2.jar:lib/async-http-client-netty-utils-2.5.3.jar:lib/jetty-proxy-9.4.48.v20220622.jar:lib/xz-1.8.jar:lib/icu4j-55.1.jar:lib/maven-settings-3.1.1.jar:lib/jdbi-2.63.1.jar:lib/disruptor-3.3.6.jar:lib/druid-server-24.0.0.jar:lib/vavr-match-0.10.2.jar:lib/netty-resolver-dns-4.1.68.Final.jar:lib/jetty-security-9.4.48.v20220622.jar:lib/javax.el-api-3.0.0.jar:lib/zstd-jni-1.5.2-3.jar:lib/audience-annotations-0.5.0.jar:lib/javax.el-3.0.0.jar:lib/jackson-core-2.10.5.jar:lib/metrics-core-4.0.0.jar:lib/commons-lang3-3.8.1.jar:lib/jetty-io-9.4.48.v20220622.jar:lib/curator-recipes-4.3.0.jar:lib/esri-geometry-api-2.2.0.jar:lib/javax.inject-1.jar:lib/aether-impl-0.9.0.M2.jar:lib/jackson-datatype-guava-2.10.5.jar:lib/jna-4.5.1.jar:lib/jmespath-java-1.12.264.jar:lib/commons-collections4-4.2.jar:lib/jersey-server-1.19.4.jar:lib/jackson-module-jaxb-annotations-2.10.5.jar:lib/commons-compress-1.21.jar:lib/log4j-slf4j-impl-2.18.0.jar:lib/commons-logging-1.1.1.jar:lib/avatica-core-1.17.0.jar:lib/curator-x-discovery-4.3.0.jar:lib/zookeeper-3.5.9.jar:lib/jetty-servlets-9.4.48.v20220622.jar:lib/netty-reactive-streams-2.0.0.jar:lib/guice-multibindings-4.1.0.jar:lib/jetty-servlet-9.4.48.v20220622.jar:lib/netty-codec-http-4.1.68.Final.jar:lib/druid-aws-common-24.0.0.jar:lib/jakarta.xml.bind-api-2.3.2.jar:lib/okhttp-1.0.2.jar:lib/jackson-datatype-joda-2.10.5.jar:lib/log4j-core-2.18.0.jar:lib/sigar-1.6.5.132.jar:lib/netty-transport-native-unix-common-4.1.68.Final.jar:lib/jersey-servlet-1.19.4.jar:lib/druid-indexing-hadoop-24.0.0.jar:lib/asm-9.3.jar:lib/janino-3.0.11.jar:lib/joni-2.1.27.jar:lib/derbyclient-10.14.2.0.jar:lib/airline-io-2.8.4.jar:lib/aether-connector-file-0.9.0.M2.jar: -XX:ActiveProcessorCount=4 -server -XX:MaxDirectMemorySize=10240g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/druid/data/tmp -Dlog4j.debug -XX:+UnlockDiagnosticVMOptions -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=50 -XX:GCLogFileSize=10m -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:+UseG1GC -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager -XX:HeapDumpPath=/druid/data/logs/peon.%t.%p.hprof -Xms1G -Xmx1G -Ddruid.indexer.task.baseTaskDir=var/druid/task -Ddruid.host=10.244.23.36 -Ddruid.serverview.type=http -Ddruid.metadata.storage.connector.password=<masked> -Ddruid.indexer.fork.property.druid.processing.numThreads=1 -Ddruid.azure.key=YKzeQlpW76zXIgBmLTMO/iFWsplQTvf+MYff4s4UxpPGYswH8msv2fU56VNkwCtVjXmTpfipjknf+AStLuA5Ng== -Ddruid.indexer.fork.property.druid.processing.buffer.sizeBytes=32000000 -Ddruid.discovery.k8s.clusterIdentifier=td-cube-cluster -Duser.timezone=UTC -Dfile.encoding.pkg=sun.io -Ddruid.selectors.coordinator.serviceName=druid/coordinator -Ddruid.selectors.indexing.serviceName=druid/overlord -Ddruid.lookup.enableLookupSyncOnStartup=false -Ddruid.port=8088 -Ddruid.server.http.numThreads=1 -Ddruid.worker.capacity=2 -Ddruid.azure.container=druid -Ddruid.azure.account=telemetrydeckcoldstorage -Ddruid.zk.service.enabled=false -Ddruid.service=druid/middleManager -Ddruid.metadata.storage.connector.user=telemetryadmin -Ddruid.indexer.runner.type=httpRemote -Ddruid.metadata.storage.type=postgresql -Ddruid.metadata.storage.connector.connectURI=jdbc:postgresql://playground-postgres-metadata-public.postgres.database.azure.com:5432/druid -Ddruid.coordinator.loadqueuepeon.type=http -Djava.io.tmpdir=/druid/data -Ddruid.extensions.loadList=["druid-kubernetes-extensions", "druid-kafka-indexing-service", "druid-azure-extensions", "postgresql-metadata-storage", "druid-datasketches"] -Ddruid.discovery.type=k8s -Ddruid.indexer.logs.directory=/druid/data/indexing-logs -Dfile.encoding=UTF-8 -Ddruid.storage.type=azure -Ddruid.indexer.fork.property.druid.processing.numMergeBuffers=2 -Ddruid.indexer.logs.type=file -Ddruid.processing.numThreads=1 -Ddruid.processing.buffer.sizeBytes=32000000 -Ddruid.processing.numMergeBuffers=2 -Ddruid.metrics.emitter.dimension.dataSource=telemetry-signals -Ddruid.metrics.emitter.dimension.taskId=coordinator-issued_compact_telemetry-signals_gmohdmie_2022-10-30T20:00:32.744Z -Ddruid.metrics.emitter.dimension.taskType=compact -Ddruid.host=10.244.23.36 -Ddruid.plaintextPort=8101 -Ddruid.tlsPort=-1 -Ddruid.task.executor.service=druid/middleManager -Ddruid.task.executor.host=10.244.23.36 -Ddruid.task.executor.plaintextPort=8088 -Ddruid.task.executor.enablePlaintextPort=true -Ddruid.task.executor.tlsPort=-1 -Ddruid.task.executor.enableTlsPort=false -Dlog4j2.configurationFactory=org.apache.druid.indexing.common.tasklogs.ConsoleLoggingEnforcementConfigurationFactory org.apache.druid.cli.Main internal peon var/druid/task/coordinator-issued_compact_telemetry-signals_gmohdmie_2022-10-30T20:00:32.744Z/task.json var/druid/task/coordinator-issued_compact_telemetry-signals_gmohdmie_2022-10-30T20:00:32.744Z/308003fb-ebdc-497c-968e-6f5a04fa9cdd/status.json var/druid/task/coordinator-issued_compact_telemetry-signals_gmohdmie_2022-10-30T20:00:32.744Z/308003fb-ebdc-497c-968e-6f5a04fa9cdd/report.json
2022-10-31T07:13:50,647 INFO [forking-task-runner-1] org.apache.druid.indexing.overlord.ForkingTaskRunner - Logging task coordinator-issued_compact_telemetry-signals_gmohdmie_2022-10-30T20:00:32.744Z output to: var/druid/task/coordinator-issued_compact_telemetry-signals_gmohdmie_2022-10-30T20:00:32.744Z/log
2022-10-31T07:26:46,303 INFO [forking-task-runner-0] org.apache.druid.indexing.common.tasklogs.FileTaskLogs - Wrote task log to: /druid/data/indexing-logs/index_kafka_telemetry-signals_acf5149bd6e81ae_jjipfpmd.log
2022-10-31T07:26:46,303 INFO [forking-task-runner-0] org.apache.druid.indexing.common.tasklogs.FileTaskLogs - Wrote task report to: /druid/data/indexing-logs/index_kafka_telemetry-signals_acf5149bd6e81ae_jjipfpmd.report.json
2022-10-31T07:26:46,303 INFO [forking-task-runner-0] org.apache.druid.indexing.overlord.ForkingTaskRunner - Process exited successfully for task: index_kafka_telemetry-signals_acf5149bd6e81ae_jjipfpmd
2022-10-31T07:26:46,316 INFO [forking-task-runner-0] org.apache.druid.indexing.overlord.ForkingTaskRunner - Removing task directory: var/druid/task/index_kafka_telemetry-signals_acf5149bd6e81ae_jjipfpmd
2022-10-31T07:26:46,360 INFO [WorkerTaskManager-NoticeHandler] org.apache.druid.indexing.worker.WorkerTaskManager - Task [index_kafka_telemetry-signals_acf5149bd6e81ae_jjipfpmd] completed with status [SUCCESS].
TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.jboss.netty.logging.Slf4JLoggerFactory
TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.jboss.netty.logging.Slf4JLoggerFactory
TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.jboss.netty.logging.Slf4JLoggerFactory
TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.jboss.netty.logging.Slf4JLoggerFactory
TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.jboss.netty.logging.Slf4JLoggerFactory
winsmith commented 1 year ago

Update: I just switched from discovery via druid-kubernetes-extension to discovery via Zookeeper and it works. I think I'll take the hit of running a few Zookeeper pods if that means this issue is not cropping up.

If anyone has any comments or tips on how to resolve the original issue, I'd love that, otherwise I'll close the issue in an few days.

kortatu commented 5 months ago

I'm having the same problem. After each complete execution of a Peon, it says this is the last log of MidlleManager: completed with status [SUCCESS] And after that any new task stays in PENDING status as if there were no workers available: This is the last full log in a while for MiddleManager druid-druid-cluster-middlemanagers 2024-02-22T12:47:13,806 INFO [WorkerTaskManager-NoticeHandler] org.apache.druid.indexing.worker.WorkerTaskManager - Task [index_kafka_telemetry_bfa96517080dc71_dllebiic] completed with status [SUCCESS]

Afterwards I can see log lines like this in coordinator:

druid-druid-cluster-coordinators 2024-02-22T12:50:27,589 ERROR [org.apache.druid.k8s.discovery.K8sDruidNodeDiscoveryProvider$NodeRoleWatcherpeon] org.apache.druid.discovery.BaseNodeRoleWatcher - Noticed disappearance of unknown druid no de [http://10.244.6.147:8101] of role [peon].

druid-druid-cluster-coordinators 2024-02-22T12:50:28,072 ERROR [HttpServerInventoryView-2] org.apache.druid.server.coordination.ChangeRequestHttpSyncer - Sync failed for server[http://10.244.6.147:8100/] while [Handling response with co de[503], description[Service Unavailable]]. Failed [10] times in the last [219] seconds.. Try restarting the Druid process on server[http://10.244.6.147:8100/].: {exceptionType=org.apache.druid.java.util.common.ISE, exceptionMessage=Rec eived sync response [503], class=org.apache.druid.server.coordination.ChangeRequestHttpSyncer} (org.apache.druid.java.util.common.ISE: Received sync response [503])

The middleManager pod status is healthy, so the pod won't restart on their own. If I restart manually the container or kill the pod (and a new one is created), after some seconds the task is moved from PENDING to RUNNING.

I'm using version 29.0.0 of the containers in AKS. I'm also using druid-kubernetes-extension, and I'm thinking in switching to Zookeeper to test if the problem is there.