apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Minion task for hybrid table erroring about a retry attempt exhaustion #12547

Closed estebanz01 closed 7 months ago

estebanz01 commented 8 months ago

Hola! 👋

I'm having some trouble with a Pinot cluster deployed on Kubernetes with minion enabled. I want to move data from the real-time table to the offline table, but it's failing with the following information:

16:18:00.383 [TaskStateModelFactory-task_thread-7] ERROR org.apache.pinot.minion.taskfactory.TaskFactoryRegistry - Caught exception while executing task: Task_RealtimeToOfflineSegmentsTask_8961b037-3c41-47d7-b56f-375ef16dc2fc_1709569080105_0
org.apache.pinot.spi.utils.retry.AttemptsExceededException: Operation failed after 1 attempts
    at org.apache.pinot.spi.utils.retry.BaseRetryPolicy.attempt(BaseRetryPolicy.java:65) ~[pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
    at org.apache.pinot.common.utils.fetcher.HttpSegmentFetcher.fetchSegmentToLocal(HttpSegmentFetcher.java:62) ~[pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
    at org.apache.pinot.common.utils.fetcher.SegmentFetcherFactory.fetchSegmentToLocalInternal(SegmentFetcherFactory.java:158) ~[pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
    at org.apache.pinot.common.utils.fetcher.SegmentFetcherFactory.fetchSegmentToLocal(SegmentFetcherFactory.java:152) ~[pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
    at org.apache.pinot.common.utils.fetcher.SegmentFetcherFactory.fetchAndDecryptSegmentToLocalInternal(SegmentFetcherFactory.java:202) ~[pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
    at org.apache.pinot.common.utils.fetcher.SegmentFetcherFactory.fetchAndDecryptSegmentToLocal(SegmentFetcherFactory.java:190) ~[pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
    at org.apache.pinot.plugin.minion.tasks.BaseMultipleSegmentsConversionExecutor.executeTask(BaseMultipleSegmentsConversionExecutor.java:201) ~[pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
    at org.apache.pinot.plugin.minion.tasks.BaseMultipleSegmentsConversionExecutor.executeTask(BaseMultipleSegmentsConversionExecutor.java:77) ~[pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
    at org.apache.pinot.minion.taskfactory.TaskFactoryRegistry$1.runInternal(TaskFactoryRegistry.java:157) [pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
    at org.apache.pinot.minion.taskfactory.TaskFactoryRegistry$1.run(TaskFactoryRegistry.java:118) [pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
    at org.apache.helix.task.TaskRunner.run(TaskRunner.java:75) [pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:829) [?:?]

That's the only error I see in the minion pods, and there's nothing else in the other Pinot pods. Any ideas on how to debug this further? Here are the schema and table configs for my hybrid table:

Schema definition

```json
{
  "schemaName": "data_counting",
  "dimensionFieldSpecs": [
    { "name": "device_name", "dataType": "STRING" }
  ],
  "metricFieldSpecs": [
    { "name": "bytes_sent", "dataType": "LONG" }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "__key",
      "dataType": "TIMESTAMP",
      "format": "1:MICROSECONDS:EPOCH",
      "granularity": "1:MICROSECONDS"
    },
    {
      "name": "__metadata$eventTime",
      "dataType": "TIMESTAMP",
      "format": "1:MICROSECONDS:EPOCH",
      "granularity": "1:MICROSECONDS"
    }
  ]
}
```

Table configuration (REALTIME)

```json
{
  "REALTIME": {
    "tableName": "data_counting_REALTIME",
    "tableType": "REALTIME",
    "segmentsConfig": {
      "schemaName": "data_counting",
      "replication": "1",
      "retentionTimeUnit": "DAYS",
      "retentionTimeValue": "15",
      "replicasPerPartition": "1",
      "minimizeDataMovement": false,
      "timeColumnName": "__key"
    },
    "tenants": {
      "broker": "DefaultTenant",
      "server": "DefaultTenant",
      "tagOverrideConfig": {}
    },
    "tableIndexConfig": {
      "invertedIndexColumns": [],
      "noDictionaryColumns": [],
      "streamConfigs": {
        "streamType": "pulsar",
        "stream.pulsar.topic.name": "persistent://client/devices/all",
        "stream.pulsar.bootstrap.servers": "pulsar://pulsar-proxy.pulsar.svc.cluster.local:6650",
        "stream.pulsar.prop.auto.offset.reset": "smallest",
        "stream.pulsar.consumer.type": "lowlevel",
        "stream.pulsar.fetch.timeout.millis": "20000",
        "stream.pulsar.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
        "stream.pulsar.consumer.factory.class.name": "org.apache.pinot.plugin.stream.pulsar.PulsarConsumerFactory",
        "realtime.segment.flush.threshold.rows": "10000",
        "realtime.segment.flush.threshold.time": "1h",
        "stream.pulsar.metada.populate": "true",
        "stream.pulsar.metadata.fields": "eventTime"
      },
      "loadMode": "MMAP",
      "onHeapDictionaryColumns": [],
      "varLengthDictionaryColumns": [],
      "enableDefaultStarTree": false,
      "enableDynamicStarTreeCreation": false,
      "aggregateMetrics": false,
      "nullHandlingEnabled": false,
      "rangeIndexColumns": [],
      "rangeIndexVersion": 2,
      "optimizeDictionary": false,
      "optimizeDictionaryForMetrics": false,
      "noDictionarySizeRatioThreshold": 0.85,
      "autoGeneratedInvertedIndex": false,
      "createInvertedIndexDuringSegmentGeneration": false,
      "sortedColumn": [],
      "bloomFilterColumns": []
    },
    "metadata": {},
    "quota": {},
    "task": {
      "taskTypeConfigsMap": {
        "RealtimeToOfflineSegmentsTask": {
          "bucketTimePeriod": "1h",
          "bufferTimePeriod": "2h",
          "mergeType": "concat",
          "maxNumRecordsPerSegment": "100000",
          "schedule": "0 * * * * ?"
        }
      }
    },
    "routing": {},
    "query": { "timeoutMs": 60000 },
    "ingestionConfig": {
      "continueOnError": false,
      "rowTimeValueCheck": false,
      "segmentTimeValueCheck": true
    },
    "isDimTable": false
  }
}
```

Table configuration (OFFLINE)

```json
{
  "OFFLINE": {
    "tableName": "data_counting_OFFLINE",
    "tableType": "OFFLINE",
    "segmentsConfig": {
      "schemaName": "data_counting",
      "replication": "1",
      "replicasPerPartition": "1",
      "timeColumnName": "__key",
      "minimizeDataMovement": false,
      "segmentPushType": "APPEND",
      "segmentPushFrequency": "HOURLY"
    },
    "tenants": {
      "broker": "DefaultTenant",
      "server": "DefaultTenant"
    },
    "tableIndexConfig": {
      "invertedIndexColumns": [],
      "noDictionaryColumns": [],
      "rangeIndexColumns": [],
      "rangeIndexVersion": 2,
      "createInvertedIndexDuringSegmentGeneration": false,
      "autoGeneratedInvertedIndex": false,
      "sortedColumn": [],
      "bloomFilterColumns": [],
      "loadMode": "MMAP",
      "onHeapDictionaryColumns": [],
      "varLengthDictionaryColumns": [],
      "enableDefaultStarTree": false,
      "enableDynamicStarTreeCreation": false,
      "aggregateMetrics": false,
      "nullHandlingEnabled": false,
      "optimizeDictionary": false,
      "optimizeDictionaryForMetrics": false,
      "noDictionarySizeRatioThreshold": 0.85
    },
    "metadata": {},
    "quota": {},
    "routing": {},
    "query": {},
    "ingestionConfig": {
      "continueOnError": false,
      "rowTimeValueCheck": false,
      "segmentTimeValueCheck": true
    },
    "isDimTable": false
  }
}
```

I'm using Apache Pulsar 3.2.0 and Apache Pinot 1.0.0.

estebanz01 commented 8 months ago

additional information from the pinot controller:

java.lang.IllegalStateException: Failed to move segment file for segment

```java
pinot-controller-2 controller 20:09:56.414 [grizzly-http-server-0] ERROR SegmentCompletionFSM_data_counting__0__3__20240304T2005Z - Caught exception while committing segment file for segment: data_counting__0__3__20240304T2005Z
pinot-controller-2 controller java.lang.IllegalStateException: Failed to move segment file for segment: data_counting_temp__0__3__20240304T2005Z from: file:/var/pinot/controller/data,s3:///pinot-data/pinot/controller-data/data_counting/data_counting__0__3__20240304T2005Z.tmp.7b041191-d8d0-4df9-bf97-543ca6c0a407 to: file:/var/pinot/controller/data,s3:///pinot-data/pinot/controller-data/data_counting/data_counting__0__3__20240304T2005Z
pinot-controller-2 controller     at org.apache.pinot.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:854) ~[pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
pinot-controller-2 controller     at org.apache.pinot.controller.helix.core.realtime.PinotLLCRealtimeSegmentManager.moveSegmentFile(PinotLLCRealtimeSegmentManager.java:1580) ~[pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
pinot-controller-2 controller     at org.apache.pinot.controller.helix.core.realtime.PinotLLCRealtimeSegmentManager.commitSegmentFile(PinotLLCRealtimeSegmentManager.java:489) ~[pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
pinot-controller-2 controller     at org.apache.pinot.controller.helix.core.realtime.SegmentCompletionManager$SegmentCompletionFSM.commitSegment(SegmentCompletionManager.java:1085) ~[pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
pinot-controller-2 controller     at org.apache.pinot.controller.helix.core.realtime.SegmentCompletionManager$SegmentCompletionFSM.segmentCommitEnd(SegmentCompletionManager.java:660) ~[pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
pinot-controller-2 controller     at org.apache.pinot.controller.helix.core.realtime.SegmentCompletionManager.segmentCommitEnd(SegmentCompletionManager.java:326) ~[pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
pinot-controller-2 controller     at org.apache.pinot.controller.api.resources.LLCSegmentCompletionHandlers.segmentCommitEndWithMetadata(LLCSegmentCompletionHandlers.java:444) ~[pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
pinot-controller-2 controller     at jdk.internal.reflect.GeneratedMethodAccessor333.invoke(Unknown Source) ~[?:?]
pinot-controller-2 controller     at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
pinot-controller-2 controller     at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
pinot-controller-2 controller     at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52) ~[pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
pinot-controller-2 controller     at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:134) [pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
pinot-controller-2 controller     at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:177) [pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
pinot-controller-2 controller     at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$TypeOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:219) [pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
pinot-controller-2 controller     at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:81) [pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
pinot-controller-2 controller     at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:478) [pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
pinot-controller-2 controller     at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:400) [pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
pinot-controller-2 controller     at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:81) [pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
pinot-controller-2 controller     at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:256) [pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
pinot-controller-2 controller     at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248) [pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
pinot-controller-2 controller     at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244) [pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
pinot-controller-2 controller     at org.glassfish.jersey.internal.Errors.process(Errors.java:292) [pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
pinot-controller-2 controller     at org.glassfish.jersey.internal.Errors.process(Errors.java:274) [pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
pinot-controller-2 controller     at org.glassfish.jersey.internal.Errors.process(Errors.java:244) [pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
pinot-controller-2 controller     at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265) [pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
pinot-controller-2 controller     at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:235) [pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
pinot-controller-2 controller     at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:684) [pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
pinot-controller-2 controller     at org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer.service(GrizzlyHttpContainer.java:356) [pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
pinot-controller-2 controller     at org.glassfish.grizzly.http.server.HttpHandler$1.run(HttpHandler.java:200) [pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
pinot-controller-2 controller     at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:569) [pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
pinot-controller-2 controller     at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:549) [pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
pinot-controller-2 controller     at java.lang.Thread.run(Thread.java:829) [?:?]
```

Jackie-Jiang commented 8 months ago

Can you also check the WARN log from minion?

@snleee @swaminathanmanish Please help take a look

estebanz01 commented 8 months ago

Sure thing. All WARN output goes to the same place, right? Or is there another specific location I can look into?

estebanz01 commented 8 months ago

here's the full log of a fresh minion pod:

minion pod log

```java
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/pinot/lib/pinot-all-1.0.0-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/pinot/plugins/pinot-environment/pinot-azure/pinot-azure-1.0.0-shaded.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/pinot/plugins/pinot-file-system/pinot-s3/pinot-s3-1.0.0-shaded.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/pinot/plugins/pinot-input-format/pinot-clp-log/pinot-clp-log-1.0.0-shaded.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/pinot/plugins/pinot-input-format/pinot-orc/pinot-orc-1.0.0-shaded.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/pinot/plugins/pinot-input-format/pinot-parquet/pinot-parquet-1.0.0-shaded.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/pinot/plugins/pinot-metrics/pinot-dropwizard/pinot-dropwizard-1.0.0-shaded.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/pinot/plugins/pinot-metrics/pinot-yammer/pinot-yammer-1.0.0-shaded.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/pinot/plugins/pinot-stream-ingestion/pinot-pulsar/pinot-pulsar-1.0.0-shaded.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
ERROR StatusLogger Reconfiguration failed: No configuration found for 'Default' at 'null' in 'null'
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.codehaus.groovy.reflection.CachedClass (file:/opt/pinot/lib/pinot-all-1.0.0-jar-with-dependencies.jar) to method java.lang.Object.finalize()
WARNING: Please consider reporting this to the maintainers of org.codehaus.groovy.reflection.CachedClass
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Mar 06, 2024 1:24:34 PM org.glassfish.grizzly.http.server.NetworkListener start
INFO: Started listener bound to [0.0.0.0:9514]
Mar 06, 2024 1:24:34 PM org.glassfish.grizzly.http.server.HttpServer start
INFO: [HttpServer] Started.
13:25:18.815 [TaskStateModelFactory-task_thread-0] ERROR org.apache.pinot.minion.taskfactory.TaskFactoryRegistry - Caught exception while executing task: Task_RealtimeToOfflineSegmentsTask_ddf2fd57-ed8f-4ee8-8c04-1e21137ed566_1709731500049_0
org.apache.pinot.spi.utils.retry.AttemptsExceededException: Operation failed after 1 attempts
    at org.apache.pinot.spi.utils.retry.BaseRetryPolicy.attempt(BaseRetryPolicy.java:65) ~[pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
    at org.apache.pinot.common.utils.fetcher.HttpSegmentFetcher.fetchSegmentToLocal(HttpSegmentFetcher.java:62) ~[pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
    at org.apache.pinot.common.utils.fetcher.SegmentFetcherFactory.fetchSegmentToLocalInternal(SegmentFetcherFactory.java:158) ~[pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
    at org.apache.pinot.common.utils.fetcher.SegmentFetcherFactory.fetchSegmentToLocal(SegmentFetcherFactory.java:152) ~[pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
    at org.apache.pinot.common.utils.fetcher.SegmentFetcherFactory.fetchAndDecryptSegmentToLocalInternal(SegmentFetcherFactory.java:202) ~[pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
    at org.apache.pinot.common.utils.fetcher.SegmentFetcherFactory.fetchAndDecryptSegmentToLocal(SegmentFetcherFactory.java:190) ~[pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
    at org.apache.pinot.plugin.minion.tasks.BaseMultipleSegmentsConversionExecutor.executeTask(BaseMultipleSegmentsConversionExecutor.java:201) ~[pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
    at org.apache.pinot.plugin.minion.tasks.BaseMultipleSegmentsConversionExecutor.executeTask(BaseMultipleSegmentsConversionExecutor.java:77) ~[pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
    at org.apache.pinot.minion.taskfactory.TaskFactoryRegistry$1.runInternal(TaskFactoryRegistry.java:157) [pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
    at org.apache.pinot.minion.taskfactory.TaskFactoryRegistry$1.run(TaskFactoryRegistry.java:118) [pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
    at org.apache.helix.task.TaskRunner.run(TaskRunner.java:75) [pinot-all-1.0.0-jar-with-dependencies.jar:1.0.0-b6bdf6c9686b286a149d2d1aea4a385ee98f3e79]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:829) [?:?]
```

and the pod description:

Minion pod description

```bash
Name:             pinot-minion-0
Namespace:        pinot
Priority:         0
Service Account:  pinot
Node:             /
Start Time:       Wed, 06 Mar 2024 08:23:42 -0500
Labels:           app=pinot
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/version=0.2.7
                  component=minion
                  controller-revision-hash=pinot-minion-84bbfbc6f4
                  helm.sh/chart=pinot-0.2.7
                  heritage=Helm
                  release=pinot
                  statefulset.kubernetes.io/pod-name=pinot-minion-0
Annotations:      kubectl.kubernetes.io/restartedAt: 2024-03-06T15:16:59+00:00
                  kubernetes.io/psp: eks.privileged
Status:           Running
IP:
IPs:
  IP:
Controlled By:  StatefulSet/pinot-minion
Containers:
  minion:
    Container ID:  containerd://e2cfae774017937ba2aa4f217d5f84a20809e4961c8920a82165bed4e290d2bf
    Image:         apachepinot/pinot:release-1.0.0
    Image ID:      docker.io/apachepinot/pinot@sha256:ef93c03cb223a30e2a0eb75452dfb2db1eab05271a59e2913845bff9814556bc
    Port:          9514/TCP
    Host Port:     0/TCP
    Args:
      StartMinion
      -clusterName
      pinot
      -zkAddress
      pinot-zookeeper:2181
      -configFileName
      /var/pinot/minion/config/pinot-minion.conf
    State:          Running
      Started:      Wed, 06 Mar 2024 08:23:49 -0500
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     400m
      memory:  1Gi
    Requests:
      cpu:      200m
      memory:   512Mi
    Liveness:   http-get http://:9514/health delay=60s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:9514/health delay=60s timeout=1s period=10s #success=1 #failure=3
    Environment Variables from:
      s3-deep-storage-user  Secret  Optional: false
    Environment:
      JAVA_OPTS:            -XX:ActiveProcessorCount=2 -XX:MaxRAMPercentage=70.0 -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xlog:gc*:file=/opt/pinot/gc-pinot-minion.log -Dlog4j2.configurationFile=/opt/pinot/etc/config/pinot-minion-log4j2.xml -Dplugins.dir=/opt/pinot/plugins
      LOG4J_CONSOLE_LEVEL:  info
    Mounts:
      /var/pinot/minion/config from config (rw)
      /var/pinot/minion/data from data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xhbhh (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-pinot-minion-0
    ReadOnly:   false
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      pinot-minion-config
    Optional:  false
  kube-api-access-xhbhh:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  15m   default-scheduler  Successfully assigned pinot/pinot-minion-0 to
  Normal  Pulled     15m   kubelet            Container image "apachepinot/pinot:release-1.0.0" already present on machine
  Normal  Created    15m   kubelet            Created container minion
  Normal  Started    15m   kubelet            Started container minion
```

estebanz01 commented 7 months ago

OK, so while reading #12458 I noticed that my S3 bucket was empty, and I found it surprising that minions need S3 to work. So I went to look at the controller configuration and found out that if I specify the property controller.data.dir twice, it merges both values instead of overriding them 🙃 Now I have data in my S3 bucket, but the controller is giving the following error:

pinot-controller-0 controller INFO: [HttpServer] Started.
pinot-controller-0 controller 17:43:02.118 [grizzly-http-server-3] ERROR org.apache.pinot.controller.util.CompletionServiceHelper - Server: null returned error: 404
pinot-controller-0 controller 17:43:02.123 [grizzly-http-server-3] ERROR org.apache.pinot.controller.util.CompletionServiceHelper - Server: null returned error: 404
pinot-controller-0 controller 17:43:02.125 [grizzly-http-server-3] ERROR org.apache.pinot.controller.util.CompletionServiceHelper - Connection error. Details: java.net.UnknownHostException: Controller_null: Name or service not known
pinot-controller-0 controller 17:56:00.608 [grizzly-http-server-3] ERROR org.apache.pinot.controller.util.CompletionServiceHelper - Server: null returned error: 404
pinot-controller-0 controller 17:56:00.610 [grizzly-http-server-3] ERROR org.apache.pinot.controller.util.CompletionServiceHelper - Server: null returned error: 404
pinot-controller-0 controller 17:56:00.611 [grizzly-http-server-3] ERROR org.apache.pinot.controller.util.CompletionServiceHelper - Connection error. Details: java.net.UnknownHostException: Controller_null: Name or service not known

here's the task information, according to the UI:

Task config:

```json
{
  "tableName": "data_counting_REALTIME",
  "configs": {
    "maxNumRecordsPerSegment": "100000",
    "mergeType": "rollup",
    "downloadURL": "http://pinot-controller:9000/segments/data_counting/data_counting__0__50__20240306T0440Z",
    "bufferTimePeriod": "2h",
    "push.mode": "TAR",
    "windowStartMs": "1709730000000",
    "segmentName": "data_counting__0__50__20240306T0440Z",
    "tableName": "data_counting_REALTIME",
    "collectorType": "rollup",
    "schedule": "0 0/5 * * * ?",
    "uploadURL": "http://pinot-controller:9000/segments",
    "push.controllerUri": "http://pinot-controller:9000",
    "__key.aggregationType": "min",
    "bucketTimePeriod": "1h",
    "windowEndMs": "1709733600000",
    "TASK_ID": "Task_RealtimeToOfflineSegmentsTask_4e81b60e-021b-4ba7-8b4c-03fd8f968d1b_1711033800254_0"
  },
  "taskId": "Task_RealtimeToOfflineSegmentsTask_4e81b60e-021b-4ba7-8b4c-03fd8f968d1b_1711033800254_0",
  "taskType": "RealtimeToOfflineSegmentsTask"
}
```
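
For anyone debugging a similar task, here is a minimal Python sketch for sanity-checking a task config like the one above. It converts `windowStartMs`/`windowEndMs` to UTC and, when run from somewhere that can resolve the controller (for example inside the minion pod), requests the `downloadURL` to see whether the segment is reachable at all. The helper names `window_utc` and `probe` are made up for this illustration and are not part of Pinot.

```python
import urllib.error
import urllib.request
from datetime import datetime, timezone


def window_utc(configs):
    """Convert the task's windowStartMs/windowEndMs (epoch milliseconds) to UTC datetimes."""
    start = datetime.fromtimestamp(int(configs["windowStartMs"]) / 1000, tz=timezone.utc)
    end = datetime.fromtimestamp(int(configs["windowEndMs"]) / 1000, tz=timezone.utc)
    return start, end


def probe(url, timeout=10.0):
    """Return the HTTP status as text, or a short description if the request cannot complete."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return str(response.status)
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (e.g. 404 or 500).
        return str(err.code)
    except (urllib.error.URLError, OSError) as err:
        # DNS failure, refused connection, timeout and similar transport problems.
        return f"unreachable: {err}"
```

Calling `window_utc(task["configs"])` on the config above gives the one-hour bucket the task is trying to move, and `probe(task["configs"]["downloadURL"])` shows whether the fetch fails because of an HTTP error from the controller or because the host cannot be reached.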

and here's the configmap that the pinot controller pods are using:

Name:         pinot-controller-config
Namespace:    pinot
Labels:       app.kubernetes.io/managed-by=Helm
Annotations:  meta.helm.sh/release-name: pinot
              meta.helm.sh/release-namespace: pinot

Data
====
pinot-controller.conf:
----
controller.helix.cluster.name=pinot
controller.port=9000
controller.vip.host=pinot-controller
controller.vip.port=9000
controller.data.dir=s3://<bucket-name>/pinot-data/pinot/controller-data
controller.zk.str=pinot-zookeeper:2181
pinot.set.instance.id.to.hostname=true
controller.task.scheduler.enabled=true
controller.local.temp.dir=/var/pinot/controller/data
pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.controller.storage.factory.s3.region=eu-west-1
pinot.controller.segment.fetcher.protocols=file,http,s3
pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
pinot.controller.storage.factory.s3.disableAcl=false

BinaryData
====

Events:  <none>

From what I understand, the controller is trying to fetch segments from either a null hostname or an invalid one. But the hosts are correct, or at least appear to be.

Any ideas on how to make it work after this progress?
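
Since the duplicated `controller.data.dir` was easy to miss, a small check over the rendered `pinot-controller.conf` can flag keys that are set more than once. This is only a rough sketch: it treats the file as plain `key=value` lines, and the function name `duplicate_keys` and the bucket name in the example are hypothetical.

```python
from collections import defaultdict


def duplicate_keys(properties_text):
    """Return {key: [values...]} for every key that appears more than once."""
    values = defaultdict(list)
    for raw in properties_text.splitlines():
        line = raw.strip()
        # Skip blanks, comment lines and anything that is not key=value.
        if not line or line.startswith(("#", "!")) or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()].append(value.strip())
    return {key: vals for key, vals in values.items() if len(vals) > 1}


if __name__ == "__main__":
    sample = """
    controller.port=9000
    controller.data.dir=/var/pinot/controller/data
    controller.data.dir=s3://example-bucket/controller-data
    """
    for key, vals in duplicate_keys(sample).items():
        print(f"{key} is set {len(vals)} times: {vals}")
```

Running it against the output of `kubectl get configmap pinot-controller-config -o yaml` (after extracting the `pinot-controller.conf` entry) would have shown the two `controller.data.dir` values that ended up merged into `file:/var/pinot/controller/data,s3:///...` in the earlier controller error.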

estebanz01 commented 7 months ago

By the way, what does pinot.set.instance.id.to.hostname=true do, and if I set it to false, how can I specify an alternative hostname?

Jackie-Jiang commented 7 months ago

You can look up the usage of CommonConstants.SET_INSTANCE_ID_TO_HOSTNAME_KEY in the code. You can use the controller.host key to specify the host name.

estebanz01 commented 7 months ago

OK, after lots of trial and error, this is what I did to have a working hybrid table with minion tasks and S3 deep storage:

Controller helm config

```yaml
controller:
  # We make sure that only this configuration is present, as duplicated configs won't override but merge.
  data:
    dir: s3:////controller-data
  # If we don't specify the host and port, a `Controller_null_9000` controller will be seen by pinot.
  host: pinot-controller
  port: 9000
  # Not sure why a `Controller_null_9000` will appear if we have `vip` enable, but oh well!
  vip:
    enable: true
    host: pinot-controller
    port: 9000
  # ...other configs
  configs: |-
    pinot.set.instance.id.to.hostname=true
    controller.task.scheduler.enabled=true
    controller.local.temp.dir=/var/pinot/controller/data # Super important! data will be here until it's offloaded to S3
    pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
    pinot.controller.storage.factory.s3.region=us-east-1
    pinot.controller.segment.fetcher.protocols=file,http,s3
    pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
    pinot.controller.storage.factory.s3.disableAcl=false
```

Minion helm config

```yaml
minion:
  # ... other configs
  extra:
    configs: |-
      pinot.set.instance.id.to.hostname=true
      pinot.minion.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
      pinot.minion.storage.factory.s3.region=us-east-1
      pinot.minion.segment.fetcher.protocols=file,http,s3
      pinot.minion.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
```

Basically, we had to configure the S3 filesystem on the controller, server and minion so the workers can upload and download data when needed. I'm not sure what this looks like with other deep storage options, but it seems that all three components must have their configs in sync, if that makes sense.
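
To make the "in sync" part concrete, here is a rough Python sketch that derives the S3-related keys for each component from the controller and minion configs above and reports which ones a given config is missing. It assumes the server uses the same keys under a `pinot.server.` prefix (the server config is not shown in this thread, so treat that as an assumption), and the helper names are hypothetical.

```python
# Key suffixes and values taken from the controller and minion configs in this thread.
S3_SUFFIXES = {
    "storage.factory.class.s3": "org.apache.pinot.plugin.filesystem.S3PinotFS",
    "segment.fetcher.protocols": "file,http,s3",
    "segment.fetcher.s3.class": "org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher",
}


def expected_s3_properties(component, region):
    """Build the expected S3 properties for 'controller', 'server' or 'minion'."""
    prefix = f"pinot.{component}."
    props = {prefix + suffix: value for suffix, value in S3_SUFFIXES.items()}
    props[prefix + "storage.factory.s3.region"] = region
    return props


def missing_s3_properties(component, region, actual):
    """Return the expected properties that are absent or different in `actual`."""
    expected = expected_s3_properties(component, region)
    return {key: value for key, value in expected.items() if actual.get(key) != value}


if __name__ == "__main__":
    minion_conf = {
        "pinot.minion.storage.factory.class.s3": "org.apache.pinot.plugin.filesystem.S3PinotFS",
        "pinot.minion.storage.factory.s3.region": "us-east-1",
        "pinot.minion.segment.fetcher.protocols": "file,http,s3",
    }
    for key, value in missing_s3_properties("minion", "us-east-1", minion_conf).items():
        print(f"missing or different: {key}={value}")
```

In this example the minion config lacks the `segment.fetcher.s3.class` entry, so that is the only line printed; an empty result for all three components is the state described above.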

Thanks for the help on this!