Concurrent Offline Table Segment Uploads Can Lead to Error State

For two of our use cases we recently started seeing segments stuck in the error state. After debugging, we found the cause: uploading offline table segments concurrently across different controllers is not safe.

I won't go into the full root cause, but here are some notes:

There's an in-memory lock taken to update the IdealState in the segment upload path triggered by an upload to the POST /segments API, so concurrently uploading segments via the same controller should be fine.
The issue is more likely to be hit as concurrency or IdealState size increases.
The bad segments were caused by the segment metadata being deleted even though the servers had already started the OFFLINE ==> ONLINE transition.
Recovering from the bad state is hard: we had to delete the affected segments and re-upload them to fix the situation.
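
To illustrate the first two notes, here is a toy model (not Pinot code) of a version-checked read-modify-write with a bounded number of retries. It assumes that the IdealState update is a compare-and-set against a shared record, which is my reading of the retry exception in the controller trace further down. `VersionedStore`, `add_segment`, and the `between_read_and_write` hook are hypothetical names made up for this sketch. A lock that lives inside one controller process cannot stop a second controller from writing between the read and the write, so every such write costs a retry.

```python
class VersionedStore:
    """Toy stand-in for a shared record that carries a version number."""

    def __init__(self):
        self.value = set()
        self.version = 0

    def read(self):
        # Return a copy so callers cannot mutate the stored value directly.
        return set(self.value), self.version

    def compare_and_set(self, new_value, expected_version):
        # The write succeeds only if nobody else wrote since our read.
        if self.version != expected_version:
            return False
        self.value = set(new_value)
        self.version += 1
        return True


class AttemptsExceeded(Exception):
    pass


def add_segment(store, segment, max_attempts, between_read_and_write=None):
    """Read-modify-write with bounded retries, as one controller might do it.

    between_read_and_write is a hypothetical hook, used here only to inject
    a write from "another controller" between our read and our write.
    """
    for attempt in range(1, max_attempts + 1):
        segments, version = store.read()
        if between_read_and_write is not None:
            between_read_and_write(attempt)
        if store.compare_and_set(segments | {segment}, version):
            return attempt
    raise AttemptsExceeded(f"Operation failed after {max_attempts} attempts")


store = VersionedStore()

# No contention: the first attempt succeeds.
first = add_segment(store, "seg_a", max_attempts=20)


def one_interfering_write(attempt):
    # Another controller writes once, during our first attempt only.
    if attempt == 1:
        segments, version = store.read()
        store.compare_and_set(segments | {"seg_b"}, version)


second = add_segment(store, "seg_c", 20, one_interfering_write)


def constant_interference(attempt):
    # Another controller writes during every one of our attempts.
    segments, version = store.read()
    store.compare_and_set(segments | {f"other_{attempt}"}, version)


exhausted = False
try:
    add_segment(store, "seg_d", 20, constant_interference)
except AttemptsExceeded:
    exhausted = True
```

The sketch only covers the retry-exhaustion side of the problem. It does not model the segment metadata cleanup or the server-side state transition.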

This exception was seen in the server:

Caught exception in state transition from OFFLINE -> ONLINE for resource: <table-name>, partition: <segment-name>
java.lang.NullPointerException: null
        at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:882)
        at org.apache.pinot.server.starter.helix.HelixInstanceDataManager.addOrReplaceSegment(HelixInstanceDataManager.java:401)
        at org.apache.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeOnlineFromOffline(SegmentOnlineOfflineStateModelFactory.java:163)

And this was seen in the controller:

java.lang.RuntimeException: Caught exception while updating ideal state for resource: <table-name>
        at org.apache.pinot.common.utils.helix.HelixHelper.updateIdealState(HelixHelper.java:169)
        at org.apache.pinot.common.utils.helix.HelixHelper.updateIdealState(HelixHelper.java:193)
        at org.apache.pinot.controller.helix.core.PinotHelixResourceManager.assignTableSegment(PinotHelixResourceManager.java:2137)
        at org.apache.pinot.controller.api.upload.ZKOperator.processNewSegment(ZKOperator.java:294)
        at org.apache.pinot.controller.api.upload.ZKOperator.completeSegmentOperations(ZKOperator.java:82)
        at org.apache.pinot.controller.api.resources.PinotSegmentUploadDownloadRestletResource.uploadSegment(PinotSegmentUploadDownloadRestletResource.java:360)
        at org.apache.pinot.controller.api.resources.PinotSegmentUploadDownloadRestletResource.uploadSegmentAsJson(PinotSegmentUploadDownloadRestletResource.java:481)
        at jdk.internal.reflect.GeneratedMethodAccessor343.invoke(Unknown Source)
        ...
Caused by: org.apache.pinot.spi.utils.retry.AttemptsExceededException: Operation failed after 20 attempts
        at org.apache.pinot.spi.utils.retry.BaseRetryPolicy.attempt(BaseRetryPolicy.java:65)
        at org.apache.pinot.common.utils.helix.HelixHelper.updateIdealState(HelixHelper.java:98)

The easiest workaround is to route concurrent uploads through a single controller, or to upload segments sequentially in the offline ingestion pipeline, which is what we will be doing. I'm creating this ticket in case someone is interested in implementing a native fix.
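
For reference, the sequential variant of the workaround can be sketched roughly like this. It is a minimal sketch, not our actual pipeline code: `upload_sequentially` and the `upload_one` callable are hypothetical names, and `upload_one` stands in for whatever client the pipeline already uses to send a segment to POST /segments. The controller URL in the usage part is also a made-up placeholder.

```python
from typing import Callable, Iterable, List


def upload_sequentially(
    controller_url: str,
    segment_paths: Iterable[str],
    upload_one: Callable[[str, str], bool],
) -> List[str]:
    """Upload segments one at a time, always through the same controller.

    upload_one(controller_url, path) is supplied by the caller and should
    return True on success. Stops at the first failure so that a problem
    is noticed before more segments are pushed.
    """
    uploaded = []
    for path in segment_paths:
        if not upload_one(controller_url, path):
            raise RuntimeError(f"Upload failed for {path}; stopping here")
        uploaded.append(path)
    return uploaded


# Usage with a fake uploader that only records the calls it receives.
calls = []


def fake_upload(controller_url: str, path: str) -> bool:
    calls.append((controller_url, path))
    return True


done = upload_sequentially(
    "http://controller-1.example:9000",  # placeholder URL, an assumption
    ["seg_0.tar.gz", "seg_1.tar.gz", "seg_2.tar.gz"],
    fake_upload,
)
```

Because every call uses the same controller URL and the next upload starts only after the previous one returns, at most one IdealState update for the table is in flight from this pipeline at any time.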

apache / pinot

Concurrent Offline Table Segment Uploads Can Lead to Error State #11636