apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Stuck unassigned segments after metadata push while controller is shutting down #8948

Closed: dang-stripe closed this issue 1 year ago

dang-stripe commented 2 years ago

We had a few segments that were not assigned to any servers because a metadata push happened while the controller was shutting down to be replaced. The segments had not yet been added to the ideal state.

After the controller was replaced, the segments remained unassigned. To fix the issue, we deleted the segments and re-ran the segment metadata push job. We could not reset the segments because they did not yet exist in the ideal state.

Is it possible to avoid this race condition or make the segments recover themselves after startup?


[2022-06-18 07:41:14.084079] 2022/06/18 07:41:14.083 INFO [BaseControllerStarter] [Thread-1] Shutting down executor service
[2022-06-18 07:41:14.084149] 2022/06/18 07:41:14.084 INFO [BaseControllerStarter] [Thread-1] Disconnecting helix controller zk manager
...
[2022-06-18 07:41:14.088673] 2022/06/18 07:41:14.085 ERROR [PinotHelixResourceManager] [jersey-server-managed-async-executor-46] Caught exception while adding segment: table1_65_2022-06-17-18_2022-06-18-05_0 to IdealState for table: table1_OFFLINE, deleting segment ZK metadata
[2022-06-18 07:41:14.088707] java.lang.RuntimeException: Caught exception while updating ideal state for resource: table1_OFFLINE
[2022-06-18 07:41:14.088739]    at org.apache.pinot.common.utils.helix.HelixHelper.updateIdealState(HelixHelper.java:160) ~[pinot-all-0.9.0-2022-06-15-4b2ba75da-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2022-06-15-4b2ba75da-SNAPSHOT-4b2ba75da26eb821937444b097465d63686d5cc3]
[2022-06-18 07:41:14.088775]    at org.apache.pinot.common.utils.helix.HelixHelper.updateIdealState(HelixHelper.java:180) ~[pinot-all-0.9.0-2022-06-15-4b2ba75da-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2022-06-15-4b2ba75da-SNAPSHOT-4b2ba75da26eb821937444b097465d63686d5cc3]
[2022-06-18 07:41:14.088807]    at org.apache.pinot.controller.helix.core.PinotHelixResourceManager.assignTableSegment(PinotHelixResourceManager.java:1814) ~[pinot-all-0.9.0-2022-06-15-4b2ba75da-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2022-06-15-4b2ba75da-SNAPSHOT-4b2ba75da26eb821937444b097465d63686d5cc3]
[2022-06-18 07:41:14.088861]    at org.apache.pinot.controller.helix.core.PinotHelixResourceManager.addNewSegment(PinotHelixResourceManager.java:1800) ~[pinot-all-0.9.0-2022-06-15-4b2ba75da-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2022-06-15-4b2ba75da-SNAPSHOT-4b2ba75da26eb821937444b097465d63686d5cc3]
[2022-06-18 07:41:14.088895]    at org.apache.pinot.controller.api.upload.ZKOperator.processNewSegment(ZKOperator.java:243) ~[pinot-all-0.9.0-2022-06-15-4b2ba75da-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2022-06-15-4b2ba75da-SNAPSHOT-4b2ba75da26eb821937444b097465d63686d5cc3]
[2022-06-18 07:41:14.088926]    at org.apache.pinot.controller.api.upload.ZKOperator.completeSegmentOperations(ZKOperator.java:77) ~[pinot-all-0.9.0-2022-06-15-4b2ba75da-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2022-06-15-4b2ba75da-SNAPSHOT-4b2ba75da26eb821937444b097465d63686d5cc3]
[2022-06-18 07:41:14.088979]    at org.apache.pinot.controller.api.resources.PinotSegmentUploadDownloadRestletResource.completeZkOperations(PinotSegmentUploadDownloadRestletResource.java:401) ~[pinot-all-0.9.0-2022-06-15-4b2ba75da-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2022-06-15-4b2ba75da-SNAPSHOT-4b2ba75da26eb821937444b097465d63686d5cc3]
[2022-06-18 07:41:14.089031]    at org.apache.pinot.controller.api.resources.PinotSegmentUploadDownloadRestletResource.uploadSegment(PinotSegmentUploadDownloadRestletResource.java:303) ~[pinot-all-0.9.0-2022-06-15-4b2ba75da-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2022-06-15-4b2ba75da-SNAPSHOT-4b2ba75da26eb821937444b097465d63686d5cc3]
[2022-06-18 07:41:14.089081]    at org.apache.pinot.controller.api.resources.PinotSegmentUploadDownloadRestletResource.uploadSegmentAsMultiPartV2(PinotSegmentUploadDownloadRestletResource.java:533) ~[pinot-all-0.9.0-2022-06-15-4b2ba75da-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2022-06-15-4b2ba75da-SNAPSHOT-4b2ba75da26eb821937444b097465d63686d5cc3]
[2022-06-18 07:41:14.089104]    at jdk.internal.reflect.GeneratedMethodAccessor202.invoke(Unknown Source) ~[?:?]
[2022-06-18 07:41:14.089129]    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
[2022-06-18 07:41:14.089148]    at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
[2022-06-18 07:41:14.089172]    at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52) ~[pinot-all-0.9.0-2022-06-15-4b2ba75da-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2022-06-15-4b2ba75da-SNAPSHOT-4b2ba75da26eb821937444b097465d63686d5cc3]
[2022-06-18 07:41:14.089221]    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:124) ~[pinot-all-0.9.0-2022-06-15-4b2ba75da-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2022-06-15-4b2ba75da-SNAPSHOT-4b2ba75da26eb821937444b097465d63686d5cc3]
[2022-06-18 07:41:14.089272]    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:167) ~[pinot-all-0.9.0-2022-06-15-4b2ba75da-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2022-06-15-4b2ba75da-SNAPSHOT-4b2ba75da26eb821937444b097465d63686d5cc3]
[2022-06-18 07:41:14.089328]    at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$VoidOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:159) ~[pinot-all-0.9.0-2022-06-15-4b2ba75da-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2022-06-15-4b2ba75da-SNAPSHOT-4b2ba75da26eb821937444b097465d63686d5cc3]
[2022-06-18 07:41:14.089376]    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:79) ~[pinot-all-0.9.0-2022-06-15-4b2ba75da-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2022-06-15-4b2ba75da-SNAPSHOT-4b2ba75da26eb821937444b097465d63686d5cc3]
[2022-06-18 07:41:14.089425]    at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:469) ~[pinot-all-0.9.0-2022-06-15-4b2ba75da-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2022-06-15-4b2ba75da-SNAPSHOT-4b2ba75da26eb821937444b097465d63686d5cc3]
[2022-06-18 07:41:14.089472]    at org.glassfish.jersey.server.model.ResourceMethodInvoker.lambda$apply$0(ResourceMethodInvoker.java:381) ~[pinot-all-0.9.0-2022-06-15-4b2ba75da-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2022-06-15-4b2ba75da-SNAPSHOT-4b2ba75da26eb821937444b097465d63686d5cc3]
Jackie-Jiang commented 2 years ago

I don't think auto-recovery is possible because there is no guarantee that all segments were already pushed when the exception happened. Ideally, if a previous metadata push task failed, rerunning the same task should produce the same result (idempotent). Does it work if you don't manually delete the segments? If not, we should fix it.

dang-stripe commented 2 years ago

Thanks for following up. We already fixed the segments by deleting them, so we can't easily reproduce at the moment. Looking at the code, it seems like we'd hit this block: https://github.com/apache/pinot/blob/release-0.9.0/pinot-controller/src/main/java/org/apache/pinot/controller/api/upload/ZKOperator.java#L95 which tries to refresh existing segments if there were errors during the last upload.

It's a little unclear to me where the external view and ideal state get updated when the segment is refreshed. There also seems to be a check for enableParallelPushProtection, which defaults to false: https://github.com/apache/pinot/blob/release-0.9.0/pinot-controller/src/main/java/org/apache/pinot/controller/api/resources/PinotSegmentUploadDownloadRestletResource.java#L527 Is that something we'd want to enable to ensure the refresh occurs?

Jackie-Jiang commented 2 years ago

Refreshing the existing segment is expected to keep the push idempotent. When pushing a segment, if the segment does not already exist, Pinot first creates the ZK metadata and then updates the ideal state; if the segment already exists, Pinot first updates the ZK metadata and then sends a REFRESH message. Parallel push protection can be useful when multiple tasks push the same segment at the same time, but might not apply to this issue. We should check whether the metadata push task can correctly re-push all segments and reach a consistent state when a previous task failed in the middle.
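The two-step flow described here can be sketched as a plain-Java simulation (all names are hypothetical; this is a toy model, not actual Pinot code):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy simulation of the two-step upload flow described above.
public class PushFlowSketch {
    final Map<String, String> zkMetadata = new HashMap<>(); // segment -> metadata
    final Set<String> idealState = new HashSet<>();         // segments in ideal state
    final Set<String> refreshMessages = new HashSet<>();    // REFRESH messages sent

    void pushSegment(String segment, String metadata) {
        boolean exists = zkMetadata.containsKey(segment);
        zkMetadata.put(segment, metadata);   // step 1: create/update ZK metadata
        if (!exists) {
            idealState.add(segment);         // step 2a: new segment -> add to ideal state
        } else {
            refreshMessages.add(segment);    // step 2b: existing segment -> send REFRESH
        }
    }

    public static void main(String[] args) {
        PushFlowSketch sketch = new PushFlowSketch();
        sketch.pushSegment("seg_0", "v1");
        sketch.pushSegment("seg_0", "v2"); // re-push refreshes instead of re-adding
        System.out.println(sketch.idealState.contains("seg_0"));      // true
        System.out.println(sketch.refreshMessages.contains("seg_0")); // true
    }
}
```

Note the model also exposes the failure mode discussed in this issue: if step 2a is lost, a retry takes the "exists" branch and never adds the segment to the ideal state.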

dang-stripe commented 2 years ago

@Jackie-Jiang Is it possible to make the metadata push job or the API call block until the segment has been successfully added to the ideal state, to avoid this issue? The error was unclear to us because our metadata push job didn't fail (presumably the ideal state update happens asynchronously). As a stopgap, we could add checks to our offline jobs to verify that each segment made it into the ideal state.
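A stopgap check like the one suggested above could compare the set of segments the job pushed against the segments present in the table's ideal state (which could be read from the controller, e.g. via its ideal-state endpoint). This is a hypothetical sketch of the comparison only, not part of any Pinot job runner:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical post-push verification: report pushed segments that are
// missing from the ideal state (fetched elsewhere, e.g. via the controller).
public class PushVerifier {
    static Set<String> missingFromIdealState(Set<String> pushed, Set<String> idealState) {
        Set<String> missing = new HashSet<>(pushed);
        missing.removeAll(idealState); // keep only segments not yet assigned
        return missing;
    }

    public static void main(String[] args) {
        Set<String> pushed = new HashSet<>(Arrays.asList("seg_0", "seg_1"));
        Set<String> ideal = new HashSet<>(Collections.singletonList("seg_0"));
        // seg_1 was pushed but never made it into the ideal state
        System.out.println(missingFromIdealState(pushed, ideal)); // [seg_1]
    }
}
```

A job could fail (or alert) when this set is non-empty instead of trusting the upload response alone.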

Jackie-Jiang commented 2 years ago

Yes, that is the current behavior. The API only returns success once the segment's ZK metadata is created/updated and the ideal state is updated (or a refresh message is sent). Which class do you use to perform the metadata push? We might want to check whether the job itself fails when a push fails.

dang-stripe commented 2 years ago

We're using this class: org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentMetadataPushJobRunner

Based on the code, it seems like the job would correctly fail when a push fails. We no longer have the job logs from that push, but I'll update this issue if it happens again.

dang-stripe commented 2 years ago

This issue happened again, but we were able to get more logs this time. It seems like:

  1. Job sends segment upload request to the controller
  2. Controller processes the upload and adds the segment to the property store
  3. Controller starts shutting down, including its ZK connection
  4. Controller fails to update the ideal state and to delete the segment from the property store because the ZK connection is closed
  5. Upload request fails

Surprisingly, the batch job that performed the push succeeded. When we try to perform the upload again by re-running the push job, it is a no-op because the controller finds the segment already exists, but nothing updates the ideal state.

Ideally, either re-running the push would update the ideal state, or the controller would be able to repair this asynchronously so the ideal state gets updated.

[2022-10-07 01:49:04.335934] INFO [PinotSegmentUploadDownloadRestletResource:324] [jersey-server-managed-async-executor-6:224] HTTP Header: DOWNLOAD_URI is: s3://...
[2022-10-07 01:49:04.338695] INFO [PinotSegmentUploadDownloadRestletResource:246] [jersey-server-managed-async-executor-6:224] Uploading a segment table1_1_2022-09-24-21_2022-10-06-23_0 to table: table1, push type METADATA, (Derived from API parameter)
[2022-10-07 01:49:04.338807] INFO [PinotSegmentUploadDownloadRestletResource:267] [jersey-server-managed-async-executor-6:224] Processing upload request for segment: table1_1_2022-09-24-21_2022-10-06-23_0 of table: table1_OFFLINE from client: localhost, ingestion descriptor: null
[2022-10-07 01:49:04.338867] INFO [PinotSegmentUploadDownloadRestletResource:295] [jersey-server-managed-async-executor-6:224] Setting zkDownloadUri: to s3://... for segment: table1_1_2022-09-24-21_2022-10-06-23_0 of table: table1_OFFLINE, skipping move
[2022-10-07 01:49:04.340471] INFO [ZKOperator:76] [jersey-server-managed-async-executor-6:224] Adding new segment table1_1_2022-09-24-21_2022-10-06-23_0 from table table1_OFFLINE
[2022-10-07 01:49:04.340565] INFO [ZKOperator:240] [jersey-server-managed-async-executor-6:224] Skipping segment move, keeping segment table1_1_2022-09-24-21_2022-10-06-23_0 from table table1_OFFLINE at s3://...
[2022-10-07 01:49:04.646401] INFO [PinotHelixResourceManager:1799] [jersey-server-managed-async-executor-6:224] Added segment: table1_1_2022-09-24-21_2022-10-06-23_0 of table: table1_OFFLINE to property store
[2022-10-07 01:51:52.018327] ERROR [PinotHelixResourceManager:1834] [jersey-server-managed-async-executor-6:224] Caught exception while adding segment: table1_1_2022-09-24-21_2022-10-06-23_0 to IdealState for table: table1_OFFLINE, deleting segment ZK metadata
[2022-10-07 01:51:52.018343] java.lang.RuntimeException: Caught exception while updating ideal state for resource: table1_OFFLINE
[2022-10-07 01:51:52.018368]    at org.apache.pinot.common.utils.helix.HelixHelper.updateIdealState(HelixHelper.java:160) ~[pinot-all-0.9.0-2022-09-15-148e6c38e-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2022-09-15-148e6c38e-SNAPSHOT-148e6c38ec4e797319fa1b66f0f212a4c2b2f171]
[2022-10-07 01:51:52.018392]    at org.apache.pinot.common.utils.helix.HelixHelper.updateIdealState(HelixHelper.java:180) ~[pinot-all-0.9.0-2022-09-15-148e6c38e-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2022-09-15-148e6c38e-SNAPSHOT-148e6c38ec4e797319fa1b66f0f212a4c2b2f171]
[2022-10-07 01:51:52.018418]    at org.apache.pinot.controller.helix.core.PinotHelixResourceManager.assignTableSegment(PinotHelixResourceManager.java:1814) ~[pinot-all-0.9.0-2022-09-15-148e6c38e-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2022-09-15-148e6c38e-SNAPSHOT-148e6c38ec4e797319fa1b66f0f212a4c2b2f171]
...
[2022-10-07 01:51:52.025769] WARN [ZkClient:1273] [jersey-server-managed-async-executor-6:224] Failed to delete path /pinot_cluster/PROPERTYSTORE/SEGMENTS/table1_OFFLINE/table1_1_2022-09-24-21_2022-10-06-23_0! java.lang.IllegalStateException: ZkClient already closed!
[2022-10-07 01:51:52.028874] ERROR [PinotSegmentUploadDownloadRestletResource:53] [jersey-server-managed-async-executor-6:224] Caught internal server exception while uploading segment
[2022-10-07 01:51:52.028892] java.lang.IllegalStateException: ZkClient already closed!

Shutdown logs on the same host:

[2022-10-07 01:51:46.631185] INFO [PinotServiceManager:233] [Thread-1:30] Shutting down Pinot Service Manager with all running Pinot instances...
[2022-10-07 01:51:51.679919] INFO [HelixTaskExecutor:1148] [Thread-1:30] Shutting down HelixTaskExecutor
[2022-10-07 01:51:51.680457] INFO [HelixTaskExecutor:535] [Thread-1:30] Shutting down pool: java.util.concurrent.ThreadPoolExecutor@6aed4a23[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
[2022-10-07 01:51:51.680633] INFO [HelixTaskExecutor:535] [Thread-1:30] Shutting down pool: java.util.concurrent.ThreadPoolExecutor@404566e7[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
[2022-10-07 01:51:51.680777] INFO [HelixTaskExecutor:535] [Thread-1:30] Shutting down pool: java.util.concurrent.ThreadPoolExecutor@2559f65c[Running, pool size = 40, active threads = 0, queued tasks = 0, completed tasks = 56]
[2022-10-07 01:51:51.692292] INFO [HelixTaskExecutor:535] [Thread-1:30] Shutting down pool: java.util.concurrent.ThreadPoolExecutor@7d5cbcc9[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
[2022-10-07 01:51:51.821168] INFO [BaseControllerStarter:718] [Thread-1:30] Shutting down executor service
[2022-10-07 01:51:51.821300] INFO [HelixTaskExecutor:1148] [Thread-1:30] Shutting down HelixTaskExecutor
[2022-10-07 01:51:51.821982] INFO [HelixTaskExecutor:535] [Thread-1:30] Shutting down pool: java.util.concurrent.ThreadPoolExecutor@4f482051[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
[2022-10-07 01:51:51.822040] INFO [HelixTaskExecutor:535] [Thread-1:30] Shutting down pool: java.util.concurrent.ThreadPoolExecutor@294804bd[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
[2022-10-07 01:51:51.822095] INFO [HelixTaskExecutor:535] [Thread-1:30] Shutting down pool: java.util.concurrent.ThreadPoolExecutor@51573738[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
[2022-10-07 01:51:51.822145] INFO [HelixTaskExecutor:535] [Thread-1:30] Shutting down pool: java.util.concurrent.ThreadPoolExecutor@2474df51[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
[2022-10-07 01:52:04.331357] INFO [GenericHelixController:1082] [Thread-1:30] Shutting down DEFAULT pipeline
[2022-10-07 01:52:05.013192] INFO [GenericHelixController:1085] [Thread-1:30] Shutting down TASK pipeline
[2022-10-07 01:52:05.127604] INFO [BaseControllerStarter:679] [Thread-1:30] Finish shutting down Pinot controller for Controller_pinot-controller-fw-3.service.consul_9000
[2022-10-07 01:52:05.128153] INFO [PinotServiceManager:224] [Thread-1:30] Shutting down Pinot Service Manager admin application...
dang-stripe commented 2 years ago

Also, when the push job retries the request after receiving a 500, it gets a 200 because the upload request finds the segment already exists, causing the batch job to succeed.

Jackie-Jiang commented 2 years ago

This is a bug. When a segment is uploaded, we update ZK in two steps: first create the ZK metadata for the segment, then add the segment to the ideal state. If the upload fails after the first step, the next upload should still go through and try to add the segment to the ideal state if it is not there.
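A sketch of that repair: instead of branching on whether the ZK metadata already exists, the upload path checks the ideal state directly, so a retry after a partial failure still completes the second step. This is a hypothetical toy model, not the actual fix:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of the repaired upload path: decide between "assign" and
// "refresh" by looking at the ideal state itself, not at whether the
// ZK metadata already exists.
public class RetryRepairSketch {
    final Map<String, String> zkMetadata = new HashMap<>();
    final Set<String> idealState = new HashSet<>();
    final Set<String> refreshMessages = new HashSet<>();

    void pushSegment(String segment, String metadata) {
        zkMetadata.put(segment, metadata);      // step 1: create/update ZK metadata
        if (!idealState.contains(segment)) {
            idealState.add(segment);            // step 2: repairs a lost ideal-state update
        } else {
            refreshMessages.add(segment);       // segment fully exists: refresh instead
        }
    }

    public static void main(String[] args) {
        RetryRepairSketch sketch = new RetryRepairSketch();
        // Simulate the partial failure: metadata was written but the
        // ideal-state update was lost when the controller shut down.
        sketch.zkMetadata.put("seg_0", "v1");

        sketch.pushSegment("seg_0", "v1");      // retry of the push
        System.out.println(sketch.idealState.contains("seg_0")); // true
    }
}
```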

Jackie-Jiang commented 1 year ago

#9631 fixes the bug mentioned above.