I don't think it is possible to auto-recover because there is no guarantee that all segments are already pushed when the exception happens. Ideally, if a previous metadata push task failed, rerunning the same task should give you the same result (idempotent). Does it work if you don't manually delete the segments? If not, we should fix it.
Thanks for following up - we already fixed the segments by deleting them, so we can't easily reproduce at the moment. I looked at the code and it seems like we'd hit this block: https://github.com/apache/pinot/blob/release-0.9.0/pinot-controller/src/main/java/org/apache/pinot/controller/api/upload/ZKOperator.java#L95, which seems to try to refresh existing segments if there were any errors in the last upload.
It's a little unclear to me where the external view and ideal state would get updated here when the segment is refreshed. There also seems to be a check for enableParallelPushProtection, which defaults to false: https://github.com/apache/pinot/blob/release-0.9.0/pinot-controller/src/main/java/org/apache/pinot/controller/api/resources/PinotSegmentUploadDownloadRestletResource.java#L527 Is that something we'd want to enable to ensure the refresh occurs?
Refreshing the existing segment is expected to keep the idempotency. When pushing a segment, if the segment does not already exist, Pinot will create the ZK metadata first, then update the ideal state; if the segment already exists, Pinot will update the ZK metadata first, then send a REFRESH message. ParallelPushProtection can be useful when multiple tasks try to push the same segment at the same time, but it might not apply to this issue.
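For clarity, here is a minimal sketch of the branching described above. The helper names are hypothetical placeholders, not Pinot's actual ZKOperator / PinotHelixResourceManager API; the sketch only illustrates the order of the two ZK updates.

```java
// Minimal sketch of the push flow described above. The ClusterOps methods are
// hypothetical placeholders, not Pinot's real API.
public class SegmentPushFlowSketch {

  interface ClusterOps {
    boolean segmentZkMetadataExists(String table, String segment);
    void createSegmentZkMetadata(String table, String segment);
    void updateSegmentZkMetadata(String table, String segment);
    void addSegmentToIdealState(String table, String segment);
    void sendRefreshMessage(String table, String segment);
  }

  static void pushSegment(ClusterOps ops, String table, String segment) {
    if (!ops.segmentZkMetadataExists(table, segment)) {
      // New segment: create the ZK metadata first, then add it to the ideal state.
      ops.createSegmentZkMetadata(table, segment);
      ops.addSegmentToIdealState(table, segment);
    } else {
      // Existing segment: update the ZK metadata, then send a REFRESH message.
      ops.updateSegmentZkMetadata(table, segment);
      ops.sendRefreshMessage(table, segment);
    }
  }
}
```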
We should check whether the metadata push task can correctly re-push all the segments to reach a consistent state when a previous task failed in the middle.
@Jackie-Jiang Is it possible to make the metadata push job or API call block until the segment has successfully been added to the ideal state, to avoid this issue? I think the error was unclear to us since our metadata push job didn't fail (presumably updating the ideal state happens asynchronously). As a stop gap, we could add more checks to our offline jobs to verify that the segment did make it into the ideal state.
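A rough sketch of such a stop-gap check is below. The controller address, the /tables/{tableName}/idealstate endpoint, and the plain string containment check are assumptions on my part and may need adjusting for your deployment/version.

```java
// Stop-gap sketch: after a metadata push, ask the controller whether the segment
// appears in the table's ideal state. The endpoint path and response handling
// are assumptions, not a verified Pinot API contract.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class IdealStateCheck {
  public static void main(String[] args) throws Exception {
    String controller = "http://localhost:9000";  // assumed controller address
    String table = "table1_OFFLINE";              // table from the logs in this thread
    String segment = "table1_1_2022-09-24-21_2022-10-06-23_0";

    HttpClient client = HttpClient.newHttpClient();
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create(controller + "/tables/" + table + "/idealstate"))
        .GET()
        .build();
    HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

    // Crude containment check; a real job would parse the JSON and verify the
    // segment maps to at least one server in ONLINE state.
    if (response.statusCode() != 200 || !response.body().contains(segment)) {
      throw new IllegalStateException("Segment " + segment + " is missing from the ideal state of " + table);
    }
  }
}
```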
Yes, that is the current behavior. The API only returns success when the segment ZK metadata is created/updated and the ideal state is updated / the refresh message is sent. Which class do you use to perform the metadata push? We might want to check if the job itself fails when a push fails.
We're using this class: org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentMetadataPushJobRunner
Based on the code, it seems like the job would correctly fail when a push fails. We no longer have the job logs around from the push, but I'll update this issue if it happens again.
This issue happened again, but we were able to get more logs this time. It seems like the batch job that performed the push surprisingly succeeded. When we try to perform the upload again by re-running the push job, it is a no-op because the segment is found to already exist, but nothing updates the ideal state.
Ideally, either re-running the push would allow the ideal state to get updated, or the controller should be able to repair this asynchronously so the ideal state gets updated.
[2022-10-07 01:49:04.335934] INFO [PinotSegmentUploadDownloadRestletResource:324] [jersey-server-managed-async-executor-6:224] HTTP Header: DOWNLOAD_URI is: s3://...
[2022-10-07 01:49:04.338695] INFO [PinotSegmentUploadDownloadRestletResource:246] [jersey-server-managed-async-executor-6:224] Uploading a segment table1_1_2022-09-24-21_2022-10-06-23_0 to table: table1, push type METADATA, (Derived from API parameter)
[2022-10-07 01:49:04.338807] INFO [PinotSegmentUploadDownloadRestletResource:267] [jersey-server-managed-async-executor-6:224] Processing upload request for segment: table1_1_2022-09-24-21_2022-10-06-23_0 of table: table1_OFFLINE from client: localhost, ingestion descriptor: null
[2022-10-07 01:49:04.338867] INFO [PinotSegmentUploadDownloadRestletResource:295] [jersey-server-managed-async-executor-6:224] Setting zkDownloadUri: to s3://... for segment: table1_1_2022-09-24-21_2022-10-06-23_0 of table: table1_OFFLINE, skipping move
[2022-10-07 01:49:04.340471] INFO [ZKOperator:76] [jersey-server-managed-async-executor-6:224] Adding new segment table1_1_2022-09-24-21_2022-10-06-23_0 from table table1_OFFLINE
[2022-10-07 01:49:04.340565] INFO [ZKOperator:240] [jersey-server-managed-async-executor-6:224] Skipping segment move, keeping segment table1_1_2022-09-24-21_2022-10-06-23_0 from table table1_OFFLINE at s3://...
[2022-10-07 01:49:04.646401] INFO [PinotHelixResourceManager:1799] [jersey-server-managed-async-executor-6:224] Added segment: table1_1_2022-09-24-21_2022-10-06-23_0 of table: table1_OFFLINE to property store
[2022-10-07 01:51:52.018327] ERROR [PinotHelixResourceManager:1834] [jersey-server-managed-async-executor-6:224] Caught exception while adding segment: table1_1_2022-09-24-21_2022-10-06-23_0 to IdealState for table: table1_OFFLINE, deleting segment ZK metadata
[2022-10-07 01:51:52.018343] java.lang.RuntimeException: Caught exception while updating ideal state for resource: table1_OFFLINE
[2022-10-07 01:51:52.018368] at org.apache.pinot.common.utils.helix.HelixHelper.updateIdealState(HelixHelper.java:160) ~[pinot-all-0.9.0-2022-09-15-148e6c38e-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2022-09-15-148e6c38e-SNAPSHOT-148e6c38ec4e797319fa1b66f0f212a4c2b2f171]
[2022-10-07 01:51:52.018392] at org.apache.pinot.common.utils.helix.HelixHelper.updateIdealState(HelixHelper.java:180) ~[pinot-all-0.9.0-2022-09-15-148e6c38e-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2022-09-15-148e6c38e-SNAPSHOT-148e6c38ec4e797319fa1b66f0f212a4c2b2f171]
[2022-10-07 01:51:52.018418] at org.apache.pinot.controller.helix.core.PinotHelixResourceManager.assignTableSegment(PinotHelixResourceManager.java:1814) ~[pinot-all-0.9.0-2022-09-15-148e6c38e-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2022-09-15-148e6c38e-SNAPSHOT-148e6c38ec4e797319fa1b66f0f212a4c2b2f171]
(stack trace truncated; 46 lines in total)
[2022-10-07 01:51:52.025769] WARN [ZkClient:1273] [jersey-server-managed-async-executor-6:224] Failed to delete path /pinot_cluster/PROPERTYSTORE/SEGMENTS/table1_OFFLINE/table1_1_2022-09-24-21_2022-10-06-23_0! java.lang.IllegalStateException: ZkClient already closed!
[2022-10-07 01:51:52.028874] ERROR [PinotSegmentUploadDownloadRestletResource:53] [jersey-server-managed-async-executor-6:224] Caught internal server exception while uploading segment
[2022-10-07 01:51:52.028892] java.lang.IllegalStateException: ZkClient already closed!
Shutdown logs on the same host:
[2022-10-07 01:51:46.631185] INFO [PinotServiceManager:233] [Thread-1:30] Shutting down Pinot Service Manager with all running Pinot instances...
[2022-10-07 01:51:51.679919] INFO [HelixTaskExecutor:1148] [Thread-1:30] Shutting down HelixTaskExecutor
[2022-10-07 01:51:51.680457] INFO [HelixTaskExecutor:535] [Thread-1:30] Shutting down pool: java.util.concurrent.ThreadPoolExecutor@6aed4a23[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
[2022-10-07 01:51:51.680633] INFO [HelixTaskExecutor:535] [Thread-1:30] Shutting down pool: java.util.concurrent.ThreadPoolExecutor@404566e7[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
[2022-10-07 01:51:51.680777] INFO [HelixTaskExecutor:535] [Thread-1:30] Shutting down pool: java.util.concurrent.ThreadPoolExecutor@2559f65c[Running, pool size = 40, active threads = 0, queued tasks = 0, completed tasks = 56]
[2022-10-07 01:51:51.692292] INFO [HelixTaskExecutor:535] [Thread-1:30] Shutting down pool: java.util.concurrent.ThreadPoolExecutor@7d5cbcc9[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
[2022-10-07 01:51:51.821168] INFO [BaseControllerStarter:718] [Thread-1:30] Shutting down executor service
[2022-10-07 01:51:51.821300] INFO [HelixTaskExecutor:1148] [Thread-1:30] Shutting down HelixTaskExecutor
[2022-10-07 01:51:51.821982] INFO [HelixTaskExecutor:535] [Thread-1:30] Shutting down pool: java.util.concurrent.ThreadPoolExecutor@4f482051[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
[2022-10-07 01:51:51.822040] INFO [HelixTaskExecutor:535] [Thread-1:30] Shutting down pool: java.util.concurrent.ThreadPoolExecutor@294804bd[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
[2022-10-07 01:51:51.822095] INFO [HelixTaskExecutor:535] [Thread-1:30] Shutting down pool: java.util.concurrent.ThreadPoolExecutor@51573738[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
[2022-10-07 01:51:51.822145] INFO [HelixTaskExecutor:535] [Thread-1:30] Shutting down pool: java.util.concurrent.ThreadPoolExecutor@2474df51[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
[2022-10-07 01:52:04.331357] INFO [GenericHelixController:1082] [Thread-1:30] Shutting down DEFAULT pipeline
[2022-10-07 01:52:05.013192] INFO [GenericHelixController:1085] [Thread-1:30] Shutting down TASK pipeline
[2022-10-07 01:52:05.127604] INFO [BaseControllerStarter:679] [Thread-1:30] Finish shutting down Pinot controller for Controller_pinot-controller-fw-3.service.consul_9000
[2022-10-07 01:52:05.128153] INFO [PinotServiceManager:224] [Thread-1:30] Shutting down Pinot Service Manager admin application...
Also, when the push job retries the request (since it will have received a 500), it possibly gets a 200 because the upload request finds that the segment already exists, causing the batch job to succeed.
This is a bug. When a segment is uploaded, we update ZK in two steps: first create the ZK metadata for the segment, then add the segment to the ideal state. If the upload fails after the first step, the next upload should still go through and try to add the segment to the ideal state if it does not exist there.
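A sketch of what that idempotent re-upload path could look like is below. As with the earlier sketch, the helper names are hypothetical and not the actual ZKOperator code; the point is only the extra branch that completes the ideal-state assignment when the metadata already exists.

```java
// Sketch of the intended repair behavior: if the ZK metadata exists but the
// segment never made it into the ideal state, finish the assignment instead of
// treating the upload as a plain refresh. Helper names are hypothetical.
public class ReuploadRepairSketch {

  interface ClusterOps {
    boolean segmentZkMetadataExists(String table, String segment);
    boolean segmentInIdealState(String table, String segment);
    void createSegmentZkMetadata(String table, String segment);
    void updateSegmentZkMetadata(String table, String segment);
    void addSegmentToIdealState(String table, String segment);
    void sendRefreshMessage(String table, String segment);
  }

  static void pushSegment(ClusterOps ops, String table, String segment) {
    if (!ops.segmentZkMetadataExists(table, segment)) {
      // Normal new-segment path: metadata first, then ideal state.
      ops.createSegmentZkMetadata(table, segment);
      ops.addSegmentToIdealState(table, segment);
    } else if (!ops.segmentInIdealState(table, segment)) {
      // A previous upload died between the two steps: complete the assignment.
      ops.updateSegmentZkMetadata(table, segment);
      ops.addSegmentToIdealState(table, segment);
    } else {
      // Fully registered segment: regular refresh.
      ops.updateSegmentZkMetadata(table, segment);
      ops.sendRefreshMessage(table, segment);
    }
  }
}
```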
We had a few segments that were not assigned to any servers because a metadata push happened while the controller was shutting down as it was being replaced. The segments had not been added to the ideal state yet.
After the controller was replaced, the segments remained unassigned to any servers. In order to fix the issue, we deleted the segments and re-ran the segment metadata push job. We were unable to reset the segments since they did not exist in the ideal state yet.
Is it possible to avoid this race condition or make the segments recover themselves after startup?