apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

should SegmentDeletionManager expose the 2 public APIs? #8092

Open walterddr opened 2 years ago

walterddr commented 2 years ago

Currently SegmentDeletionManager has

  1. a deleteSegments API that allows users to delete segments from both the property store and the deep store.
  2. a removeSegmentsFromStore API that allows users to delete only from the deep store (a rough sketch of both follows the list).
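
To make the two entry points concrete, here is a rough sketch; the parameter shapes below are approximations for illustration, not the exact signatures in the Pinot codebase:

```java
import java.util.Collection;
import java.util.List;

// Rough sketch of the two entry points in question; signatures are approximations.
public class SegmentDeletionManagerSketch {

  /** Deletes segments from the property store (ZK metadata) and also removes the
   *  backing files from the deep store. */
  public void deleteSegments(String tableNameWithType, Collection<String> segmentIds) {
    // 1. remove the ZK metadata for each segment
    // 2. delete (or move to a "deleted" directory) the segment files in the deep store
  }

  /** Removes segment files from the deep store only; ZK metadata is untouched. */
  public void removeSegmentsFromStore(String tableNameWithType, List<String> segmentIds) {
    // delete the deep store files for the given segments
  }
}
```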

This creates confusion regarding failure recovery: if the component service dies in the middle of the execution, it can leave behind either deep store files whose metadata has already been removed from the property store, or property store entries whose files have already been removed from the deep store.

Which one should we go with as the failure recovery strategy? Given that we have a retention manager that periodically checks the deep store for files to delete, I think we should always delete from the property store first and then delete from the deep store. Thoughts?

Jackie-Jiang commented 2 years ago

Good point. The retention manager doesn't really check the deep store files, but scans the ZK metadata, so I think we should first delete from the deep store, then remove the ZK metadata as the last step so that the retention manager can always find the undeleted segments.
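
To make that ordering concrete, a minimal sketch (with hypothetical helpers `deleteFromDeepStore` and `removeZkMetadata`, not the actual Pinot methods): the ZK metadata removal goes last, so a crash in between still leaves a record that the retention manager can discover and retry.

```java
import java.util.List;

public class DeepStoreFirstDeletionSketch {

  /** Delete the deep store files first and remove the ZK metadata last, so that a
   *  crash mid-way still leaves ZK metadata for the retention manager to act on. */
  public void deleteSegments(String tableNameWithType, List<String> segmentIds) {
    for (String segmentId : segmentIds) {
      deleteFromDeepStore(tableNameWithType, segmentId);  // hypothetical helper
      // If the controller dies here, the ZK metadata still exists and the
      // retention manager will see the segment and re-drive the deletion.
      removeZkMetadata(tableNameWithType, segmentId);     // hypothetical helper
    }
  }

  private void deleteFromDeepStore(String tableNameWithType, String segmentId) { /* ... */ }
  private void removeZkMetadata(String tableNameWithType, String segmentId) { /* ... */ }
}
```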

walterddr commented 2 years ago

Exactly what I thought as well @Jackie-Jiang ^. Implementing a fix.

walterddr commented 2 years ago

Some caveats about our current implementation (a rough sketch follows the list):

  1. Table deletion requires a "sync" deletion of all segments on the deep store before the table config is deleted. This means that if there is any failure before the table is completely deleted, an error might have been returned and it is up to the client to handle it and reissue the deletion.
  2. Segment deletion, however, requires us to first delete the property store entries and then asynchronously delete the segment files on the deep store, since the call is supposed to return quickly.
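
A rough sketch of these two current flows, using hypothetical helper names rather than the actual Pinot methods:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CurrentDeletionFlowsSketch {
  private final ExecutorService _executor = Executors.newSingleThreadExecutor();

  /** Table deletion: deep store files are removed synchronously before the table
   *  config is deleted; any failure surfaces to the client, which must retry. */
  public void deleteTable(String tableNameWithType, List<String> segmentIds) {
    for (String segmentId : segmentIds) {
      deleteFromDeepStore(tableNameWithType, segmentId);  // hypothetical helper, blocking
    }
    deleteTableConfig(tableNameWithType);                 // hypothetical helper
  }

  /** Segment deletion: property store entries are removed first so the call returns
   *  quickly; deep store files are deleted asynchronously afterwards. */
  public void deleteSegment(String tableNameWithType, String segmentId) {
    removeZkMetadata(tableNameWithType, segmentId);       // hypothetical helper
    _executor.submit(() -> deleteFromDeepStore(tableNameWithType, segmentId));
  }

  private void deleteFromDeepStore(String table, String segment) { /* ... */ }
  private void deleteTableConfig(String table) { /* ... */ }
  private void removeZkMetadata(String table, String segment) { /* ... */ }
}
```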

Ideally speaking, what we need here is to decouple the deep store deletion from the table/segment deletion:

- For table/segment deletion: as long as the ideal state is removed from ZK, we consider it deleted.
- For deep store data: utilize the RetentionManager to do the cleanup.
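
A rough sketch of this decoupling, again with hypothetical helper names: the user-facing deletion only touches ZK, and a periodic RetentionManager-style task garbage-collects deep store files that no longer have ZK metadata.

```java
import java.util.List;
import java.util.Set;

public class DecoupledDeletionSketch {

  /** User-facing deletion: only removes the ideal state / ZK metadata and returns.
   *  The segment is considered deleted once ZK no longer references it. */
  public void deleteSegments(String tableNameWithType, List<String> segmentIds) {
    for (String segmentId : segmentIds) {
      removeZkMetadata(tableNameWithType, segmentId);  // hypothetical helper
    }
  }

  /** Periodic cleanup (RetentionManager-style): any deep store file without
   *  corresponding ZK metadata is garbage and can be removed. */
  public void cleanUpDeepStore(String tableNameWithType) {
    Set<String> liveSegments = listSegmentsInZk(tableNameWithType);          // hypothetical helper
    for (String segment : listSegmentFilesInDeepStore(tableNameWithType)) {  // hypothetical helper
      if (!liveSegments.contains(segment)) {
        deleteFromDeepStore(tableNameWithType, segment);                     // hypothetical helper
      }
    }
  }

  private void removeZkMetadata(String table, String segment) { /* ... */ }
  private Set<String> listSegmentsInZk(String table) { return Set.of(); }
  private List<String> listSegmentFilesInDeepStore(String table) { return List.of(); }
  private void deleteFromDeepStore(String table, String segment) { /* ... */ }
}
```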

However, this causes problems if a table or segment with the exact same name is re-created (for example, if one were to replace a corrupted segment with a newly ingested one). So the challenge here is to handle operations like this one, which ultimately require a "sync" deletion across ZK and the deep store.

Solution 1: have some sort of tombstone mechanism to indicate that the table/segment is marked as deleted, but do not allow the same identifier to be re-created until the cleanup has completely finished (a rough sketch follows below).
Solution 2: add versioning to tables/segments so that the version number always increments when a new table/segment is created.
Solution 3: ???
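
To illustrate solution 1, a rough sketch of a tombstone check at (re)creation time, with hypothetical helpers: re-creating a segment with the same name is rejected while a delete tombstone is still outstanding.

```java
public class TombstoneSketch {

  /** Reject re-creating a segment while a delete tombstone for the same name is
   *  still pending, i.e. the deep store cleanup has not finished yet. */
  public void addSegment(String tableNameWithType, String segmentName) {
    if (hasPendingDeleteTombstone(tableNameWithType, segmentName)) {  // hypothetical helper
      throw new IllegalStateException(
          "Segment " + segmentName + " is still being deleted; retry after cleanup completes");
    }
    createSegmentMetadata(tableNameWithType, segmentName);            // hypothetical helper
  }

  private boolean hasPendingDeleteTombstone(String table, String segment) { return false; }
  private void createSegmentMetadata(String table, String segment) { /* ... */ }
}
```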