Remove dependence of strict naming convention for realtime segments

npawar commented 3 years ago

The realtime segments are named as "tableName__partitionId__sequenceNum__timestamp". This convention makes it harder to perform operations such as table migration, ad hoc uploads, etc. In other initiatives such as pluggable streams, we have encountered issues with the presence of the timestamp in the segment name.
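For illustration, here is a minimal sketch of what parsing this convention involves. It is not Pinot's actual implementation: the class name, function name, and the example timestamp format are all hypothetical, and the only thing taken from this thread is the four-part, double-underscore-separated layout.

```python
from typing import NamedTuple

# Separator used by the realtime segment naming convention.
SEPARATOR = "__"


class RealtimeSegmentName(NamedTuple):
    """Hypothetical container for the four parts of a realtime segment name."""
    table_name: str
    partition_id: int
    sequence_num: int
    timestamp: str  # kept as an opaque string; its exact format is an assumption


def parse_realtime_segment_name(segment_name: str) -> RealtimeSegmentName:
    """Split a name of the form tableName__partitionId__sequenceNum__timestamp."""
    parts = segment_name.split(SEPARATOR)
    if len(parts) != 4:
        # Any name that does not follow the convention is rejected outright.
        raise ValueError(f"not a conventional realtime segment name: {segment_name!r}")
    table_name, partition_id, sequence_num, timestamp = parts
    return RealtimeSegmentName(table_name, int(partition_id), int(sequence_num), timestamp)


if __name__ == "__main__":
    print(parse_realtime_segment_name("myTable__3__42__20211015T0000Z"))
```

Any code path that relies on a parser like this fails for a segment whose name was chosen freely, for example one produced by an ad hoc upload or carried over in a table migration. That is the rigidity described above.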

It would be nice to get to a state where we don't depend on the segment name following such a convention. For starters, we can make this change for completed segments, and then trickle it down to consuming segments as well.

cc @mayankshriv

mayankshriv commented 3 years ago

+1, especially the getTableName() API that derives the table name from the segment name makes it very rigid. On the offline side, we have decoupled the table name from the segment metadata as well.

mcvsubbu commented 3 years ago

It may take a few releases, but yes, I am supportive of this.

I think the root is perhaps in the segment completion protocol, where the table name can be sent as an additional argument as opposed to deriving it from the segment name. Once we introduce a new protocol element, we can cut an incompatible release where the controller expects the server to send this new protocol element. I suggest checking this in soon, with the controller changed to get the table name from the protocol element if it is available, or else from the segment name. Over one or two releases, we can eliminate the latter.
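A rough sketch of the transitional controller-side lookup described above. The function and parameter names are hypothetical and do not correspond to the actual segment completion protocol code; the sketch only shows the "protocol element first, segment name as fallback" ordering.

```python
from typing import Optional

# Separator used by the current realtime segment naming convention.
SEPARATOR = "__"


def resolve_table_name(segment_name: str, table_name_param: Optional[str] = None) -> str:
    """Return the table name for a segment completion request.

    table_name_param stands in for the proposed new protocol element
    (a hypothetical name). Older servers would not send it, so it may be None.
    """
    if table_name_param:
        # New servers: trust the explicit protocol element.
        return table_name_param

    # Older servers: fall back to deriving the table name from the segment name.
    # This branch is what would be removed after one or two releases.
    table_name, separator, _rest = segment_name.partition(SEPARATOR)
    if not separator:
        raise ValueError(f"cannot derive table name from segment name: {segment_name!r}")
    return table_name


if __name__ == "__main__":
    print(resolve_table_name("myTable__3__42__20211015T0000Z"))
    print(resolve_table_name("adhoc_upload_7", table_name_param="myTable"))
```

With this ordering, a freely named segment works as soon as the server sends the table name explicitly, while requests from servers that have not been upgraded keep working through the fallback.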

Another part of the naming convention is the double underscore separator. This is a harder thing to remove, IMO. We need the sequence number and partition ID for sure. The timestamp, while optional, has come in super handy for debugging in production environments, and also when simply browsing ZooKeeper via a client.

apache / pinot

Remove dependence of strict naming convention for realtime segments #7594