tibrewalpratik17 opened this issue 6 months ago
@tibrewalpratik17: just to confirm, I think you mean that we should select a replica-group for each Kafka partition, right?
That means if any segment of Kafka partition X is not available in replica-group 0 but is available in replica-group 1, then we will use rg=1 for X. For the same query, if any segment of Kafka partition Y is not available in replica-group 1 but is available in replica-group 0, then we will use rg=0 for Y.
That would mean that a single query can span multiple replica-groups.
> just to confirm, I think you mean that we should select a replica-group for each Kafka partition, right? That means if any segment of Kafka partition X is not available in replica-group 0 but is available in replica-group 1, then we will use rg=1 for X. For the same query, if any segment of Kafka partition Y is not available in replica-group 1 but is available in replica-group 0, then we will use rg=0 for Y. That would mean that a single query can span multiple replica-groups.
Yes exactly! Rather than marking an entire instance as unavailable for all segments on it, we should mark it unavailable only for the segments that belong to the affected partition (X in your case) on that instance.
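A minimal sketch of this per-partition selection, assuming we already know which replica groups have unavailable segments for each partition (all class and method names below are hypothetical, not from the Pinot codebase):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class PartitionReplicaGroupPicker {

  /**
   * For each Kafka partition touched by the query, pick the first replica group
   * that has no unavailable segments for that partition. A single query may end
   * up spanning multiple replica groups (e.g. partition X -> rg1, partition Y -> rg0).
   */
  public static Map<Integer, Integer> pickReplicaGroups(Set<Integer> queriedPartitions,
      Map<Integer, Set<Integer>> unavailableReplicaGroupsPerPartition, int numReplicaGroups) {
    Map<Integer, Integer> partitionToReplicaGroup = new HashMap<>();
    for (int partition : queriedPartitions) {
      Set<Integer> unavailable =
          unavailableReplicaGroupsPerPartition.getOrDefault(partition, Set.of());
      for (int rg = 0; rg < numReplicaGroups; rg++) {
        if (!unavailable.contains(rg)) {
          partitionToReplicaGroup.put(partition, rg);
          break;
        }
      }
      // If no replica group has the partition fully available, the partition stays
      // unmapped and its segments would be reported as unavailable.
    }
    return partitionToReplicaGroup;
  }
}
```

With the example above, partition X would map to rg=1 and partition Y to rg=0, so a single query fans out to both replica groups.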
Great feature to add!
We need the following info to route the query:
- IS: find ONLINE/CONSUMING segments to query
- EV: actual segment states
- Segment name or ZK metadata: which partition the segment belongs to (we don't track this currently because we don't want to read ZK metadata so frequently, but we may consider caching it, similar to what we do in partition pruning)
With the above info, we should be able to calculate which partition is available on which server, then route accordingly.
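A hedged sketch of how these three pieces of information could be combined; the map shapes and helper names are assumptions for illustration, not existing Pinot classes:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class PartitionAvailabilityCalculator {

  /**
   * IS tells us which (segment, server) pairs should be ONLINE/CONSUMING, EV tells us the
   * actual state, and a cached segment -> partition mapping tells us which partition each
   * segment belongs to. The result: for each server, the set of partitions that have at
   * least one segment not actually serving there.
   */
  public static Map<String, Set<Integer>> unavailablePartitionsPerServer(
      Map<String, Map<String, String>> idealState,     // segment -> (server -> target state)
      Map<String, Map<String, String>> externalView,   // segment -> (server -> actual state)
      Map<String, Integer> segmentToPartition) {       // cached, e.g. from segment name or ZK metadata
    Map<String, Set<Integer>> result = new HashMap<>();
    for (Map.Entry<String, Map<String, String>> entry : idealState.entrySet()) {
      String segment = entry.getKey();
      Integer partition = segmentToPartition.get(segment);
      if (partition == null) {
        continue; // partition info not cached yet; caller can fall back to instance-level routing
      }
      for (Map.Entry<String, String> serverState : entry.getValue().entrySet()) {
        String server = serverState.getKey();
        String target = serverState.getValue();
        if (!"ONLINE".equals(target) && !"CONSUMING".equals(target)) {
          continue; // only consider replicas that should be queryable
        }
        String actual = externalView.getOrDefault(segment, Map.of()).get(server);
        boolean available = "ONLINE".equals(actual) || "CONSUMING".equals(actual);
        if (!available) {
          result.computeIfAbsent(server, s -> new HashSet<>()).add(partition);
        }
      }
    }
    return result;
  }
}
```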
cc @klsince
This reminded me of an improvement I tried for strictReplicaGroup (https://github.com/apache/pinot/pull/11847) but didn't finish. As you mentioned in the issue description, it's a bit too rigid to skip the instance if any one of the segments hosted on it was unavailable, as we'd often have to skip all instances and report that a huge number of segments were unavailable (which was kinda misleading).
Basically the improvement I was trying to add was to pick an instance even if it has unavailable segments, and report the status of the unavailable segments back. The key abstraction in that PR was InstanceGroup, which caches the mapping from a set of instances to the set of segments on them. With replica group assignment, the set of instances should host the same set of segments, but some instances might have unavailable segments while others might be fine. The mapping info is updated whenever IS/EV gets updated. While selecting instances, InstanceGroup is used to quickly identify an instance.
The InstanceGroup in the PR simply tracks the segments in a Set, but we can group segments further by their partitions; then we may do server selection based on Instance-Partition as proposed here.
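A rough sketch of that abstraction, extended to group segments by partition as suggested; this is an illustration only and does not reflect the actual code in that PR:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Caches the mapping from a set of instances to the segments they host, grouped by
 * partition. Rebuilt whenever IS/EV changes so that instance selection can quickly
 * look up a group instead of scanning all segments.
 */
public class InstanceGroupCache {
  // Key: the set of instances hosting the same segments; value: partitionId -> segments.
  private final Map<Set<String>, Map<Integer, Set<String>>> _groups = new HashMap<>();

  public void onAssignmentChange(Map<String, Set<String>> segmentToInstances,
      Map<String, Integer> segmentToPartition) {
    _groups.clear();
    for (Map.Entry<String, Set<String>> entry : segmentToInstances.entrySet()) {
      String segment = entry.getKey();
      int partition = segmentToPartition.getOrDefault(segment, -1);
      _groups.computeIfAbsent(entry.getValue(), k -> new HashMap<>())
          .computeIfAbsent(partition, p -> new HashSet<>())
          .add(segment);
    }
  }

  /** Segments of a given partition hosted by a given group of instances. */
  public Set<String> segmentsFor(Set<String> instances, int partition) {
    return _groups.getOrDefault(instances, Map.of()).getOrDefault(partition, Set.of());
  }
}
```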
Just some quick thoughts based on the implementation I tried earlier on. I didn't get enough time to finish that improvement, but if it makes sense and could be reused by this feature, then I can try my best to get it done. btw, feel free to comment on that PR.
> This reminded me of an improvement I tried for strictReplicaGroup (https://github.com/apache/pinot/pull/11847) but didn't finish. As you mentioned in the issue description, it's a bit too rigid to skip the instance if any one of the segments hosted on it was unavailable, as we'd often have to skip all instances and report that a huge number of segments were unavailable (which was kinda misleading).
Yes this is the exact issue we faced. Very well summarised!
> Basically the improvement I was trying to add was to pick an instance even if it has unavailable segments, and report the status of the unavailable segments back. The key abstraction in that PR was InstanceGroup, which caches the mapping from a set of instances to the set of segments on them. With replica group assignment, the set of instances should host the same set of segments, but some instances might have unavailable segments while others might be fine. The mapping info is updated whenever IS/EV gets updated. While selecting instances, InstanceGroup is used to quickly identify an instance.
Hmm, I think choosing an instance with unavailable segments is too relaxed. This approach could be perceived as data loss by customers if, for example, they are looking for a UUID and the segment containing that UUID is unavailable. Although we return a list of unavailable segments, there's no observability into whether the unavailable segment(s) contain the UUID or not, which can lead to further confusion. It would be better to route to the available segment in the other replica. What do you think? However, if we handle this at the segment level, the query fanout might increase significantly, so we could keep the granularity at the partition level. Alternatively, we could introduce another segment-level routing strategy.

Note: the goal should be to route to a single replica group wherever possible for optimal performance.
> The InstanceGroup in the PR simply tracks the segments in a Set, but we can group segments further by their partitions; then we may do server selection based on Instance-Partition as proposed here.
Can we keep the InstanceGroup mapping and the SegmentPartition mapping separate? When reassigning instances, there's no need to update the segment-to-partition information. Updating the segment-partition cache is only necessary during segment deletion or addition, so it's best to keep them separate. This approach allows us to easily reuse these mappings independently across different routing strategies.
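A sketch of what keeping the two mappings separate could look like, with each cache refreshed only by the events that actually affect it (class and method names are hypothetical):

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Segment -> partition mapping; only needs updating when segments are added or deleted. */
class SegmentPartitionCache {
  private final Map<String, Integer> _segmentToPartition = new ConcurrentHashMap<>();

  void onSegmentAdded(String segment, int partition) {
    _segmentToPartition.put(segment, partition);
  }

  void onSegmentDeleted(String segment) {
    _segmentToPartition.remove(segment);
  }

  Integer partitionOf(String segment) {
    return _segmentToPartition.get(segment);
  }
}

/** Segment -> hosting instances; refreshed on instance reassignment or IS/EV change, without touching partition info. */
class InstanceGroupMapping {
  private final Map<String, Set<String>> _segmentToInstances = new ConcurrentHashMap<>();

  void onInstancesReassigned(Map<String, Set<String>> segmentToInstances) {
    _segmentToInstances.clear();
    _segmentToInstances.putAll(segmentToInstances);
  }

  Set<String> instancesOf(String segment) {
    return _segmentToInstances.getOrDefault(segment, Set.of());
  }
}
```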
> I think choosing an instance with unavailable segments is too relaxed...
The improvement included a new query option useCompleteReplica to allow users to choose between data completeness vs. availability.

I'm +1 on supporting routing at the partition level for better data completeness when the table is partitioned. It also makes sense to keep SegmentPartition separate; then InstanceGroup can track partitionIds instead of a set of segments. But to be sure, I'd assume it's going to be a new routing policy, e.g. the instancePartitionReplicaGroup you mentioned, rather than extending the strictReplicaGroup policy, as this new routing policy would assume the table is partitioned.
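For illustration, a toy sketch of how such a query option could steer the selection behavior; the option name comes from the comment above, but everything else here (including the default) is hypothetical:

```java
import java.util.Map;

public class ReplicaSelectionMode {

  enum Mode {
    COMPLETE_REPLICA,  // skip instances with unavailable segments (favor data completeness)
    BEST_AVAILABLE     // still pick an instance and report its unavailable segments (favor availability)
  }

  /** Resolves the selection mode from per-query options; the default here is arbitrary for the sketch. */
  static Mode resolve(Map<String, String> queryOptions) {
    boolean useCompleteReplica =
        Boolean.parseBoolean(queryOptions.getOrDefault("useCompleteReplica", "false"));
    return useCompleteReplica ? Mode.COMPLETE_REPLICA : Mode.BEST_AVAILABLE;
  }
}
```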
> But to be sure, I'd assume it's going to be a new routing policy, e.g. the instancePartitionReplicaGroup you mentioned, rather than extending the strictReplicaGroup policy, as this new routing policy would assume the table is partitioned.
Yes, we would implement this as a new routing strategy.
Current scenario
All routing strategies today are instance-based. In a recent incident in our cluster, we saw a lot of query failures because 2 instances were marked as unavailable (unqueryable) and they were serving as the 2 common replicas for around 3000 segments. During queries, we marked these 3000 segments as unavailable. These 2 instances were marked as unqueryable because 2 different segments were OFFLINE on them.
Example:
There were serverA and serverB, and 3000 segments had both of their replicas on these 2 servers. Now segmentX was OFFLINE in serverA and segmentY was OFFLINE in serverB. We are using the strictReplicaGroup strategy, so all 3000 segments became unavailable as serverA and serverB were no longer treated as eligible serving candidates.
Possible prevention
Both segmentX and segmentY were from different partitions. If we had excluded serverA from serving queries only for the partition of segmentX, and likewise serverB only for the partition of segmentY, we could have prevented this incident.
Note: if each server hosts only one partition, then it does make sense to stop it from serving queries entirely.
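A small self-contained demo of the contrast, using the server and segment names from the example above (the partition ids and everything else are made up for illustration):

```java
import java.util.Map;
import java.util.Set;

public class IncidentDemo {
  public static void main(String[] args) {
    // serverA and serverB each host a replica of ~3000 segments.
    // segmentX is OFFLINE on serverA, segmentY is OFFLINE on serverB; they belong to
    // different partitions (3 and 7 are made-up ids for the demo).
    Map<String, Set<String>> offlineSegmentsPerServer = Map.of(
        "serverA", Set.of("segmentX"),
        "serverB", Set.of("segmentY"));
    Map<String, Integer> segmentToPartition = Map.of("segmentX", 3, "segmentY", 7);

    // Instance-level exclusion (today's strictReplicaGroup behaviour):
    // both servers are dropped, so all ~3000 segments become unavailable.
    Set<String> excludedServers = offlineSegmentsPerServer.keySet();
    System.out.println("Instance-level: excluded servers = " + excludedServers);

    // Partition-level exclusion (the proposal): serverA is skipped only for segmentX's
    // partition and serverB only for segmentY's; every other partition stays servable.
    offlineSegmentsPerServer.forEach((server, segments) -> segments.forEach(segment ->
        System.out.println("Partition-level: skip " + server + " only for partition "
            + segmentToPartition.get(segment))));
  }
}
```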
Proposal
What we are proposing in this issue is to have a routing strategy like instancePartitionReplicaGroup (the name can be discussed during implementation). Here, we will maintain a mapping of available instance-partitions and not just instances. If all segments of a partition are ONLINE in a given instance, then we enable it to serve queries for that instance-partition.
At present, we use the ideal state to find the segments --> instances mapping. In this scenario, we can even fall back to the instanceToPartitions info in ZK to fetch this info.
Note: this proposal only affects REALTIME tables and not OFFLINE tables.
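A minimal sketch of the proposed bookkeeping, assuming the segment -> partition mapping is already cached; all names are placeholders rather than a final design:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class InstancePartitionAvailability {

  /**
   * An instance is enabled for a partition only if every segment of that partition assigned
   * to it (per IS) is actually ONLINE/CONSUMING on it (per EV). Returns instance -> enabled partitions.
   */
  public static Map<String, Set<Integer>> computeEnabledInstancePartitions(
      Map<String, Set<String>> instanceToAssignedSegments,  // from IS (or instanceToPartitions in ZK)
      Map<String, Set<String>> instanceToServingSegments,   // from EV, ONLINE/CONSUMING only
      Map<String, Integer> segmentToPartition) {            // cached segment -> partition info
    Map<String, Set<Integer>> enabled = new HashMap<>();
    for (Map.Entry<String, Set<String>> entry : instanceToAssignedSegments.entrySet()) {
      String instance = entry.getKey();
      Set<String> serving = instanceToServingSegments.getOrDefault(instance, Set.of());
      Map<Integer, Boolean> partitionFullyServing = new HashMap<>();
      for (String segment : entry.getValue()) {
        int partition = segmentToPartition.getOrDefault(segment, -1);
        boolean segmentOk = serving.contains(segment);
        partitionFullyServing.merge(partition, segmentOk, Boolean::logicalAnd);
      }
      Set<Integer> enabledPartitions = new HashSet<>();
      partitionFullyServing.forEach((partition, fullyServing) -> {
        if (fullyServing) {
          enabledPartitions.add(partition);
        }
      });
      enabled.put(instance, enabledPartitions);
    }
    return enabled;
  }
}
```

The router could then combine this map with the replica-group assignment to pick, per partition, a replica group whose instances are all enabled for that partition.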
cc @ankitsultana @shauryachats