apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Enhance implicit partitioned replica-group assignment behavior for Pinot Upsert #12146

Open deemoliu opened 10 months ago

deemoliu commented 10 months ago

Currently, for upsert tables, the implicit partitioned replica-group assignment used by the low-level consumer does not persist the instance assignment (the mapping from partition to servers) to ZooKeeper, so newly added servers are automatically included without explicitly reassigning instances (which is usually done through a rebalance).

For example, suppose we create an upsert table with BalanceNumSegmentAssignmentStrategy (2 replicas) on a 4-node tenant. The partitions can be assigned as follows:

Partition0: server0, server1
Partition1: server2, server3
Partition2: server0, server1
Partition3: server2, server3

After adding one extra server without rebalancing the table, we started to see:

Partition0: server0, server1, newServer
Partition1: server2, server3
Partition2: server0, server1, newServer
Partition3: server2, server3

The newServer hosts some of the primary keys of partition 0, but not all of them. As a result, it fails to look up those primary keys during ingestion, which leads to duplicate keys and incorrect query results.
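To make the failure mode concrete, here is a minimal Python sketch (not Pinot code) that flags servers hosting only some of a partition's segments. The function name `find_partial_hosts`, the segment names, and the placement in `example` are hypothetical and chosen only for illustration; the actual placement Pinot produces after a tenant expansion may differ.

```python
from collections import defaultdict


def find_partial_hosts(segment_assignment):
    """Return {partition: sorted servers hosting some but not all of its segments}.

    segment_assignment maps (partition, segment_name) -> list of servers.
    This is an illustrative model, not Pinot's internal representation.
    """
    segments_per_partition = defaultdict(set)
    segments_per_host = defaultdict(set)  # keyed by (partition, server)
    for (partition, segment), servers in segment_assignment.items():
        segments_per_partition[partition].add(segment)
        for server in servers:
            segments_per_host[(partition, server)].add(segment)

    partial = defaultdict(list)
    for (partition, server), segments in segments_per_host.items():
        # A server that misses any segment of the partition cannot see
        # every primary key that belongs to that partition.
        if segments != segments_per_partition[partition]:
            partial[partition].append(server)
    return {p: sorted(servers) for p, servers in partial.items()}


# Hypothetical placement: seg_1 of partition 0 was created after newServer joined.
example = {
    (0, "seg_0"): ["server0", "server1"],
    (0, "seg_1"): ["server0", "newServer"],
    (1, "seg_0"): ["server2", "server3"],
    (1, "seg_1"): ["server2", "server3"],
}

print(find_partial_hosts(example))  # {0: ['newServer', 'server1']}
```

In this toy placement, partition 1 is healthy because both of its servers host every segment, while partition 0 has two servers that each see only part of the partition's primary keys.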

The concern with implicit partitioned replica-group assignment is that adding a new node and rebalancing the table are not atomic operations. After a tenant expansion and before the table gets rebalanced, we will see incorrect results for upsert tables.

Is there any reason or scenario that requires the current behavior of the implicit assignment? Should we change the implicit assignment behavior to match the explicit assignment?

deemoliu commented 10 months ago

cc: @yupeng9 @Jackie-Jiang @rohityadav1993 @MeihanLi @eaugene @tibrewalpratik17

Jackie-Jiang commented 10 months ago

This issue itself is addressed by #11628. But I'm +1 on only allowing explicit replica-group assignment for upsert tables.