apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0
5.4k stars 1.26k forks source link

PartitionAware Routing does not work on Realtime Upsert Table using kinesis #12308

Open abhijeetkushe opened 8 months ago

abhijeetkushe commented 8 months ago

Labels : Documentation, Troubleshooting

I have a realtime upsert table which reads data from kinesis.There are 2 partitions or 2 kinesis shards.I updated config but after I restarted broker controller and server I still see 2 servers being hit when the accountId is in the query. This was the original config.

"segmentPartitionConfig":{
  "columnPartitionMap": {
        "accountId": {
          "functionName": "Murmur",
          "numPartitions": 2
        }
      }
  }
  },  
"routing": {
  "segmentPrunerTypes": [
      "partition"
    ],
    "instanceSelectorType": "strictReplicaGroup"
  },

I tried the below config suggested on the following slack channel by @walterddr https://apache-pinot.slack.com/archives/C011C9JHN7R/p1704225307389139

  "columnPartitionMap": {
        "accountId": {
          "functionName": "Murmur",
          "numPartitions": 2
        }
      }
  }
  },
  "upsertConfig": {
    "mode": "FULL",
    "hashFunction": "MURMUR3",
    "enableSnapshot": true
  },
  "instanceAssignmentConfigMap": {
    "CONSUMING": {
      "tagPoolConfig": {
        "tag": "DefaultTenant_REALTIME"
      },
      "replicaGroupPartitionConfig": {
        "numInstances": 8,
        "replicaGroupBased": true,
        "numReplicaGroups": 4,
        "numInstancesPerReplicaGroup": 2,
        "partitionColumn": "accountId",
        "numPartitions": 2,
        "numInstancesPerPartition": 1
      }
    }
  },
  "routing": {
  "segmentPrunerTypes": [
      "partition"
    ],
    "instanceSelectorType": "strictReplicaGroup"
  }```

  When I run the query 

SELECT currencyCode,activityId, activityType, distinctcount(sendIdContactId2) as 'sendId,contactId_count', sum(totalOrderNum) as totalOrderNum_sum, sumprecision(totalOrderAmt, 19) as totalOrderAmt_sum FROM events WHERE accountId = 1140607508363 AND flowId = 'd13c223f-b3fb-4fac-8689-77b89ab08f82' AND recordType = 'attribution' AND softDelete = 'null' group by currencyCode,activityType,activityId LIMIT 0, 1000 option(timeoutMs=20000)SELECT currencyCode,activityId, activityType, distinctcount(sendIdContactId2) as 'sendId,contactId_count', sum(totalOrderNum) as totalOrderNum_sum, sumprecision(totalOrderAmt, 19) as totalOrderAmt_sum FROM events WHERE accountId = 1140607508363 AND flowId = 'd13c223f-b3fb-4fac-8689-77b89ab08f82' AND recordType = 'attribution' AND softDelete = 'null' group by currencyCode,activityType,activityId LIMIT 0, 1000 option(timeoutMs=20000)


I still see 2 servers queried  

![Screenshot 2024-01-22 at 3 42 40 PM](https://github.com/apache/pinot/assets/2093096/3d093b81-a698-472d-a574-20afcad6c44a)
Jackie-Jiang commented 7 months ago

Can you check if the upstream is partitioned properly? You may read the segment ZK metadata through the rest API and check if each segment only contain data from a single partition