apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Error with Data size larger than 1M, will not write to zk. Data (first 1k) #7704

Open dongxiaoman opened 2 years ago

dongxiaoman commented 2 years ago

This is not critical, but it shows some problems worth investigating.

Our QA cluster is having trouble with some accumulated data. It has 10k+ real-time segments in place, although the overall data volume is not large.

In the logs we see something like the following:

2021/11/04 20:31:34.166 ERROR [ZkClient] [HelixTaskExecutor-message_handle_thread] Data size larger than 1M, will not write to zk. Data (first 1k): {
  "id" : "point_entry_REALTIME",
  "simpleFields" : {
    "BATCH_MESSAGE_MODE" : "false",
    "BUCKET_SIZE" : "0",
    "SESSION_ID" : "30069443a0581e1",
    "STATE_MODEL_DEF" : "SegmentOnlineOfflineStateModel",
    "STATE_MODEL_FACTORY_NAME" : "DEFAULT"
  },
  "mapFields" : {
    "point_entry__0__0__20211030T0056Z" : {
      "CURRENT_STATE" : "OFFLINE"
    },
    "point_entry__0__100__20211102T0746Z" : {
      "CURRENT_STATE" : "OFFLINE"
    },
    "point_entry__0__101__20211102T0817Z" : {
      "CURRENT_STATE" : "OFFLINE"
    },
    "point_entry__0__102__20211102T0909Z" : {
      "CURRENT_STATE" : "OFFLINE"
    },
    "point_entry__0__103__20211102T0946Z" : {
      "CURRENT_STATE" : "ONLINE",
      "END_TIME" : "1636056441791",
      "INFO
mcvsubbu commented 2 years ago

In the current version of Helix that we use in Pinot, data over 1M is automatically compressed by Helix before being written to ZooKeeper. I think your compressed data is exceeding this limit (Helix provides only one limit in 0.9.x; we have asked for two separate limits, and they will be provided in 1.x).

https://github.com/apache/helix/blob/master/zookeeper-api/src/main/java/org/apache/helix/zookeeper/datamodel/ZNRecord.java#L63:27
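
For illustration, here is a minimal Java sketch of that serialize-compress-check flow, assuming GZIP compression and ZooKeeper's default 1 MB znode limit. The class and method names are invented for this sketch; it is not the actual Helix implementation.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

// Illustrative sketch only, not the actual Helix code: serialize the record,
// compress it if it exceeds the limit, and refuse the ZooKeeper write if the
// compressed payload is still too large.
public class ZnRecordSizeCheck {
  private static final int SIZE_LIMIT = 1024 * 1024; // ZooKeeper's default 1 MB znode limit

  public static byte[] prepareForZkWrite(byte[] serialized) throws IOException {
    byte[] data = serialized;
    if (data.length > SIZE_LIMIT) {
      // Newer Helix versions GZIP-compress oversized records automatically.
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
        gzip.write(data);
      }
      data = out.toByteArray();
    }
    if (data.length > SIZE_LIMIT) {
      // Still over the limit even after compression; this is the case that logs
      // "Data size larger than 1M, will not write to zk. Data (first 1k)".
      throw new IOException("Data size larger than 1M, will not write to zk");
    }
    return data;
  }
}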

The best fix is to remove some segments, or to configure a larger segment size so that fewer segments are created.
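
For reference, one way to get larger (and therefore fewer) segments is through the flush thresholds in the realtime table's streamConfigs. A sketch with illustrative values (the usual stream connection properties are omitted; per the Pinot docs, setting the row threshold to 0 makes Pinot target the configured segment size instead):

"streamConfigs": {
  "realtime.segment.flush.threshold.rows": "0",
  "realtime.segment.flush.threshold.time": "24h",
  "realtime.segment.flush.threshold.segment.size": "500M"
}

Exact values depend on your ingestion rate and retention, so treat these numbers as placeholders.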

dongxiaoman commented 2 years ago

I just realized that it is probably because I have 192 partitions and a "replication": "2" setting, while I have only 4 hosts to process them. I believe the real-time servers are trying to bring all 192*2/4 segments online at the same time (in one message), which generates a very large JSON record that exceeds the 1MB size. Adding more hosts should solve the problem for me.
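
To spell out that arithmetic: 192 partitions x 2 replicas = 384 consuming segment replicas, so each of the 4 servers hosts roughly 384 / 4 = 96 segments. Each of those segments contributes an entry to the mapFields of the server's per-table CURRENT_STATE record (as in the log excerpt above), which is how a single ZNRecord can grow past the 1 MB limit.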