apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Server crashing with OOM error #13335

Open vineethvp opened 4 weeks ago

vineethvp commented 4 weeks ago

Data is ingested from Kafka into a realtime table with the config below. The table has a replication factor of 2.

Table config

"instanceAssignmentConfigMap": {
  "CONSUMING": {
    "tagPoolConfig": {
      "tag": "DefaultTenant_REALTIME"
    },
    "replicaGroupPartitionConfig": {
      "numInstances": 3
    }
  }
}

"tableIndexConfig": {
  "loadMode": "MMAP",
  "streamConfigs": {
    "streamType": "kafka",
    "stream.kafka.consumer.type": "lowlevel",
    "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
    "realtime.segment.flush.threshold.rows": "0",
    "realtime.segment.flush.threshold.time": "24h",
    "realtime.segment.flush.threshold.segment.size": "300M",
    "stream.kafka.consumer.prop.auto.offset.reset": "largest"
  }
},

After ingesting around 60M records, the server logs the errors below and then crashes.

Consumed 5722 events from (rate:63.164402/s), currentOffset=389448757, numRowsConsumedSoFar=257283, numRowsIndexedSoFar=257283
Consumed 3216 events from (rate:37.62988/s), currentOffset=376002707, numRowsConsumedSoFar=252239, numRowsIndexedSoFar=252239
Consumed 4045 events from (rate:46.583138/s), currentOffset=376940305, numRowsConsumedSoFar=262888, numRowsIndexedSoFar=262888
Consumed 2880 events from (rate:40.99819/s), currentOffset=372932263, numRowsConsumedSoFar=246467, numRowsIndexedSoFar=246467
Consumed 3625 events from (rate:46.28802/s), currentOffset=381415112, numRowsConsumedSoFar=249777, numRowsIndexedSoFar=249777
Consumed 2898 events from (rate:35.928143/s), currentOffset=374420450, numRowsConsumedSoFar=252748, numRowsIndexedSoFar=252748
Slow query: request handler processing time: 6786, send response latency: 7946, total time to handle request: 14732
Consumed 3757 events from (rate:38.51003/s), currentOffset=383381895, numRowsConsumedSoFar=254725, numRowsIndexedSoFar=254725
Client session timed out, have not heard from server in 20317ms for session id 0x2000fcc5a450005
Session 0x2000fcc5a450005 for server pinot-zookeeper/172.20.29.227:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 20317ms for session id 0x2000fcc5a450005
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1242) [pinot-all-1.2.0-SNAPSHOT-jar-with-dependencies.jar:1.2.0-SNAPSHOT-1d1d25dc0f1fc1abb73d9516414168c82b116b58]
zkclient 3, zookeeper state changed ( Disconnected )
[Consumer clientId=events_REALTIME-CENTRAL-STREAMING-NW-EVENTS-21, groupId=null] Error sending fetch request (sessionId=214284632, epoch=84) to node 1:
org.apache.pinot.shaded.org.apache.kafka.common.errors.DisconnectException: null
Opening socket connection to server pinot-zookeeper/172.20.29.227:2181.
SASL config status: Will not attempt to authenticate using SASL (unknown error)
Timed out while polling results block, numBlocksMerged: 0 (query: QueryContext{_tableName='events_REALTIME', _subquery=null, _selectExpressions=[channel_util, cpu_util, customer_id, data, data_source, device_id, device_type, event_type, fan_id, fan_status, is_stack_switch, mac_address, mem_util, member_id, noise_floor, power_supply_id, power_supply_status, radio_band, radio_type, sensor_id, sensor_temperature, sensor_temperature_trend, site_id, stack_id, status, sub_account_id, sub_site_id, timestamp, trend_value, upload_timestamp, upload_ts_millis, uptime, uptime_string], _distinct=false, _aliasList=[null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null], _filter=null, _groupByExpressions=null, _havingFilter=null, _orderByExpressions=null, _limit=10000, _offset=0, _queryOptions={timeoutMs=10000}, _expressionOverrideHints={}, _explain=false})
Consumed 500 events from (rate:7.955576/s), currentOffset=373866452, numRowsConsumedSoFar=257550, numRowsIndexedSoFar=257550
Consumed 1735 events from (rate:15.82827/s), currentOffset=376659150, numRowsConsumedSoFar=252384, numRowsIndexedSoFar=252384
Consumed 3230 events from (rate:27.967546/s), currentOffset=374117225, numRowsConsumedSoFar=248670, numRowsIndexedSoFar=248670
Socket connection established, initiating session, client: /10.3.145.54:33952, server: pinot-zookeeper/172.20.29.227:2181
Consumed 250 events from (rate:3.7271154/s), currentOffset=380505319, numRowsConsumedSoFar=237006, numRowsIndexedSoFar=237006
Exception in thread "events18520240607T0718Z" java.lang.OutOfMemoryError: Java heap space
Exception in thread "events29520240607T0719Z" java.lang.OutOfMemoryError: Java heap space
Consumed 1000 events from (rate:8.947665/s), currentOffset=383012407, numRowsConsumedSoFar=250268, numRowsIndexedSoFar=250268
Exception in thread "req-rsp-timeout-task" java.lang.OutOfMemoryError: Java heap space
    at jdk.httpserver/sun.net.httpserver.ServerImpl$ReqRspTimeoutTask.run(ServerImpl.java:1026)
    at java.base/java.util.TimerThread.mainLoop(Timer.java:556)
    at java.base/java.util.TimerThread.run(Timer.java:506)
[Consumer clientId=events_REALTIME-CENTRAL-STREAMING-NW-EVENTS-21, groupId=null] Error sending fetch request (sessionId=214284632, epoch=INITIAL) to node 1:
org.apache.pinot.shaded.org.apache.kafka.common.errors.TimeoutException: Failed to send request after 30000 ms.
Session 0x2000fcc5a450005 for server pinot-zookeeper/172.20.29.227:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read additional data from server sessionid 0x2000fcc5a450005, likely server has closed socket
    at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77) ~[pinot-all-1.2.0-SNAPSHOT-jar-with-dependencies.jar:1.2.0-SNAPSHOT-1d1d25dc0f1fc1abb73d9516414168c82b116b58]
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350) ~[pinot-all-1.2.0-SNAPSHOT-jar-with-dependencies.jar:1.2.0-SNAPSHOT-1d1d25dc0f1fc1abb73d9516414168c82b116b58]
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1274) [pinot-all-1.2.0-SNAPSHOT-jar-with-dependencies.jar:1.2.0-SNAPSHOT-1d1d25dc0f1fc1abb73d9516414168c82b116b58]
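The "Consumed ..." entries in the log above carry a per-partition ingestion rate, and that rate appears to fall sharply just before the first OutOfMemoryError. A minimal sketch for pulling those fields out of a server log so the trend is easier to see; the function name `parse_consumed` and the two-line `SAMPLE` (copied from the log above) are illustrative only, not part of Pinot:

```python
import re

# Matches the consumption progress entries as they appear in the log above.
# Whitespace after commas is flexible so wrapped entries still match.
CONSUMED = re.compile(
    r"Consumed (?P<events>\d+) events from \(rate:(?P<rate>[\d.]+)/s\),\s*"
    r"currentOffset=(?P<offset>\d+),\s*"
    r"numRowsConsumedSoFar=(?P<consumed>\d+),\s*"
    r"numRowsIndexedSoFar=(?P<indexed>\d+)"
)


def parse_consumed(log_text):
    """Return one dict per 'Consumed ...' entry found in log_text."""
    rows = []
    for m in CONSUMED.finditer(log_text):
        rows.append({
            "events": int(m["events"]),
            "rate": float(m["rate"]),
            "offset": int(m["offset"]),
            "consumed": int(m["consumed"]),
            "indexed": int(m["indexed"]),
        })
    return rows


# Two entries copied from the log: one early, one shortly before the OOM.
SAMPLE = (
    "Consumed 5722 events from (rate:63.164402/s), currentOffset=389448757, "
    "numRowsConsumedSoFar=257283, numRowsIndexedSoFar=257283\n"
    "Consumed 250 events from (rate:3.7271154/s), currentOffset=380505319, "
    "numRowsConsumedSoFar=237006, numRowsIndexedSoFar=237006\n"
)

rows = parse_consumed(SAMPLE)
for r in rows:
    print(f"rate={r['rate']:.2f}/s rows_in_segment={r['consumed']}")
```

A falling rate alongside slow queries and ZooKeeper session timeouts is consistent with long GC pauses, although the log alone does not prove that.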

Server config

resources:
  requests:
    memory: "300Mi"
    cpu: "300m"
  limits:
    memory: "4000Mi"
    cpu: "4000m"
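The error reported is `Java heap space`, so the ceiling that was hit is the JVM's maximum heap, which the post does not show; the container memory limit only bounds that heap from above. As a rough aid for reasoning about the numbers, here is a small sketch that converts the Kubernetes memory quantities in the server config to bytes. The helper `to_bytes` is hypothetical and handles only the binary suffixes `Ki`, `Mi`, and `Gi`:

```python
# Binary suffixes used by Kubernetes memory quantities (subset only).
UNITS = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def to_bytes(quantity):
    """Convert a quantity such as '300Mi' to a byte count."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: already a plain byte count


request = to_bytes("300Mi")   # memory request from the server config
limit = to_bytes("4000Mi")    # memory limit from the server config
print(f"request={request} bytes, limit={limit} bytes")
```

Whatever `-Xmx` value the server runs with has to fit under the 4000Mi limit.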

Jackie-Jiang commented 3 weeks ago

A thread dump can help capture the hotspot objects. This is also a good candidate question for the troubleshooting channel on Slack.
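One way to collect the diagnostics suggested above is the JDK's `jcmd` tool: `Thread.print` writes a thread dump, and `GC.heap_dump` writes a heap dump that shows which objects occupy the heap. The sketch below only builds the command lines; the function name `jcmd_command`, the PID `12345`, and the output path are placeholders, and it assumes `jcmd` is available inside the server container:

```python
import shlex


def jcmd_command(pid, kind="thread", heap_file="/tmp/pinot-server.hprof"):
    """Build a jcmd command line for a thread dump or a heap dump."""
    if kind == "thread":
        return ["jcmd", str(pid), "Thread.print"]
    if kind == "heap":
        return ["jcmd", str(pid), "GC.heap_dump", heap_file]
    raise ValueError(f"unknown dump kind: {kind!r}")


# 12345 is a placeholder; use the PID of the Pinot server JVM.
print(shlex.join(jcmd_command(12345)))
print(shlex.join(jcmd_command(12345, kind="heap")))
```

Because the server crashes, it may be easier to start the JVM with `-XX:+HeapDumpOnOutOfMemoryError` so that a heap dump is written automatically when the error occurs.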