Apache IoTDB
https://iotdb.apache.org/
Apache License 2.0

write rejected #11375

Open LangYuanzh opened 1 year ago

LangYuanzh commented 1 year ago

version 1.2.0

The write is rejected when inserting data via the Java interface:

```
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
org.apache.iotdb.rpc.StatementExecutionException: 606: The write is rejected because the wal directory size has reached the threshold 53687091200 bytes. You may need to adjust the flush policy of the storage engine or the IoTConsen…
at org.apache.iotdb.session.Session.insertByGroup(Session.java:3254)
at org.apache.iotdb.session.Session.insertRecordsWithLeaderCache(Session.java:2178)
at org.apache.iotdb.session.Session.insertAlignedRecords(Session.java:1671)
at org.apache.seatunnel.connectors.seatunnel.iotdb.sink.IoTDBSinkClient.insertAlignedRecords(IoTDBSinkClient.java:217)
at org.apache.seatunnel.connectors.seatunnel.iotdb.sink.IoTDBSinkClient.flush(IoTDBSinkClient.java:160)
```

Which parameter can I change to avoid this?
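While tuning, one client-side mitigation is to back off and retry when the server returns the 606 rejection, giving the WAL deleter time to catch up. A minimal Java sketch, assuming the same Session API seen in the trace above; `BackoffWriter` and `Insert` are hypothetical names, the message-prefix check for status 606 is a heuristic, and the retry limits are illustrative:

```java
import java.util.concurrent.TimeUnit;
import org.apache.iotdb.rpc.IoTDBConnectionException;
import org.apache.iotdb.rpc.StatementExecutionException;

public final class BackoffWriter {

  /** A single insert attempt, e.g. a wrapped session.insertAlignedRecords(...) call. */
  @FunctionalInterface
  interface Insert {
    void run() throws IoTDBConnectionException, StatementExecutionException;
  }

  /** Retries with exponential backoff while the server reports the 606 WAL throttle. */
  static void insertWithBackoff(Insert insert)
      throws IoTDBConnectionException, StatementExecutionException, InterruptedException {
    long backoffMs = 500; // illustrative starting delay
    for (int attempt = 1; ; attempt++) {
      try {
        insert.run();
        return;
      } catch (StatementExecutionException e) {
        // Heuristic: the rejection message above starts with its status code "606".
        boolean walThrottled = e.getMessage() != null && e.getMessage().startsWith("606");
        if (!walThrottled || attempt >= 5) {
          throw e; // a different error, or out of retries
        }
        TimeUnit.MILLISECONDS.sleep(backoffMs);
        backoffMs *= 2;
      }
    }
  }
}
```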

jixuan1989 commented 1 year ago

Changing parameters can relieve the problem but cannot avoid it. A known issue may cause this; it will be fixed in the next version.
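For reference, the threshold named in the rejection message is configurable on the DataNode side. A hedged sketch (in 1.2.x this parameter is expected to live in iotdb-common.properties, which should be verified against your distribution; the value is illustrative, and raising it only buys disk headroom rather than fixing the underlying pile-up):

```properties
# Assumption: raise the IoTConsensus WAL throttle above the 53687091200 bytes
# (50 GiB) seen in the rejection message; 107374182400 bytes (100 GiB) is an
# illustrative value. Make sure the WAL disk actually has that much free space.
iot_consensus_throttle_threshold_in_byte=107374182400
```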

LangYuanzh commented 1 year ago

> Changing parameters can relieve the problem but cannot avoid it. A known issue may cause this; it will be fixed in the next version.

Which parameter can relieve it?

LangYuanzh commented 1 year ago

IoTDB is deployed as a cluster with 3 DataNodes: schema replication factor 3, data replication factor 2.

After a few hours of writing, only the WAL dir of DataNode 02 has reached 51 GB, while the WAL dirs of DataNodes 01 and 03 are only 2.3 GB and 2.6 GB. The following log appears on DataNode 02:

```
2023-10-30 10:40:40,360 [pool-6-IoTDB-WAL-Delete-1] WARN o.a.i.db.wal.WALManager:180 - WAL disk usage 53712680240 is larger than the iot_consensus_throttle_threshold_in_byte 42949672960, please check your write load, iot consensus and the pipe module. It's better to allocate more disk for WAL.
```

Why is the WAL data so unbalanced?

wanghui42 commented 1 year ago

Are the configuration and hardware of the three nodes identical? Is the data evenly distributed (what is the write frequency of each device)? Are there any polling operations? Are any other features in use (pipe? views?)? You could capture a flame graph on each of the three nodes to see what the thread pools are busy with.

LangYuanzh commented 1 year ago

> Are the configuration and hardware of the three nodes identical? Is the data evenly distributed (what is the write frequency of each device)? Are there any polling operations? Are any other features in use (pipe? views?)? You could capture a flame graph on each of the three nodes to see what the thread pools are busy with.

The hardware of the three nodes is identical, 8C16G each. There is no polling and no special features such as views. Writes and queries are currently stopped. The WAL files on nodes 1 and 3 are cleaned up normally, but the WAL logs on node 2 are never cleaned.

I modified the iot_consensus_throttle_threshold_in_byte parameter in the configuration file, but according to the logs it did not take effect. In addition, I see some INFO logs about insufficient compaction memory; I am not sure whether they matter.

```
2023-10-31 04:45:47,615 [pool-33-IoTDB-Compaction-Worker-5] INFO o.a.i.d.e.c.e.t.CrossSpaceCompactionTask:387 - No enough memory for current compaction task root.pre_trc-26-2817 task seq files are [file is /data1/iotdb/apache-iotdb-1.2.0-all-bin/data/datanode/data/sequence/root.pre_trc/26/2817/1697441628722-1-2-1.tsfile, status: COMPACTION_CANDIDATE, file is /data1/iotdb/apache-iotdb-1.2.0-all-bin/data/datanode/data/sequence/root.pre_trc/26/2817/1697442224603-63-5-0.tsfile, status: COMPACTION_CANDIDATE] , unseq files are [file is /data1/iotdb/apache-iotdb-1.2.0-all-bin/data/datanode/data/unsequence/root.pre_trc/26/2817/1697934461705-1686-9-0.tsfile, status: COMPACTION_CANDIDATE]
org.apache.iotdb.db.engine.compaction.execute.exception.CompactionMemoryNotEnoughException: Required memory cost 7444181649 bytes is greater than the total memory budget for compaction 3514064151 bytes
    at org.apache.iotdb.db.rescon.SystemInfo.addCompactionMemoryCost(SystemInfo.java:223)
    at org.apache.iotdb.db.engine.compaction.execute.task.CrossSpaceCompactionTask.checkValidAndSetMerging(CrossSpaceCompactionTask.java:380)
    at org.apache.iotdb.db.engine.compaction.schedule.CompactionWorker.run(CompactionWorker.java:58)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)
2023-10-31 04:45:50,201 [AsyncDataNodeIoTConsensusServiceClientPool-selector-94] INFO o.a.i.c.i.c.AsyncIoTConsensusServiceClient:112 - Unexpected exception occurs in AsyncConfigNodeIServiceClient{TEndPoint(ip:iotdb-aesc-trc0001, port:10760)} :
java.lang.IllegalStateException: Client has an error!
    at org.apache.thrift.async.TAsyncClient.checkReady(TAsyncClient.java:83)
    at org.apache.iotdb.consensus.iot.client.AsyncIoTConsensusServiceClient.isReady(AsyncIoTConsensusServiceClient.java:109)
    at org.apache.iotdb.consensus.iot.client.AsyncIoTConsensusServiceClient$Factory.validateObject(AsyncIoTConsensusServiceClient.java:152)
    at org.apache.iotdb.consensus.iot.client.AsyncIoTConsensusServiceClient$Factory.validateObject(AsyncIoTConsensusServiceClient.java:122)
    at org.apache.commons.pool2.impl.GenericKeyedObjectPool.returnObject(GenericKeyedObjectPool.java:1470)
    at org.apache.iotdb.commons.client.ClientManager.lambda$returnClient$0(ClientManager.java:70)
    at java.base/java.util.Optional.ifPresent(Optional.java:178)
    at org.apache.iotdb.commons.client.ClientManager.returnClient(ClientManager.java:67)
    at org.apache.iotdb.consensus.iot.client.AsyncIoTConsensusServiceClient.returnSelf(AsyncIoTConsensusServiceClient.java:99)
    at org.apache.iotdb.consensus.iot.client.AsyncIoTConsensusServiceClient.onError(AsyncIoTConsensusServiceClient.java:74)
    at org.apache.thrift.async.TAsyncMethodCall.onError(TAsyncMethodCall.java:215)
    at org.apache.thrift.async.TAsyncMethodCall.transition(TAsyncMethodCall.java:210)
    at org.apache.thrift.async.TAsyncClientManager$SelectThread.transitionMethods(TAsyncClientManager.java:143)
    at org.apache.thrift.async.TAsyncClientManager$SelectThread.run(TAsyncClientManager.java:113)
Caused by: org.apache.thrift.transport.TTransportException: Read call frame size failed
    at org.apache.thrift.async.TAsyncMethodCall.doReadingResponseSize(TAsyncMethodCall.java:246)
    at org.apache.thrift.async.TAsyncMethodCall.transition(TAsyncMethodCall.java:198)
    ... 2 common frames omitted
2023-10-31 04:45:50,201 [AsyncDataNodeIoTConsensusServiceClientPool-selector-94] WARN o.a.i.c.i.c.DispatchLogHandler:81 - Can not send Batch{startIndex=3549, endIndex=3549, size=1, serializedSize=214780866} to peer for Peer{groupId=DataRegion[26], endpoint=TEndPoint(ip:iotdb-aesc-trc0001, port:10760), nodeId=1} times 333 because {}
org.apache.thrift.transport.TTransportException: Read call frame size failed
    at org.apache.thrift.async.TAsyncMethodCall.doReadingResponseSize(TAsyncMethodCall.java:246)
    at org.apache.thrift.async.TAsyncMethodCall.transition(TAsyncMethodCall.java:198)
    at org.apache.thrift.async.TAsyncClientManager$SelectThread.transitionMethods(TAsyncClientManager.java:143)
    at org.apache.thrift.async.TAsyncClientManager$SelectThread.run(TAsyncClientManager.java:113)
2023-10-31 04:45:58,225 [pool-6-IoTDB-WAL-Delete-1] WARN o.a.i.db.wal.WALManager:180 - WAL disk usage 53712680440 is larger than the iot_consensus_throttle_threshold_in_byte 50949672960, please check your write load, iot consensus and the pipe module. It's better to allocate more disk for WAL.
2023-10-31 04:46:00,061 [AsyncDataNodeIoTConsensusServiceClientPool-selector-94] INFO o.a.i.c.i.c.AsyncIoTConsensusServiceClient:112 - Unexpected exception occurs in AsyncConfigNodeIServiceClient{TEndPoint(ip:iotdb-aesc-trc0001, port:10760)} :
java.lang.IllegalStateException: Client has an error!
    (stack trace identical to the 04:45:50,201 entry above)
2023-10-31 04:46:00,062 [AsyncDataNodeIoTConsensusServiceClientPool-selector-94] WARN o.a.i.c.i.c.DispatchLogHandler:81 - Can not send Batch{startIndex=3550, endIndex=3550, size=1, serializedSize=214781055} to peer for Peer{groupId=DataRegion[26], endpoint=TEndPoint(ip:iotdb-aesc-trc0001, port:10760), nodeId=1} times 324 because {}
org.apache.thrift.transport.TTransportException: Read call frame size failed
    (stack trace identical to the 04:45:50,201 entry above)
2023-10-31 04:46:18,229 [pool-6-IoTDB-WAL-Delete-1] WARN o.a.i.db.wal.WALManager:180 - WAL disk usage 53712680440 is larger than the iot_consensus_throttle_threshold_in_byte 50949672960, please check your write load, iot consensus and the pipe module. It's better to allocate more disk for WAL.
```

OneSizeFitsQuorum commented 1 year ago

Hi, I noticed that you are writing requests of more than 200 MB, which is far too large. We suggest reducing the batch size to reduce the memory pressure on the server.

In addition, in version 1.2.0, IoTConsensus was unable to synchronize requests larger than 100 MB, which could cause an individual node to pile up WAL logs of up to 50 GB after receiving such a request. This issue was fixed in version 1.2.2.

It is recommended to upgrade to 1.2.2, reduce the batch size appropriately, and try again.
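To act on the batch-size advice, here is a minimal Java sketch that splits one large aligned-records request into several smaller RPCs via the same Session.insertAlignedRecords overload seen in the stack trace; `ChunkedInsert` and `MAX_ROWS_PER_REQUEST` are illustrative, and the chunk size is an assumption to be tuned so each serialized request stays well below 100 MB:

```java
import java.util.List;
import org.apache.iotdb.rpc.IoTDBConnectionException;
import org.apache.iotdb.rpc.StatementExecutionException;
import org.apache.iotdb.session.Session;
import org.apache.iotdb.tsfile.file.metadata.enums.TSDataType;

public final class ChunkedInsert {
  // Illustrative cap; tune so each serialized request stays well under ~100 MB.
  private static final int MAX_ROWS_PER_REQUEST = 5_000;

  static void insertInChunks(
      Session session,
      List<String> deviceIds,
      List<Long> times,
      List<List<String>> measurements,
      List<List<TSDataType>> types,
      List<List<Object>> values)
      throws IoTDBConnectionException, StatementExecutionException {
    for (int from = 0; from < deviceIds.size(); from += MAX_ROWS_PER_REQUEST) {
      int to = Math.min(from + MAX_ROWS_PER_REQUEST, deviceIds.size());
      // subList views avoid copying; each call becomes one smaller RPC.
      session.insertAlignedRecords(
          deviceIds.subList(from, to),
          times.subList(from, to),
          measurements.subList(from, to),
          types.subList(from, to),
          values.subList(from, to));
    }
  }
}
```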

LangYuanzh commented 1 year ago

> Hi, I noticed that you are writing requests of more than 200 MB, which is far too large. We suggest reducing the batch size to reduce the memory pressure on the server.
>
> In addition, in version 1.2.0, IoTConsensus was unable to synchronize requests larger than 100 MB, which could cause an individual node to pile up WAL logs of up to 50 GB after receiving such a request. This issue was fixed in version 1.2.2.
>
> It is recommended to upgrade to 1.2.2, reduce the batch size appropriately, and try again.

Thanks for your advice. I have upgraded to 1.2.2 and the WAL problem seems solved. But I still see some INFO logs about compaction memory, in both 1.2.0 and 1.2.2; I changed the memory ratio of the storage engine, but they do not seem avoidable. Does this have any effect?

```
2023-11-01 06:09:29,157 [pool-44-IoTDB-Compaction-Worker-8] INFO o.a.i.d.s.d.c.e.t.CrossSpaceCompactionTask:397 - No enough memory for current compaction task root.pre_trc-27-2922 task seq files are [file is /data1/iotdb/apache-iotdb-1.2.2-all-bin/data/datanode/data/sequence/root.pre_trc/27/2922/1697717961585-1-2-4.tsfile, status: COMPACTION_CANDIDATE] , unseq files are [file is /data1/iotdb/apache-iotdb-1.2.2-all-bin/data/datanode/data/unsequence/root.pre_trc/27/2922/1698751814360-23-1-0.tsfile, status: COMPACTION_CANDIDATE]
org.apache.iotdb.db.storageengine.dataregion.compaction.execute.exception.CompactionMemoryNotEnoughException: Required memory cost 1115215382 bytes is greater than the total memory budget for compaction 823216046 bytes
    at org.apache.iotdb.db.storageengine.rescon.memory.SystemInfo.addCompactionMemoryCost(SystemInfo.java:223)
    at org.apache.iotdb.db.storageengine.dataregion.compaction.execute.task.CrossSpaceCompactionTask.checkValidAndSetMerging(CrossSpaceCompactionTask.java:389)
    at org.apache.iotdb.db.storageengine.dataregion.compaction.schedule.CompactionWorker.run(CompactionWorker.java:59)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
```

shuwenwei commented 1 year ago

> It is recommended to upgrade to 1.2.2, reduce the batch size appropriately, and try again.

> Thanks for your advice. I have upgraded to 1.2.2 and the WAL problem seems solved. But I still see some INFO logs about compaction memory […]

Do you use aligned series for storage?

LangYuanzh commented 1 year ago

> Do you use aligned series for storage?

Yes, all time series are aligned.

shuwenwei commented 1 year ago

> Yes, all time series are aligned.

Currently, compaction of aligned series may require more memory. You can continue to adjust the global memory allocation ratio and the StorageEngine memory allocation ratio, or simply ignore this log if other compaction tasks run successfully:

```properties
# Memory allocation ratio: StorageEngine, QueryEngine, SchemaEngine, Consensus, StreamingEngine and Free Memory.
# The parameter form is a:b:c:d:e:f, where a, b, c, d, e and f are integers. For example: 1:1:1:1:1:1, 6:2:1:1:1:1
# If you have a high write load and a low read load, adjust it to, for example, 6:1:1:1:1:1
# datanode_memory_proportion=3:3:1:1:1:1

# Memory allocation ratio in StorageEngine: Write, Compaction
# The parameter form is a:b, where a and b are integers. For example: 8:2, 7:3
# storage_engine_memory_proportion=8:2
```
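If compaction keeps exceeding its budget, one illustrative adjustment under these two knobs (the values are assumptions to validate against your own write and query load, not recommended defaults) shifts more DataNode memory to the StorageEngine and more StorageEngine memory to compaction:

```properties
# Assumed example: 4 parts StorageEngine, 2 QueryEngine, 1 each for the rest...
datanode_memory_proportion=4:2:1:1:1:1
# ...and within StorageEngine, 7 parts write, 3 parts compaction.
storage_engine_memory_proportion=7:3
```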