apache / doris

Apache Doris is an easy-to-use, high performance and unified analytics database.
https://doris.apache.org
Apache License 2.0

[Bug] failed to submit compaction task for tablet: #9254

Open foobarrer opened 2 years ago

foobarrer commented 2 years ago

Total data: 60 million rows. I use Spark to read Hive data and then write it to Doris with spark-doris-connector-2.3_2.11, but after loading about 20 million rows the Spark job dies with the log below:
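For context, the write path described above looks roughly like the configuration sketch below. The option names follow the spark-doris-connector documentation but may differ across connector versions; the source table name, FE address, and credentials are placeholders (only `dev.roadmatch1` comes from the error URL).

```scala
// Configuration sketch of the Hive -> Doris write path; requires a live
// Spark + Doris cluster, so this is illustrative, not a tested program.
import org.apache.spark.sql.SparkSession

object HiveToDoris {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-to-doris")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical Hive source table
    val df = spark.sql("SELECT * FROM dev.roadmatch_source")

    df.write
      .format("doris")
      .option("doris.fenodes", "fe_host:8030")            // FE HTTP address (placeholder)
      .option("doris.table.identifier", "dev.roadmatch1") // target db.table from the error URL
      .option("user", "root")                             // placeholder credentials
      .option("password", "")
      .save()

    spark.stop()
  }
}
```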

2022-04-28 14:29:07 ERROR Executor:91 - Exception in task 2.0 in stage 0.0 (TID 2)
java.io.IOException: Failed to load data on BE: http://192.168.135.6:18040/api/dev/roadmatch1/_stream_load? node and exceeded the max retry times.
        at org.apache.doris.spark.sql.DorisSourceProvider$$anonfun$createRelation$1$$anonfun$org$apache$doris$spark$sql$DorisSourceProvider$$anonfun$$flush$1$1.apply$mcV$sp(DorisSourceProvider.scala:118)
        at scala.util.control.Breaks.breakable(Breaks.scala:38)
        at org.apache.doris.spark.sql.DorisSourceProvider$$anonfun$createRelation$1.org$apache$doris$spark$sql$DorisSourceProvider$$anonfun$$flush$1(DorisSourceProvider.scala:92)
        at org.apache.doris.spark.sql.DorisSourceProvider$$anonfun$createRelation$1$$anonfun$apply$2.apply(DorisSourceProvider.scala:78)
        at org.apache.doris.spark.sql.DorisSourceProvider$$anonfun$createRelation$1$$anonfun$apply$2.apply(DorisSourceProvider.scala:70)
        at scala.collection.Iterator$class.foreach(Iterator.scala:891)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
        at org.apache.doris.spark.sql.DorisSourceProvider$$anonfun$createRelation$1.apply(DorisSourceProvider.scala:70)
        at org.apache.doris.spark.sql.DorisSourceProvider$$anonfun$createRelation$1.apply(DorisSourceProvider.scala:68)
        at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
        at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:121)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
2022-04-28 14:29:06 ERROR Executor:91 - Exception in task 1.0 in stage 0.0 (TID 1)
java.io.IOException: Failed to load data on BE: http://192.168.135.5:18040/api/dev/roadmatch1/_stream_load? node and exceeded the max retry times.
        at org.apache.doris.spark.sql.DorisSourceProvider$$anonfun$createRelation$1$$anonfun$org$apache$doris$spark$sql$DorisSourceProvider$$anonfun$$flush$1$1.apply$mcV$sp(DorisSourceProvider.scala:118)
        at scala.util.control.Breaks.breakable(Breaks.scala:38)
        at org.apache.doris.spark.sql.DorisSourceProvider$$anonfun$createRelation$1.org$apache$doris$spark$sql$DorisSourceProvider$$anonfun$$flush$1(DorisSourceProvider.scala:92)
        at org.apache.doris.spark.sql.DorisSourceProvider$$anonfun$createRelation$1$$anonfun$apply$2.apply(DorisSourceProvider.scala:78)
        at org.apache.doris.spark.sql.DorisSourceProvider$$anonfun$createRelation$1$$anonfun$apply$2.apply(DorisSourceProvider.scala:70)
        at scala.collection.Iterator$class.foreach(Iterator.scala:891)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
        at org.apache.doris.spark.sql.DorisSourceProvider$$anonfun$createRelation$1.apply(DorisSourceProvider.scala:70)
        at org.apache.doris.spark.sql.DorisSourceProvider$$anonfun$createRelation$1.apply(DorisSourceProvider.scala:68)
        at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
        at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:121)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
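The "exceeded the max retry times" failure above comes from a bounded-retry loop around each stream-load flush: the batch is retried a fixed number of times and an IOException is thrown once the budget is exhausted. A minimal sketch of that pattern (not the connector's actual code):

```scala
import java.io.IOException

// Minimal sketch of the bounded-retry pattern behind the
// "exceeded the max retry times" error; not the connector's actual code.
object Retry {
  def withRetries[T](maxRetries: Int)(attempt: () => T): T = {
    var lastError: Throwable = null
    // One initial attempt plus maxRetries retries
    for (_ <- 0 to maxRetries) {
      try {
        return attempt()
      } catch {
        case e: Throwable => lastError = e // e.g. a failed stream-load HTTP call
      }
    }
    throw new IOException(s"exceeded the max retry times ($maxRetries)", lastError)
  }
}
```

The practical implication is that the stack trace only shows the last failure; the underlying cause of each rejected flush is in the BE-side log, which is why the BE log below matters.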

Before this log (which I have omitted) is the actual data that failed to write to Doris. At the same time, the BE printed the log below:

(screenshot of BE log attached)

I have attached the complete files for both error messages, in case they are helpful:

spark.crash.log be.error.log

Below is the disk space info for one of the hosts a BE lives on: (screenshot attached)

The disk space is not exhausted. One of the log lines mentions tablet 14931, so I checked its data storage: (screenshot attached)

The result is below, with more detail about the tablet: (screenshots attached)

Finally, here is my be.conf: (screenshot attached)

Thanks a lot for any help; I have already run out of solutions I can try.

I compiled the code from the master branch; the commit id is 319f1f634a53f99deb2d2ee50d12defe17995516. Detailed version info: (screenshot attached)

Using EXPLAIN to check the table shows: (screenshot attached)

foobarrer commented 2 years ago
failed to submit compaction task for tablet: 11044, err: failed to prepare compaction task and calculate permits, tablet_id=11044, compaction_type=1, permit=0, current_permit=0, status=prepare compaction with err: -808
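The message says the task "failed to prepare compaction ... and calculate permits" and was rejected with `permit=0`: compaction tasks apparently must obtain a number of permits from a shared budget before being submitted, and a task whose preparation fails gets zero permits and is dropped. A toy illustration of such a permit-gated scheduler (my reading of the message, not Doris's actual C++ implementation):

```scala
// Toy illustration of a permit-gated task scheduler; hypothetical,
// not Doris's actual compaction code.
class PermitScheduler(totalPermits: Int) {
  private var available = totalPermits

  // Returns true if `needed` permits could be acquired and the task submitted.
  def submit(tabletId: Long, needed: Int): Boolean = synchronized {
    if (needed <= 0 || needed > available) {
      // Mirrors "permit=0 ... failed to submit compaction task for tablet"
      false
    } else {
      available -= needed
      true
    }
  }

  // Permits are returned to the budget when a task finishes.
  def release(n: Int): Unit = synchronized { available += n }
}
```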

I created a new Doris cluster with 3 BEs and 1 FE on 4 new machines with more disk and memory. But after creating just a database and a table, the same log shows up AGAIN. What does it mean? I can't read the C++ code, which is why I am asking again. Is there anybody willing to help this helpless and innocent poor guy...