apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/
Apache License 2.0

[VL] OOM test in CI fails sometimes #6357

Open PHILO-HE opened 1 month ago

PHILO-HE commented 1 month ago

Backend

VL (Velox)

Bug description

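The OOM test in CI fails intermittently. Expected behavior: q23a runs to completion. Actual behavior: the task fails with a VeloxRuntimeError (INVALID_STATE) raised at Driver.cpp:370; see the logs under "Relevant logs".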

Spark version

None

Spark configurations

No response

System information

No response

Relevant logs

Running query q23a with coordinate ISOLATION -> OFF, OFFHEAP_SIZE -> 2g, FLUSH_MODE -> DISABLED (iteration 0)...
Executing SQL query from resource path /tpcds-queries/q23a.sql...
E20240705 12:32:59.361933 15106 Exceptions.h:67] Line: /__w/incubator-gluten/incubator-gluten/ep/build-velox/build/velox_ep/velox/exec/Driver.cpp:370, Function:next, Expression: stop == StopReason::kBlock || stop == StopReason::kAtEnd || stop == StopReason::kAlreadyTerminated || stop == StopReason::kTerminate , Source: RUNTIME, ErrorCode: INVALID_STATE
24/07/05 12:32:59 ERROR TaskResources: Task 139 failed by error: 
org.apache.gluten.exception.GlutenException: org.apache.gluten.exception.GlutenException: Error during calling Java code from native code: org.apache.gluten.exception.GlutenException: org.apache.gluten.exception.GlutenException: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Retriable: False
Expression: stop == StopReason::kBlock || stop == StopReason::kAtEnd || stop == StopReason::kAlreadyTerminated || stop == StopReason::kTerminate
Function: next
File: /__w/incubator-gluten/incubator-gluten/ep/build-velox/build/velox_ep/velox/exec/Driver.cpp
Line: 370
Stack trace:
# 0  
# 1  
# 2  
# 3  
# 4  
# 5  
# 6  
# 7  

    at org.apache.gluten.vectorized.GeneralOutIterator.hasNext(GeneralOutIterator.java:39)
    at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
    at org.apache.gluten.utils.iterator.IteratorsV1$InvocationFlowProtection.hasNext(IteratorsV1.scala:159)
    at org.apache.gluten.utils.iterator.IteratorsV1$IteratorCompleter.hasNext(IteratorsV1.scala:71)
    at org.apache.gluten.utils.iterator.IteratorsV1$PayloadCloser.hasNext(IteratorsV1.scala:37)
    at org.apache.gluten.utils.iterator.IteratorsV1$LifeTimeAccumulator.hasNext(IteratorsV1.scala:100)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at org.apache.gluten.utils.iterator.IteratorsV1$ReadTimeAccumulator.hasNext(IteratorsV1.scala:127)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:32)
    at org.apache.gluten.vectorized.GeneralInIterator.hasNext(GeneralInIterator.java:31)
    at org.apache.gluten.vectorized.ColumnarBatchOutIterator.nativeHasNext(Native Method)
    at org.apache.gluten.vectorized.ColumnarBatchOutIterator.hasNextInternal(ColumnarBatchOutIterator.java:61)
    at org.apache.gluten.vectorized.GeneralOutIterator.hasNext(GeneralOutIterator.java:37)
    at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
    at org.apache.gluten.utils.iterator.IteratorsV1$ReadTimeAccumulator.hasNext(IteratorsV1.scala:127)
    at org.apache.gluten.utils.iterator.IteratorsV1$PayloadCloser.hasNext(IteratorsV1.scala:37)
    at org.apache.gluten.utils.iterator.IteratorsV1$IteratorCompleter.hasNext(IteratorsV1.scala:71)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.shuffle.ColumnarShuffleWriter.internalWrite(ColumnarShuffleWriter.scala:121)
    at org.apache.spark.shuffle.ColumnarShuffleWriter.write(ColumnarShuffleWriter.scala:231)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
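
For triage across several CI runs, the `Key: value` fields of the VeloxRuntimeError block above can be pulled out of a log automatically. The following is only a minimal sketch, assuming the block keeps the layout shown in this log; `parse_velox_error` and `FIELDS` are hypothetical names used for illustration and are not part of Gluten or Velox.

```python
# Field names as they appear in the VeloxRuntimeError block of the log above.
FIELDS = ("Error Source", "Error Code", "Retriable",
          "Expression", "Function", "File", "Line")


def parse_velox_error(log_text):
    """Return a dict of the 'Key: value' fields found in a Velox error block.

    Only lines that START with a known field name are matched, so the glog
    line (which embeds 'Line: ...' mid-line) and the Java stack frames are
    ignored. The first occurrence of each field wins.
    """
    result = {}
    for raw in log_text.splitlines():
        line = raw.strip()
        for field in FIELDS:
            prefix = field + ": "
            if line.startswith(prefix) and field not in result:
                result[field] = line[len(prefix):].strip()
    return result


# Sample copied from the log in this issue.
SAMPLE = """\
Error Source: RUNTIME
Error Code: INVALID_STATE
Retriable: False
Expression: stop == StopReason::kBlock || stop == StopReason::kAtEnd || stop == StopReason::kAlreadyTerminated || stop == StopReason::kTerminate
Function: next
File: /__w/incubator-gluten/incubator-gluten/ep/build-velox/build/velox_ep/velox/exec/Driver.cpp
Line: 370
"""

if __name__ == "__main__":
    for key, value in parse_velox_error(SAMPLE).items():
        print(f"{key}: {value}")
```

Grouping failed runs by the `File`, `Line`, and `Error Code` values is one way to check whether the intermittent failures all hit the same check in `Driver.cpp`.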

ulysses-you commented 1 month ago

I'm hitting this issue too.

zhztheplayer commented 1 month ago

I'm aware of this. I'll take a look when possible.