apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0

Memory leak reported by Java Arrow on q12 of CometTPCHQuerySuite #336

Closed viirya closed 1 week ago

viirya commented 3 weeks ago

Describe the bug

#324 fixed a bug in the logical link of CometShuffleExchangeExec, which changed the query plans.

Because of that change, q12 in CometTPCHQuerySuite now fails with a memory leak reported by Java Arrow:

- q12 *** FAILED *** (2 seconds, 244 milliseconds)
  java.lang.Exception: Expected "struct<[l_shipmode:string,high_line_count:bigint,low_line_count:bigint]>", but got "struct<[]>" Schema did not match
-- using default substitutions

select
    l_shipmode,
    sum(case
        when o_orderpriority = '1-URGENT'
            or o_orderpriority = '2-HIGH'
            then 1
        else 0
    end) as high_line_count,
    sum(case
        when o_orderpriority <> '1-URGENT'
            and o_orderpriority <> '2-HIGH'
            then 1
        else 0
    end) as low_line_count
from
    orders,
    lineitem
where
    o_orderkey = l_orderkey
    and l_shipmode in ('MAIL', 'SHIP')
    and l_commitdate < l_receiptdate
    and l_shipdate < l_commitdate
    and l_receiptdate >= date '1994-01-01'
    and l_receiptdate < date '1994-01-01' + interval '1' year
group by
    l_shipmode
order by
    l_shipmode
Output/Exception: java.lang.IllegalStateException
Memory was leaked by query. Memory leaked: (49152)
Allocator(ROOT) 0/49152/180352/9223372036854775807 (res/actual/peak/limit)
Error using configs:
spark.sql.autoBroadcastJoinThreshold=10485760

I spent some time debugging this, and it seems to be caused by native shuffle (CometTPCHQuerySuite currently uses native shuffle). While debugging, I found that the leak occurs in an allocation made in StreamReader. The batch that is read is correctly closed after use, but 49152 bytes still cannot be released from the allocator.

I'm not sure whether this is a bug in Java Arrow.
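
For anyone reading the allocator line in the failure output above, here is a minimal sketch that splits it into its four fields, following the `(res/actual/peak/limit)` legend printed on the same line. `AllocatorStatus` is a hypothetical helper written for this comment; it is not part of Arrow or Comet and uses only the JDK. My understanding is that `actual` is the memory still allocated when the allocator was closed, which matches the "Memory leaked: (49152)" message.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not an Arrow or Comet class): parses the allocator
// summary line that appears in the leak report.
final class AllocatorStatus {
    // Matches e.g. "Allocator(ROOT) 0/49152/180352/9223372036854775807 (res/actual/peak/limit)"
    private static final Pattern LINE =
        Pattern.compile("Allocator\\((\\w+)\\)\\s+(\\d+)/(\\d+)/(\\d+)/(\\d+)");

    final String name;
    final long reserved; // "res"
    final long actual;   // bytes still allocated
    final long peak;     // highest allocation seen
    final long limit;    // configured allocation limit

    private AllocatorStatus(String name, long reserved, long actual, long peak, long limit) {
        this.name = name;
        this.reserved = reserved;
        this.actual = actual;
        this.peak = peak;
        this.limit = limit;
    }

    static AllocatorStatus parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("Not an allocator summary line: " + line);
        }
        return new AllocatorStatus(
            m.group(1),
            Long.parseLong(m.group(2)),
            Long.parseLong(m.group(3)),
            Long.parseLong(m.group(4)),
            Long.parseLong(m.group(5)));
    }

    // A non-zero "actual" value means some buffers were never released.
    boolean hasOutstandingMemory() {
        return actual != 0;
    }

    public static void main(String[] args) {
        AllocatorStatus s = parse(
            "Allocator(ROOT) 0/49152/180352/9223372036854775807 (res/actual/peak/limit)");
        System.out.println(s.name + " still holds " + s.actual + " bytes, peak " + s.peak);
    }
}
```

Read this way, the root allocator had no limit in practice (the limit equals `Long.MAX_VALUE`), peaked at 180352 bytes, and still held 49152 bytes when it was closed.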

Steps to reproduce

No response

Expected behavior

No response

Additional context

No response

viirya commented 1 week ago

Duplicate of #381.