apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0
823 stars 163 forks source link

Spark ColumnarToRowExec cannot pass CometBuffer safety check #1059

Open viirya opened 2 weeks ago

viirya commented 2 weeks ago

Describe the bug

This was found during debugging CI failures of #1050.

One example of failed test is date_add with int scalars in CometExpressionSuite. The query is "SELECT _20 + CAST(2 as $intType) from tbl which has simply a CometScan + CometProject + Spark ColumnarToRowExec.

CometProject (i.e., DataFusion ProjectExec) doesn't store arrays internally. The only possibility that fails the safety check is that the arrays are not released before we fill next values into the CometBuffers.

In Spark ColumnarToRowExec, once it pulls out all rows from current ColumnarBatch, it simply assigns it to null to release the JVM object, but close is never called on the batch object to release vector resources (e.g., for Comet, Arrow arrays). It is more complicated than just add a close call there because Spark uses WritableColumnVector there for some components (e.g., Parquet reader). Once close is called on a WritableColumnVector, it will make the vector "not" writable anymore.

To completely fix it, we need some changes in Spark. I did a quick experiment locally in Spark and verified that if a close is properly called on non WritableColumnVector there, failed tests can pass without failing the safety check.

Steps to reproduce

No response

Expected behavior

No response

Additional context

No response

viirya commented 2 weeks ago

The Spark fix: https://github.com/apache/spark/pull/48767