apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0

java.lang.ExceptionInInitializerError: Could not find comet-git-info.properties #1026

Closed: BjarkeTornager closed this 1 month ago

BjarkeTornager commented 1 month ago

Describe the bug

I followed the building-from-source guide since I am on macOS. The only difference is that I ran the build for Spark 3.3: make release-nogit PROFILES="-Pspark-3.3".

With the jar produced by the build, I can run Spark with Comet successfully from the terminal like this:

export COMET_JAR=apache-datafusion-comet-0.3.0/spark/target/comet-spark-spark3.3_2.12-0.3.0.jar

$SPARK_HOME/bin/spark-shell \
    --jars $COMET_JAR \
    --conf spark.driver.extraClassPath=$COMET_JAR \
    --conf spark.executor.extraClassPath=$COMET_JAR \
    --conf spark.plugins=org.apache.spark.CometPlugin \
    --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
    --conf spark.comet.explainFallback.enabled=true \
    --conf spark.memory.offHeap.enabled=true \
    --conf spark.memory.offHeap.size=16g

However, when I add the Comet settings to the Spark config options in my own project like this:

"spark.jars": "apache-datafusion-comet-0.3.0/spark/target/comet-spark-spark3.3_2.12-0.3.0.jar",
"spark.driver.extraClassPath": "apache-datafusion-comet-0.3.0/spark/target/comet-spark-spark3.3_2.12-0.3.0.jar",
"spark.executor.extraClassPath": "apache-datafusion-comet-0.3.0/spark/target/comet-spark-spark3.3_2.12-0.3.0.jar",
"spark.plugins": "org.apache.spark.CometPlugin",
"spark.shuffle.manager": "org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager",
"spark.comet.explainFallback.enabled": "true",
"spark.memory.offHeap.enabled": "true",
"spark.memory.offHeap.size": "16g",

and then run a Spark test using pytest, which always succeeds when the Comet configurations above are omitted, I get the following exception:

---------------------------------------------------------------------------- Captured stdout call -----------------------------------------------------------------------------
24/10/20 07:25:32 WARN CometSparkSessionExtensions$CometExecRule: Comet cannot execute some parts of this plan natively (set spark.comet.explainFallback.enabled=false to disable this logging):
HashAggregate
+-  Exchange [COMET: Exchange is not native because the following children are not native (HashAggregate)]
   +-  HashAggregate [COMET: HashAggregate is not native because the following children are not native (Project)]
      +-  Project [COMET: Project is not native because the following children are not native (BroadcastHashJoin)]
         +-  BroadcastHashJoin [COMET: BroadcastHashJoin is not native because the following children are not native (Scan ExistingRDD, BroadcastExchange)]
            :-  Scan ExistingRDD [COMET: Scan ExistingRDD is not supported]
            +- BroadcastExchange
               +- CometProject
                  +- CometFilter
                     +- CometScanWrapper

24/10/20 07:25:32 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.ExceptionInInitializerError
    at org.apache.comet.package$.<init>(package.scala:90)
    at org.apache.comet.package$.<clinit>(package.scala)
    at org.apache.comet.vector.NativeUtil.<init>(NativeUtil.scala:48)
    at org.apache.comet.CometExecIterator.<init>(CometExecIterator.scala:52)
    at org.apache.spark.sql.comet.CometNativeExec.createCometExecIter$1(operators.scala:223)
    at org.apache.spark.sql.comet.CometNativeExec.$anonfun$doExecuteColumnar$6(operators.scala:298)
    at org.apache.spark.sql.comet.ZippedPartitionsRDD.compute(ZippedPartitionsRDD.scala:43)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:136)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.comet.CometRuntimeException: Could not find comet-git-info.properties
    at org.apache.comet.package$CometBuildInfo$.<init>(package.scala:57)
    at org.apache.comet.package$CometBuildInfo$.<clinit>(package.scala)
    ... 23 more

Searching the datafusion-comet source code, it looks like the error comes from here.

Details of environment:

Steps to reproduce

No response

Expected behavior

No response

Additional context

No response

andygrove commented 1 month ago

I confirmed that there is no comet-git-info.properties in the jar file when building from source based on the published instructions.

I propose that we update CometBuildInfo to return default values rather than throw an exception if this file is not present.
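The proposed fallback could look something like the following. This is a minimal Java sketch of the pattern, not the actual Scala implementation in package.scala; the property keys (git.branch, git.commit.id.full) and the "unknown" default are hypothetical placeholders:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

// Hypothetical sketch: load build info from a classpath resource, but fall
// back to default values instead of throwing when the resource is missing.
class CometBuildInfoSketch {
    static final String RESOURCE = "/comet-git-info.properties";

    static Properties load() {
        Properties props = new Properties();
        // getResourceAsStream returns null when the resource is not on the
        // classpath (e.g. a jar built with release-nogit).
        try (InputStream in = CometBuildInfoSketch.class.getResourceAsStream(RESOURCE)) {
            if (in != null) {
                props.load(in);
            }
        } catch (IOException e) {
            // Ignore and fall through to defaults below.
        }
        // Supply defaults for any keys the resource did not provide,
        // rather than raising CometRuntimeException.
        props.putIfAbsent("git.branch", "unknown");
        props.putIfAbsent("git.commit.id.full", "unknown");
        return props;
    }

    public static void main(String[] args) {
        // Prints "unknown" when the properties file is absent.
        System.out.println(load().getProperty("git.branch"));
    }
}
```

With this shape, a jar built without git metadata still initializes, and the build info simply reports placeholder values.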

I suspect that this is also the real root cause of https://github.com/apache/datafusion-comet/issues/1012.