acruise opened this issue 2 months ago
I rebuilt Gluten from the HEAD of main as of ~1 hour ago and got the same result; here's my command line:
/opt/spark/bin/spark-shell \
--jars ~/incubator-gluten/package/target/gluten-velox-bundle-spark3.5_2.12-ubuntu_22.04_x86_64-1.3.0-SNAPSHOT.jar \
--packages org.apache.hadoop:hadoop-aws:3.3.4 \
-c spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain \
-c spark.plugins=org.apache.gluten.GlutenPlugin \
-c spark.memory.offHeap.enabled=true \
-c spark.memory.offHeap.size=32G
edit: after a few minutes I got this log:
24/09/12 00:05:42 WARN GlutenFallbackReporter: Validation failed for plan: Exchange[QueryId=0], due to: [FallbackByBackendSettings] Validation failed on node Exchange.
Maybe unrelated: when I Ctrl-C the spark shell, I get this:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007408d6f8555d, pid=325442, tid=325442
#
# JRE version: OpenJDK Runtime Environment (17.0.12+7) (build 17.0.12+7-Ubuntu-1ubuntu222.04)
# Java VM: OpenJDK 64-Bit Server VM (17.0.12+7-Ubuntu-1ubuntu222.04, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# C [libvelox.so+0x558555d] Aws::Http::CleanupHttp()+0x4d
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /home/alex/incubator-gluten/core.325442)
#
# An error report file with more information is saved as:
# /home/alex/incubator-gluten/hs_err_pid325442.log
#
# If you would like to submit a bug report, please visit:
# https://bugs.launchpad.net/ubuntu/+source/openjdk-17
#
My JVM is:
$ java -version
openjdk version "17.0.12" 2024-07-16
OpenJDK Runtime Environment (build 17.0.12+7-Ubuntu-1ubuntu222.04)
OpenJDK 64-Bit Server VM (build 17.0.12+7-Ubuntu-1ubuntu222.04, mixed mode, sharing)
Here are some stack dumps from selected threads that I think look interesting. There are ~22 threads in ColumnarBatchOutIterator.nativeHasNext, but I haven't checked whether all of their stack traces are the same:
"Executor task launch worker for task 21.0 in stage 0.0 (TID 21)" #154 daemon prio=5 os_prio=0 cpu=114.55ms elapsed=8.06s tid=0x00007aa8ac03f050 nid=0x62f8e runnable [0x00007aa7dc5fd000]
java.lang.Thread.State: RUNNABLE
at org.apache.gluten.vectorized.ColumnarBatchOutIterator.nativeHasNext(Native Method)
at org.apache.gluten.vectorized.ColumnarBatchOutIterator.hasNextInternal(ColumnarBatchOutIterator.java:61)
at org.apache.gluten.vectorized.GeneralOutIterator.hasNext(GeneralOutIterator.java:37)
at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
at org.apache.gluten.iterator.IteratorsV1$InvocationFlowProtection.hasNext(IteratorsV1.scala:159)
at org.apache.gluten.iterator.IteratorsV1$IteratorCompleter.hasNext(IteratorsV1.scala:71)
at org.apache.gluten.iterator.IteratorsV1$PayloadCloser.hasNext(IteratorsV1.scala:37)
at org.apache.gluten.iterator.IteratorsV1$LifeTimeAccumulator.hasNext(IteratorsV1.scala:100)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator.isEmpty(Iterator.scala:387)
at scala.collection.Iterator.isEmpty$(Iterator.scala:387)
at org.apache.spark.InterruptibleIterator.isEmpty(InterruptibleIterator.scala:28)
at org.apache.gluten.execution.VeloxColumnarToRowExec$.toRowIterator(VeloxColumnarToRowExec.scala:124)
at org.apache.gluten.execution.VeloxColumnarToRowExec.$anonfun$doExecuteInternal$1(VeloxColumnarToRowExec.scala:79)
at org.apache.gluten.execution.VeloxColumnarToRowExec$$Lambda$4616/0x00007aa9bd4e47c0.apply(Unknown Source)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:858)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:858)
at org.apache.spark.rdd.RDD$$Lambda$4617/0x00007aa9bd4b3c90.apply(Unknown Source)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
at org.apache.spark.scheduler.Task.run(Task.scala:141)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
at org.apache.spark.executor.Executor$TaskRunner$$Lambda$4178/0x00007aa9bd4585a0.apply(Unknown Source)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.12/ThreadPoolExecutor.java:1136)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.12/ThreadPoolExecutor.java:635)
at java.lang.Thread.run(java.base@17.0.12/Thread.java:840)
And one of these:
"s3a-transfer-alex-dev-testing-unbounded-pool2-t1" #122 daemon prio=5 os_prio=0 cpu=9.87ms elapsed=11.35s tid=0x00007aa794760770 nid=0x62f65 waiting on condition [0x00007aa7de79b000]
java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@17.0.12/Native Method)
- parking to wait for <0x00000000c8664748> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(java.base@17.0.12/LockSupport.java:252)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@17.0.12/AbstractQueuedSynchronizer.java:1674)
at java.util.concurrent.LinkedBlockingQueue.poll(java.base@17.0.12/LinkedBlockingQueue.java:460)
at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@17.0.12/ThreadPoolExecutor.java:1061)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.12/ThreadPoolExecutor.java:1122)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.12/ThreadPoolExecutor.java:635)
at java.lang.Thread.run(java.base@17.0.12/Thread.java:840)
None of the other stacks contain any frames from an org.apache.gluten class.
There are eight of these:
"map-output-dispatcher-0" #66 daemon prio=5 os_prio=0 cpu=0.17ms elapsed=20.02s tid=0x00007aaa3721c950 nid=0x62f14 waiting on condition [0x00007aa9ba421000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@17.0.12/Native Method)
- parking to wait for <0x00000000c865e7f0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(java.base@17.0.12/LockSupport.java:341)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block(java.base@17.0.12/AbstractQueuedSynchronizer.java:506)
at java.util.concurrent.ForkJoinPool.unmanagedBlock(java.base@17.0.12/ForkJoinPool.java:3465)
at java.util.concurrent.ForkJoinPool.managedBlock(java.base@17.0.12/ForkJoinPool.java:3436)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@17.0.12/AbstractQueuedSynchronizer.java:1625)
at java.util.concurrent.LinkedBlockingQueue.take(java.base@17.0.12/LinkedBlockingQueue.java:435)
at org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:759)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.12/ThreadPoolExecutor.java:1136)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.12/ThreadPoolExecutor.java:635)
at java.lang.Thread.run(java.base@17.0.12/Thread.java:840)
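For reference, thread dumps like the ones above can be captured with the JDK's jstack tool; a minimal sketch, where the pid is a placeholder for the driver/executor JVM (found via jps):
jps -lm
jstack -l <pid> > threads.txt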
I'll stop for now. :)
I tried the same query with the same dataset on a local filesystem; I got the same GlutenFallbackReporter warning, but the query completed successfully. I'll try GCS next to isolate whether S3 per se is the issue.
On GCS, I get this log, even though I've configured gcloud auth according to the doc.
W0912 18:17:40.602993 411440 GCSFileSystem.cpp:303] Config::gcsCredentials is empty
Without the Gluten jar and plugin on the spark command line, this query is successful reading from GCS.
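For context, here's a minimal sketch of how a service account JSON key file is usually passed to the Hadoop GCS connector on the spark-shell command line (the key names come from the gcs-connector docs and vary by connector version; the key file path is a placeholder, not my actual config). The Config::gcsCredentials warning above suggests the Velox-native GCS path may read a separate setting:
-c spark.hadoop.google.cloud.auth.service.account.enable=true \
-c spark.hadoop.google.cloud.auth.service.account.json.keyfile=/path/to/service-account.json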
Hi @acruise, for the GCS run, is it running on a GCE VM under the same project as the GCS bucket? If not, you would have to provide a service account json key file to get around the auth.
> you would have to provide a service account json key file to get around the auth.
I always do, as you can see from my "For GCS" section above. If I were trying to run without credentials, I would expect to get an exception immediately, rather than an eternal wait. :)
You can attach gdb to the JVM process and show the C++ call stack.
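For example, a minimal sketch that attaches gdb to the hung JVM and dumps the native (C++) backtrace of every thread; the pid is a placeholder for the executor/driver process:
gdb -p <pid> -batch -ex "set pagination off" -ex "thread apply all bt" > native-stacks.txt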
Backend
VL (Velox)
Bug description
I have a TPC-DS dataset in ORC format, duplicated on S3 and GCS. On vanilla Spark 3.5.1 on a single node, these queries complete in 1-3 seconds:
S3:
GCS:
With Gluten enabled, initializing the DataFrame is fine, but when I invoke count(), the expected number of tasks is spawned, yet they do nothing at all. I've tried Gluten builds from the 1.2.0 tag as well as from 73100f49a7, and I've tried disabling whole-stage codegen, but it makes no difference.
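For illustration, a minimal sketch of the shape of the query in spark-shell (the s3a path is a placeholder, not the actual dataset location):
val df = spark.read.orc("s3a://some-bucket/tpcds/orc/store_sales")
df.count()  // on vanilla Spark this returns in seconds; with Gluten enabled the tasks never make progress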
Spark version
Spark 3.5.1
Spark configurations
For S3:
For GCS:
System information
It's a c5a.8xlarge (64 GB RAM, 32 cores, >100 GB local disk), running Ubuntu 22.04 with nearly all distro updates applied.
Relevant logs
S3:
GCS: