apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/
Apache License 2.0

[VL] Folly F14Table.h rehashImpl assert failure #5875

Open xingnailu opened 3 months ago

xingnailu commented 3 months ago

Backend

VL (Velox)

Bug description

I am using Gluten (tag v1.1.1) + Velox + folly + Spark 3.4.2 + YARN, built on CentOS 8 aarch64 and running on aarch64. While a YARN container is reading S3 data, it throws: Assertion failure: hp.second == srcChunk->tag(srcI)

Spark version

None

Spark configurations

spark.app.attempt.id 1
spark.app.id application_1693383838041_3359264
spark.app.name xxx
spark.app.startTime 1716774272681
spark.app.submitTime 1716774261696
spark.celeborn.master.endpoints cem-0.cem.bigdata.svc.cluster.local:9097
spark.compact.default.filesystem hdfs:/xxxx
spark.compact.smallfile.amount 1000
spark.default.parallelism 800
spark.driver.cores 1
spark.driver.extraJavaOptions -Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/jdk.internal.ref=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false -Ddubbo.application.qos.enable=false -Duser.timeZone=GMT+08 -Dcom.amazonaws.services.s3.enableV4=true -Djava.net.preferIPv4Stack=true -XX:MetaspaceSize=512m -XX:MaxMetaspaceSize=512m -XX:MaxDirectMemorySize=2g -XX:+UseCompressedOops -XX:ParallelGCThreads=8 -XX:ConcGCThreads=4 -XX:+UseG1GC -XX:SoftRefLRUPolicyMSPerMB=0 -XX:OnOutOfMemoryError="kill -9 %p" -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC -Xloggc:/gc.log -XX:MaxDirectMemorySize=2048m
spark.driver.host 100-64-115-176.bigdata.pod.cluster.local
spark.driver.memory 3g
spark.driver.memoryOverhead 2G
spark.driver.port 44823
spark.dynamicAllocation.enabled false
spark.dynamicAllocation.initialExecutors 0
spark.dynamicAllocation.maxExecutors 200
spark.dynamicAllocation.minExecutors 0
spark.dynamicAllocation.schedulerBacklogTimeout 10s
spark.eventLog.dir s3a://xxx/igdata-sparkhistoryserver/jhs/
spark.eventLog.enabled true
spark.executor.cores 4
spark.executor.extraJavaOptions -Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/jdk.internal.ref=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false -Ddubbo.application.qos.enable=false -Duser.timeZone=GMT+08 -Dcom.amazonaws.services.s3.enableV4=true -Djava.net.preferIPv4Stack=true -XX:MetaspaceSize=512m -XX:MaxMetaspaceSize=512m -XX:MaxDirectMemorySize=1g -XX:+UseCompressedOops -XX:ParallelGCThreads=8 -XX:ConcGCThreads=4 -XX:+UseG1GC -XX:SoftRefLRUPolicyMSPerMB=0 -XX:OnOutOfMemoryError="kill -9 %p" -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC -Xloggc:/gc.log -XX:MaxDirectMemorySize=3686m
spark.executor.heartbeatInterval 60s
spark.executor.id driver
spark.executor.instances 10
spark.executor.memory 4g
spark.executor.memoryOverhead 4G
spark.executorEnv.PYTHONPATH {{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.10.9.7-src.zip<CPS>{{PWD}}/pyspark-3.1.2-20230509.zip<CPS>{{PWD}}/py4j-0.10.9-src.zip<CPS>{{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.10.9-src.zip
spark.gluten.loadLibFromJar true
spark.gluten.memory.conservative.task.offHeap.size.in.bytes 402653184
spark.gluten.memory.offHeap.size.in.bytes 3221225472
spark.gluten.memory.task.offHeap.size.in.bytes 805306368
spark.gluten.sql.session.timeZone.default UTC
spark.hadoop.fs.s3.access.key *****(redacted)
spark.hadoop.fs.s3.connection.ssl.enabled false
spark.hadoop.fs.s3.endpoint s3.ap-southeast-1.amazonaws.com
spark.hadoop.fs.s3.getObject.initialSocketTimeoutMilliseconds 2000
spark.hadoop.fs.s3.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3.path.style.access false
spark.hadoop.fs.s3.secret.key *****(redacted)
spark.hadoop.fs.s3a.access.key *****(redacted)
spark.hadoop.fs.s3a.aws.credentials.provider org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
spark.hadoop.fs.s3a.connection.ssl.enabled false
spark.hadoop.fs.s3a.endpoint s3.ap-southeast-1.amazonaws.com
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.path.style.access false
spark.hadoop.fs.s3a.secret.key *****(redacted)
spark.hadoop.fs.s3n.access.key *****(redacted)
spark.hadoop.fs.s3n.connection.ssl.enabled false
spark.hadoop.fs.s3n.endpoint s3.ap-southeast-1.amazonaws.com
spark.hadoop.fs.s3n.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3n.path.style.access false
spark.hadoop.fs.s3n.secret.key *****(redacted)
spark.hadoop.hive.exec.dynamic.partition.mode nonstrict
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored true
spark.hadoop.mapreduce.input.fileinputformat.split.minsize 268435456
spark.hadoop.orc.overwrite.output.file true
spark.history.fs.logDirectory s3a:/xxxx/yarn-eks/bigdata-sparkhistoryserver/jhs/
spark.kryoserializer.buffer.max 128m
spark.livy.owner wireless
spark.livy.spark_major_version 3
spark.locality.wait 0s
spark.master yarn
spark.maxRemoteBlockSizeFetchToMem 512m
spark.memory.offHeap.enabled true
spark.memory.offHeap.size 3g
spark.network.timeout 120s

spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS yarnrm1b-0.yarnrm.bigdata.svc.cluster.local,yarnrm1b-1.yarnrm.bigdata.svc.cluster.local
spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES http://yarnrm1b-0.yarnrm.bigdata.svc.cluster.local:8088/proxy/application_1693383838041_3359264,http://yarnrm1b-1.yarnrm.bigdata.svc.cluster.local:8088/proxy/application_1693383838041_3359264
spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.RM_HA_URLS yarnrm1b-0.yarnrm.bigdata.svc.cluster.local:8088,yarnrm1b-1.yarnrm.bigdata.svc.cluster.local:8088
spark.plugins io.glutenproject.GlutenPlugin
spark.reducer.maxBlocksInFlightPerAddress 1000
spark.reducer.maxReqsInFlight 1000
spark.repl.class.outputDir /data/data1/yarn/nm/usercache/hive/appcache/application_1693383838041_3359264/container_e32_1693383838041_3359264_01_000001/tmp/spark7852604390862141239
spark.repl.class.uri spark://xxx:44823/classes
spark.scheduler.mode FIFO
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.shuffle.consolidateFiles true
spark.shuffle.io.maxRetries 5
spark.shuffle.io.retryWait 10
spark.shuffle.manager org.apache.spark.shuffle.gluten.celeborn.CelebornShuffleManager
spark.shuffle.registration.maxAttempts 5
spark.shuffle.registration.timeout 120000
spark.shuffle.service.enabled false
spark.shuffle.useOldFetchProtocol true
spark.speculation true
spark.speculation.interval 10000
spark.speculation.quantile 0.95
spark.sql.adaptive.advisoryPartitionSizeInBytes 134217728
spark.sql.adaptive.enabled true
spark.sql.adaptive.localShuffleReader.enabled false
spark.sql.adaptive.shuffle.targetPostShuffleInputSize 268435456
spark.sql.autoBroadcastJoinThreshold 20971520
spark.sql.broadcastTimeout 1200
spark.sql.catalogImplementation hive
spark.sql.compatible.check.enabled false
spark.sql.extensions io.glutenproject.GlutenSessionExtensions
spark.sql.files.maxPartitionBytes 268435456
spark.sql.files.openCostInBytes 8388608
spark.sql.files.readParallelism 1
spark.sql.hive.caseSensitiveInferenceMode NEVER_INFER
spark.sql.legacy.allowHashOnMapType true
spark.sql.legacy.timeParserPolicy LEGACY
spark.sql.mapKeyDedupPolicy LAST_WIN
spark.sql.orc.compression.codec zlib
spark.sql.parquet.fs.optimized.committer.optimization-enabled false
spark.sql.parquet.writeLegacyFormat true
spark.sql.shuffle.partitions 400
spark.sql.sources.parallelPartitionDiscovery.parallelism 10
spark.sql.storeAssignmentPolicy LEGACY
spark.storage.decommission.enabled true
spark.storage.decommission.fallbackStorage.path s3a://bi.oppo.com/yarn-eks/bigdata-default/
spark.storage.decommission.rddBlocks.enabled true
spark.storage.decommission.shuffleBlocks.enabled true
spark.submit.deployMode cluster
spark.ui.filters org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
spark.ui.port 0
spark.yarn.am.extraJavaOptions -Ddubbo.application.qos.enable=false -Dcom.amazonaws.services.s3.enableV4=true -Djava.net.preferIPv4Stack=true -XX:MetaspaceSize=512m -XX:MaxMetaspaceSize=512m -XX:MaxDirectMemorySize=2g -XX:+UseCompressedOops -XX:ParallelGCThreads=8 -XX:ConcGCThreads=4 -XX:+UseG1GC -XX:SoftRefLRUPolicyMSPerMB=0 -XX:OnOutOfMemoryError="kill -9 %p" -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC -Xloggc:/gc.log
spark.yarn.am.memory 2g
spark.yarn.am.waitTime 600s
spark.yarn.app.container.log.dir /data/data1/yarn/container-logs/application_1693383838041_3359264/container_e32_1693383838041_3359264_01_000001
spark.yarn.app.id application_1693383838041_3359264
spark.yarn.archive s3a:/xxxxapp/spark/spark3.4.2-gluten1.1.1-centos8-arm-20240523-1619.zip
spark.yarn.dist.archives s3a://xxx/app/spark/sparkr/sparkr-20230509.zip#sparkr
spark.yarn.dist.files file:///opt/apache-livy-0.7.1-incubating-SNAPSHOT-bin/conf/yarn/hive-site.xml
spark.yarn.historyServer.address sparkhs:18080
spark.yarn.isPython true
spark.yarn.maxAppAttempts 1
spark.yarn.priority 4
spark.yarn.queue root.wireless_sg.daily
spark.yarn.secondary.jars AdsUDF.2.0.0.jar,appstore_exposure_udf-1.0.jar,bdp_udf-1.0-SNAPSHOT.jar,hive_udf-1.0-jar-ip.jar,hive_udf-1.0-jar-position.jar,hive_udf-1.0.jar,hive_udfs-1.0.0.jar,universe_cdo_expose_obj_opt_format.jar,GeoIpParse.jar,kryo-shaded-4.0.2.jar,livy-api-0.7.0-incubating-SNAPSHOT.jar,livy-rsc-0.7.0-incubating-SNAPSHOT.jar,livy-thriftserver-session-0.7.0-incubating-SNAPSHOT.jar,minlog-1.3.0.jar,netty-all-4.1.47.Final.jar,objenesis-2.5.1.jar,commons-codec-1.9.jar,livy-client-common-0.7.0-incubating-SNAPSHOT.jar,livy-core_2.12-0.7.0-incubating-SNAPSHOT.jar,livy-repl_2.12-0.7.0-incubating-SNAPSHOT.jar,datanucleus-rdbms-4.1.19.jar,datanucleus-core-4.1.17.jar,datanucleus-api-jdo-4.2.4.jar
spark.yarn.submit.waitAppCompletion false

System information

Velox System Info v0.0.2
Commit: 8a935d575aa0a38c3324f7ee98c87b576eb7ad70
CMake Version: 3.20.2
System: Linux-4.18.0-240.10.1.el8_3.aarch64
Arch: aarch64
C++ Compiler: /opt/rh/gcc-toolset-10/root/usr/bin/c++
C++ Compiler Version: 10.3.1
C Compiler: /opt/rh/gcc-toolset-10/root/usr/bin/cc
C Compiler Version: 10.3.1
CMake Prefix Path: /usr/local;/usr;/;/usr;/usr/local;/usr/X11R6;/usr/pkg;/opt


Relevant logs

[2024-05-27 01:45:46.687]Container exited with a non-zero exit code 134. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/bin/bash: line 1: 2322107 Aborted                 (core dumped) /usr/lib/jvm/java-1.8.0/bin/java -server -Xmx4096m '-Djava.net.preferIPv6Addresses=false' '-XX:+IgnoreUnrecognizedVMOptions' '--add-opens=java.base/java.lang=ALL-UNNAMED' '--add-opens=java.base/java.lang.invoke=ALL-UNNAMED' '--add-opens=java.base/java.lang.reflect=ALL-UNNAMED' '--add-opens=java.base/java.io=ALL-UNNAMED' '--add-opens=java.base/java.net=ALL-UNNAMED' '--add-opens=java.base/java.nio=ALL-UNNAMED' '--add-opens=java.base/java.util=ALL-UNNAMED' '--add-opens=java.base/java.util.concurrent=ALL-UNNAMED' '--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED' '--add-opens=java.base/jdk.internal.ref=ALL-UNNAMED' '--add-opens=java.base/sun.nio.ch=ALL-UNNAMED' '--add-opens=java.base/sun.nio.cs=ALL-UNNAMED' '--add-opens=java.base/sun.security.action=ALL-UNNAMED' '--add-opens=java.base/sun.util.calendar=ALL-UNNAMED' '--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED' '-Djdk.reflect.useDirectMethodHandle=false' '-Ddubbo.application.qos.enable=false' '-Duser.timeZone=GMT+08' '-Dcom.amazonaws.services.s3.enableV4=true' '-Djava.net.preferIPv4Stack=true' '-XX:MetaspaceSize=512m' '-XX:MaxMetaspaceSize=512m' '-XX:MaxDirectMemorySize=1g' '-XX:+UseCompressedOops' '-XX:ParallelGCThreads=8' '-XX:ConcGCThreads=4' '-XX:+UseG1GC' '-XX:SoftRefLRUPolicyMSPerMB=0' '-XX:OnOutOfMemoryError=kill -9 %p' '-verbose:gc' '-XX:+PrintGCDetails' '-XX:+PrintGCTimeStamps' '-XX:+PrintGCDateStamps' '-XX:+PrintHeapAtGC' '-Xloggc:/data/data1/yarn/container-logs/application_1693383838041_3359264/container_e32_1693383838041_3359264_01_000003/gc.log' '-XX:MaxDirectMemorySize=3686m' -Djava.io.tmpdir=/data/data1/yarn/nm/usercache/hive/appcache/application_1693383838041_3359264/container_e32_1693383838041_3359264_01_000003/tmp '-Dspark.network.timeout=120s' '-Dspark.driver.port=44823' '-Dspark.ui.port=0' 
-Dspark.yarn.app.container.log.dir=/data/data1/yarn/container-logs/application_1693383838041_3359264/container_e32_1693383838041_3359264_01_000003 org.apache.spark.executor.YarnCoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@100-64-115-176.bigdata.pod.cluster.local:44823 --executor-id 2 --hostname xx.bigdata.pod.cluster.local --cores 4 --app-id application_1693383838041_3359264 --resourceProfileId 0 > /data/data1/yarn/container-logs/application_1693383838041_3359264/container_e32_1693383838041_3359264_01_000003/stdout 2> /data/data1/yarn/container-logs/application_1693383838041_3359264/container_e32_1693383838041_3359264_01_000003/stderr
Last 4096 bytes of stderr :
bytes in memory (estimated size 203.7 KiB, free 5.2 GiB)
24/05/27 01:45:35 INFO TorrentBroadcast: Reading broadcast variable 4 took 81 ms
24/05/27 01:45:35 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 612.5 KiB, free 5.2 GiB)
24/05/27 01:45:36 INFO deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize

24/05/27 01:45:39 INFO BaseAllocator: Debug mode disabled. Enable with the VM option -Darrow.memory.debug.allocator=true.
24/05/27 01:45:39 INFO DefaultAllocationManagerOption: allocation manager type not specified, using netty as the default type
24/05/27 01:45:39 INFO CheckAllocator: Using DefaultAllocationManager at memory/DefaultAllocationManagerFactory.class
24/05/27 01:45:40 INFO TorrentBroadcast: Started reading broadcast variable 1 with 1 pieces (estimated total size 4.0 MiB)
24/05/27 01:45:40 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 33.3 KiB, free 5.2 GiB)
24/05/27 01:45:40 INFO TorrentBroadcast: Reading broadcast variable 1 took 9 ms
24/05/27 01:45:40 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 48.1 KiB, free 5.2 GiB)

Assertion failure: hp.second == srcChunk->tag(srcI)
Message: 
File: /usr/local/include/folly/container/detail/F14Table.h
Line: 2064
Function: rehashImpl
xingnailu commented 3 months ago

@PHILO-HE Please take a look

PHILO-HE commented 2 months ago

@xingnailu, it may be a bug in folly. Could you re-test with the Gluten main branch? The folly version has been upgraded since 1.1.1.