Unable to run Gluten compiled with AOCC (derived from clang 16)

manoj-kumar-tg commented 6 months ago

Backend

VL (Velox)

Bug description

zetta_spark3@dc78cde26b0c:~$ ./spark/bin/spark-sql --master local  --conf spark.executor.instances=3  --conf spark.executor.cores=5 --conf spark.memory.offHeap.enabled=true --conf spark.executor.memory=40g --conf spark.driver.cores=5 --driver-memory 50g   --conf spark.plugins=io.glutenproject.GlutenPlugin --conf spark.memory.offHeap.enabled=true --conf spark.memory.offHeap.size=20g --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager --conf spark.gluten.sql.columnar.backend.velox.SplitPreloadPerDriver=0 --database tpcds_100g -f tpcds-data-gen/spark-sql-perf/src/main/resources/tpcds_2_4/q3.sql
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/05/17 15:02:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/05/17 15:02:17 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
24/05/17 15:02:17 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
24/05/17 15:02:19 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
24/05/17 15:02:19 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore zetta_spark3@172.17.0.2
24/05/17 15:02:19 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark master: local, Application Id: local-1715938336665
24/05/17 15:02:22 WARN SparkShimProvider: Spark runtime version 3.3.2 is not matched with Gluten's fully tested version 3.3.1
E0517 15:02:23.888530 1561776 Exceptions.h:69] Line: /home/zetta/incubator-gluten-1.1.1/ep/build-velox/build/velox_ep/./velox/common/memory/MemoryPool.h:881, Function:sanityCheckLocked, Expression:  Bad memory usage track state: Memory Pool[op.0.0.0.TableScan LEAF root[WholeStageIterator_root] parent[node.0] MALLOC track-usage thread-safe]<unlimited max capacity unlimited capacity used 16777216.00TB available 2.69KB reservation [used 16777216.00TB, reserved 0B, min 0B] counters [allocs 566, frees 565, reserves 0, releases 1, collisions 0])>, Source: RUNTIME, ErrorCode: INVALID_STATE
terminate called after throwing an instance of 'facebook::velox::VeloxRuntimeError'
  what():  Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: Bad memory usage track state: Memory Pool[op.0.0.0.TableScan LEAF root[WholeStageIterator_root] parent[node.0] MALLOC track-usage thread-safe]<unlimited max capacity unlimited capacity used 16777216.00TB available 2.69KB reservation [used 16777216.00TB, reserved 0B, min 0B] counters [allocs 566, frees 565, reserves 0, releases 1, collisions 0])>
Retriable: False
Function: sanityCheckLocked
File: /home/zetta/incubator-gluten-1.1.1/ep/build-velox/build/velox_ep/./velox/common/memory/MemoryPool.h
Line: 881
Stack trace:
# 0  facebook::velox::VeloxException::VeloxException(char const*, unsigned long, char const*, std::basic_string_view<char, std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> >, bool, facebook::velox::VeloxException::Type, std::basic_string_view<char, std::char_traits<char> >)
# 1  void facebook::velox::detail::veloxCheckFail<facebook::velox::VeloxRuntimeError, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&>(facebook::velox::detail::VeloxCheckFailArgs const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
# 2  facebook::velox::memory::MemoryPoolImpl::releaseThreadSafe(unsigned long, bool)
# 3  facebook::velox::FlatVector<facebook::velox::StringView>::~FlatVector()
# 4  facebook::velox::DictionaryVector<facebook::velox::StringView>::~DictionaryVector()
# 5  facebook::velox::RowVector::~RowVector()
# 6  void __gnu_cxx::new_allocator<gluten::VeloxColumnarBatch>::destroy<gluten::VeloxColumnarBatch>(gluten::VeloxColumnarBatch*)
# 7  std::_Hashtable<long, std::pair<long const, std::shared_ptr<void> >, std::allocator<std::pair<long const, std::shared_ptr<void> > >, std::__detail::_Select1st, std::equal_to<long>, std::hash<long>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_erase(unsigned long, std::__detail::_Hash_node_base*, std::__detail::_Hash_node<std::pair<long const, std::shared_ptr<void> >, false>*)
# 8  gluten::ObjectStore::release(long)
# 9  Java_io_glutenproject_columnarbatch_ColumnarBatchJniWrapper_close
# 10 0x00007ff1ad017da6
# 11 0x00007ff1ad007fd3
# 12 0x00007ff1ad007fd3
# 13 0x00007ff1ad007fd3
# 14 0x00007ff1ad007fd3
# 15 0x00007ff1ad007fd3
# 16 0x00007ff1ad007d5f
# 17 0x00007ff1aeff030f
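
A note on the `used 16777216.00TB` figure in the error above: 16777216 is 2^24, so if the log's "TB" means 1024^4 bytes (an assumption inferred from the number itself, not checked against the Velox sources), the reported usage is about 2^64 bytes. One plausible reading is that a usage counter dropped slightly below zero and was printed as an unsigned 64-bit value. The sketch below only reproduces that arithmetic. The helper names and the -64 byte value are hypothetical, and it does not identify the root cause.

```python
# Illustrative arithmetic only; names and values here are hypothetical.
UINT64_MODULUS = 1 << 64
BYTES_PER_TB = 1 << 40  # assumption: the log's "TB" is 1024**4 bytes


def as_uint64(value: int) -> int:
    """Reinterpret a possibly negative integer as an unsigned 64-bit value."""
    return value % UINT64_MODULUS


def format_tb(num_bytes: int) -> str:
    """Format a byte count in TB with two decimals, mimicking the log line."""
    return f"{num_bytes / BYTES_PER_TB:.2f}TB"


# A counter that ended up 64 bytes below zero (arbitrary example value).
wrapped = as_uint64(-64)
print(format_tb(wrapped))  # prints 16777216.00TB
```

Any small negative value gives the same printed figure, so the log alone does not say how far below zero the counter went.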

Spark version

Spark-3.3.x

Spark configurations

Spark config:
(spark.app.name,org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver)
(spark.app.submitTime,1716290104384)
(spark.driver.cores,5)
(spark.driver.memory,50g)
(spark.eventLog.dir,file:///home/zetta_spark3/sparkEventLogs)
(spark.eventLog.enabled,true)
(spark.executor.cores,5)
(spark.executor.instances,3)
(spark.executor.memory,40g)
(spark.executor.memoryOverhead,500m)
(spark.hadoop.spark.parquet.binaryAsString,false)
(spark.hadoop.spark.sql.caseSensitive,false)
(spark.history.fs.logDirectory,file:///home/zetta_spark3/sparkEventLogs)
(spark.jars,)
(spark.kryoserializer.buffer.max,1g)
(spark.master,local)
(spark.memory.offHeap.enabled,true)
(spark.memory.offHeap.size,20g)
(spark.plugins,io.glutenproject.GlutenPlugin)
(spark.repl.local.jars,)
(spark.serializer,org.apache.spark.serializer.KryoSerializer)
(spark.shuffle.manager,org.apache.spark.shuffle.sort.ColumnarShuffleManager)
(spark.sql.adaptive.advisoryPartitionSizeInBytes,100M)
(spark.sql.autoBroadcastJoinThreshold,200M)
(spark.sql.catalogImplementation,hive)
(spark.sql.dynamicPartitionPruning.enabled,true)
(spark.sql.join.preferSortMergeJoin,false)
(spark.sql.legacy.parquet.nanosAsLong,true)
(spark.sql.optimizer.dynamicPartitionPruning.enforceBroadcastReuse,true)
(spark.sql.shuffle.partitions,400)
(spark.submit.deployMode,client)
(spark.submit.pyFiles,)
(spark.ui.retainedJobs,5000)
(spark.ui.retainedStages,5000)
(spark.ui.showConsoleProgress,true)

System information

zetta@dc78cde26b0c:~$ bash incubator-gluten-1.1.1/dev/info.sh

Velox System Info v0.0.2
Commit: Not in a git repo.
CMake Version: 3.22.1
System: Linux-5.15.0-107-generic
Arch: x86_64
C++ Compiler: /opt/AMD/aocc-compiler-4.2.0/bin/clang++
C++ Compiler Version: 16.0.3
C Compiler: /opt/AMD/aocc-compiler-4.2.0/bin/clang
C Compiler Version: 16.0.3
CMake Prefix Path: /usr/local;/usr;/;/usr;/usr/local;/usr/X11R6;/usr/pkg;/opt

Relevant logs

No response

manoj-kumar-tg commented 6 months ago

FYI ... @PHILO-HE @FelixYBW

FelixYBW commented 6 months ago

With the same parameters, does it pass if you build with GCC?

Manoj-red-hat commented 6 months ago

Yes, it works with GCC. I am just wondering what configuration needs to be added for a Clang build.

Manoj-red-hat commented 6 months ago

@FelixYBW, if I need to build Gluten with clang 16, what steps should I take?

FelixYBW commented 6 months ago

I'm not sure. We have never tried clang 16.

Manoj-red-hat commented 6 months ago

> I'm not sure. We have never tried clang 16.

So officially clang is not supported, am I right?

FelixYBW commented 6 months ago

> So officially clang is not supported, am I right?

The Intel Gluten team doesn't use clang, and none of our customers have required it. I'm not sure whether anyone in the community has tried it.

Manoj-red-hat commented 6 months ago

> The Intel Gluten team doesn't use clang, and none of our customers have required it. I'm not sure whether anyone in the community has tried it.

OK, I will run some more experiments on this and post an update in this thread 👍
