apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/
Apache License 2.0

[VL] Hadoop Kerberos support #2744

Open ziwuse opened 1 year ago

ziwuse commented 1 year ago

Backend

VL (Velox)

Bug description

Jar used: gluten-velox-bundle-spark3.2_2.12-centos_7-1.0.0.jar

The HDFS cluster I am using has Kerberos authentication enabled. Following the documentation, I added the configuration: --conf spark.executorEnv.LIBHDFS3_CONF="hdfs-client.xml" --files /path/hdfs-client.xml,/tmp/krb5cc_xxx --conf spark.executorEnv.KRB5CCNAME=krb5cc_xxx

hdfs-client.xml:

<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>

Submitting a simple job in yarn-client mode fails. Any advice or help would be appreciated.
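For reference, the submission described above can be sketched as a full spark-submit command. This is an illustrative reconstruction, not a verified working command: the jar location, application jar, and ticket-cache file name are placeholders taken from this report, and the `spark.plugins` line assumes the usual Gluten 1.0 plugin class.

```shell
# Hedged sketch of the yarn-client submission from this report.
# Paths, the krb5cc_xxx cache name, and your-app.jar are placeholders.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --jars /path/gluten-velox-bundle-spark3.2_2.12-centos_7-1.0.0.jar \
  --conf spark.plugins=io.glutenproject.GlutenPlugin \
  --conf spark.executorEnv.LIBHDFS3_CONF="hdfs-client.xml" \
  --files /path/hdfs-client.xml,/tmp/krb5cc_xxx \
  --conf spark.executorEnv.KRB5CCNAME=krb5cc_xxx \
  your-app.jar
```

Note that `spark.executorEnv.KRB5CCNAME` points at the file name as distributed by `--files` (relative to the executor working directory), while `--files` ships the locally cached ticket; the ticket will not be renewed on executors, which is one limitation of this cache-shipping approach.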

Reason: Unable to connect to HDFS: ossbHA, got error: HdfsRpcException: Failed to invoke RPC call "getFsStats" on server "xxxxx:8020"  Caused by: HdfsIOException: Cannot initialize client (2): Unknown SASL mechanism.
Retriable: False
Expression: hdfsClient_ != nullptr
Context: Split [file hdfs://ossbHA/data/sparkTest/test.snappy.parquet 0 - 113270130] Task Gluten stage-1 task-1
Top-Level Context: Same as context.
Function: Impl
File: /WorkSpace/gluten/ep/build-velox/build/velox_ep/velox/connectors/hive/storage_adapters/hdfs/HdfsFileSystem.cpp
Line: 48
Stack trace:
# 0  _ZN8facebook5velox7process10StackTraceC1Ei
# 1  _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
# 2  _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorERKSsEEvRKNS1_18VeloxCheckFailArgsET0_
# 3  _ZN8facebook5velox11filesystems14HdfsFileSystemC2ERKSt10shared_ptrIKNS0_6ConfigEERKNS1_19HdfsServiceEndpointE
# 4  _ZN5folly15basic_once_flagINS_15SharedMutexImplILb0EvSt6atomicNS_24SharedMutexPolicyDefaultEEES2_E14call_once_slowIZNK8facebook5velox11filesystemsUlSt10shared_ptrIKNS8_6ConfigEESt17basic_string_viewIcSt11char_traitsIcEEE_clESD_SH_EUlvE_JEEEvOT_DpOT0_
# 5  _ZNK8facebook5velox11filesystemsUlSt10shared_ptrIKNS0_6ConfigEESt17basic_string_viewIcSt11char_traitsIcEEE_clES5_S9_.isra.0
# 6  _ZNSt17_Function_handlerIFSt10shared_ptrIN8facebook5velox11filesystems10FileSystemEES0_IKNS2_6ConfigEESt17basic_string_viewIcSt11char_traitsIcEEENS3_UlS8_SC_E_EE9_M_invokeERKSt9_Any_dataOS8_OSC_
# 7  _ZN8facebook5velox11filesystems13getFileSystemESt17basic_string_viewIcSt11char_traitsIcEESt10shared_ptrIKNS0_6ConfigEE
# 8  _ZN8facebook5velox19FileHandleGeneratorclERKSs
# 9  _ZN8facebook5velox13CachedFactoryISsNS0_10FileHandleENS0_19FileHandleGeneratorENS0_15FileHandleSizerESt8equal_toISsESt4hashISsEE8generateERKSs
# 10 _ZN8facebook5velox9connector4hive14HiveDataSource8addSplitESt10shared_ptrINS1_14ConnectorSplitEE
# 11 _ZN8facebook5velox4exec9TableScan9getOutputEv
# 12 _ZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEE
# 13 _ZN8facebook5velox4exec6Driver4nextERSt10shared_ptrINS1_13BlockingStateEE
# 14 _ZN8facebook5velox4exec4Task4nextEPN5folly10SemiFutureINS3_4UnitEEE
# 15 _ZN6gluten24WholeStageResultIterator4nextEv
# 16 Java_io_glutenproject_vectorized_ColumnarBatchOutIterator_nativeHasNext
# 17 0x00007fb0e85a5a34

        at io.glutenproject.vectorized.ColumnarBatchOutIterator.nativeHasNext(Native Method)
        at io.glutenproject.vectorized.ColumnarBatchOutIterator.hasNextInternal(ColumnarBatchOutIterator.java:47)
        at io.glutenproject.vectorized.GeneralOutIterator.hasNext(GeneralOutIterator.java:37)
        at io.glutenproject.backendsapi.velox.IteratorHandler$$anon$2.hasNext(IteratorHandler.scala:240)
        at io.glutenproject.vectorized.CloseableColumnBatchIterator.hasNext(CloseableColumnBatchIterator.scala:41)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at org.apache.spark.shuffle.ColumnarShuffleWriter.internalWrite(ColumnarShuffleWriter.scala:104)
        at org.apache.spark.shuffle.ColumnarShuffleWriter.write(ColumnarShuffleWriter.scala:204)
        at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Spark version

spark3.2.1-hadoop2.7

Spark configurations

No response

System information

No response

Relevant logs

No response

rhh777 commented 1 year ago

Have you solved this problem? I ran into it too.

ziwuse commented 1 year ago

> Have you solved this problem? I ran into it too.

Not yet.

Stove-hust commented 1 year ago

+1

linyucan-jk commented 1 year ago

This still hasn't been resolved, right? I don't know when the next release will be...

rhh777 commented 1 year ago

Try again after this PR is merged into the 1.0 branch: https://github.com/oap-project/gluten/pull/1706

ziwuse commented 1 year ago

> Try again after this PR is merged into the 1.0 branch: #1706

Thanks.

zhouyuan commented 1 year ago

@ziwuse @rhh777 is HA enabled in your cluster? This patch https://github.com/oap-project/gluten/pull/1706 supports delegation tokens (DT) just as vanilla Spark does; it missed the 1.0 release.

thanks, -yuan
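With delegation-token support in place, the expectation would be that the standard vanilla-Spark Kerberos settings apply, so the executors no longer need a shipped ticket cache. A minimal sketch, assuming the patch above; the principal and keytab path are placeholders:

```shell
# Standard Spark 3.x Kerberos options (principal/keytab are placeholders).
# Spark obtains HDFS delegation tokens at submission and distributes them
# to executors; with DT support, Gluten's native HDFS reads can use them
# the same way vanilla Spark tasks do.
spark-submit \
  --master yarn \
  --conf spark.kerberos.principal=user@EXAMPLE.COM \
  --conf spark.kerberos.keytab=/path/to/user.keytab \
  your-app.jar
```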

ziwuse commented 1 year ago

> @ziwuse @rhh777 is HA enabled in your cluster? This patch #1706 supports delegation tokens (DT) just as vanilla Spark does; it missed the 1.0 release.
>
> thanks, -yuan

Yes, HA is enabled.

zhouyuan commented 1 year ago

> This still hasn't been resolved, right? I don't know when the next release will be...

@linyucan-jk @ziwuse Indeed, we missed one release this September due to a lack of resources. We plan to ship a minor release (1.0.1) by the end of this month.

thanks, -yuan

rhh777 commented 1 year ago

> @ziwuse @rhh777 is HA enabled in your cluster? This patch #1706 supports delegation tokens (DT) just as vanilla Spark does; it missed the 1.0 release.
>
> thanks, -yuan

I have multiple clusters, both HA and non-HA.

fyp711 commented 11 months ago

> > This still hasn't been resolved, right? I don't know when the next release will be...
>
> @linyucan-jk @ziwuse Indeed, we missed one release this September due to a lack of resources. We plan to ship a minor release (1.0.1) by the end of this month.
>
> thanks, -yuan

Hi @zhouyuan, will 1.0.1 be released today? Thanks.

zhouyuan commented 11 months ago

@fyp711 Yes, I will upload the binary jar today or tomorrow.

fyp711 commented 11 months ago

> @fyp711 Yes, I will upload the binary jar today or tomorrow.

That sounds great! Thanks for your reply.