apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/
Apache License 2.0

[CH] Clickhouse backend reads HDFS file exception #7285

Open ASiegeLion opened 1 month ago

ASiegeLion commented 1 month ago

Backend

CH (ClickHouse)

Bug description

2024-09-20T10:21:35.670057896+08:00 20. Java_org_apache_gluten_vectorized_BatchIterator_nativeHasNext @ 0x0000000005e765d7
2024-09-20T10:21:35.670060237+08:00
2024-09-20T10:21:35.670062920+08:00 at org.apache.gluten.vectorized.GeneralOutIterator.hasNext(GeneralOutIterator.java:39)
2024-09-20T10:21:35.670065532+08:00 at org.apache.gluten.backendsapi.clickhouse.CollectMetricIterator.hasNext(CHIteratorApi.scala:332)
2024-09-20T10:21:35.670068131+08:00 at org.apache.gluten.vectorized.CloseableCHColumnBatchIterator.$anonfun$hasNext$1(CloseableCHColumnBatchIterator.scala:42)
2024-09-20T10:21:35.670070518+08:00 at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
2024-09-20T10:21:35.670073013+08:00 at org.apache.gluten.metrics.GlutenTimeMetric$.withNanoTime(GlutenTimeMetric.scala:41)
2024-09-20T10:21:35.670075814+08:00 at org.apache.gluten.vectorized.CloseableCHColumnBatchIterator.hasNext(CloseableCHColumnBatchIterator.scala:42)
2024-09-20T10:21:35.670078261+08:00 at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
2024-09-20T10:21:35.670080648+08:00 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
2024-09-20T10:21:35.670083225+08:00 at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
2024-09-20T10:21:35.670096914+08:00 at org.apache.spark.util.random.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:41)
2024-09-20T10:21:35.670105585+08:00 at org.apache.spark.RangePartitioner$.$anonfun$sketch$1(Partitioner.scala:306)
2024-09-20T10:21:35.670113927+08:00 at org.apache.spark.RangePartitioner$.$anonfun$sketch$1$adapted(Partitioner.scala:304)
2024-09-20T10:21:35.670116839+08:00 at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:915)
2024-09-20T10:21:35.670119327+08:00 at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:915)
2024-09-20T10:21:35.670122000+08:00 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
2024-09-20T10:21:35.670124902+08:00 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
2024-09-20T10:21:35.670127818+08:00 at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
2024-09-20T10:21:35.670130570+08:00 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
2024-09-20T10:21:35.670133342+08:00 at org.apache.spark.scheduler.Task.run(Task.scala:131)
2024-09-20T10:21:35.670135847+08:00 at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
2024-09-20T10:21:35.670138536+08:00 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
2024-09-20T10:21:35.670141095+08:00 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
2024-09-20T10:21:35.670144107+08:00 ... 3 more
2024-09-20T10:21:35.670150461+08:00 Caused by: org.apache.gluten.exception.GlutenException: Unable to connect to HDFS: HdfsRpcException: RPC channel to "fs-hiido-yycluster01-yynn1.hiido.host.int.yy.com:38020" got protocol mismatch: RPC channel cannot find pending call: id = -33.: While executing SubstraitFileSource

Spark executor log: spark-kubernetes-executor.log

Spark driver log: spark-kubernetes-driver.log

Spark version

None

Spark configurations

No response

System information

No response

Relevant logs

No response

ASiegeLion commented 1 month ago

gluten conf: image

yhcast0 commented 1 month ago

I hit the same problem and solved it by changing the Hadoop server-side setting hadoop.rpc.protection from authentication,privacy to authentication (my Hadoop is HDP 3.1.0; screenshot: img_v3_02fh_ef3ecb88-ffd3-4651-89cc-f0dcfb153dag). It looks like disabling encryption (the privacy level) helps. The root cause is not clear to me; I hope someone can explain. Thanks.
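To check which protection levels a cluster currently advertises, the setting can be read from core-site.xml. The following is a minimal Python sketch, not part of Gluten or Hadoop tooling: the sample XML and both helper names (rpc_protection_levels, uses_rpc_encryption) are hypothetical and exist only for illustration. It assumes the property lives in core-site.xml and falls back to authentication, Hadoop's documented default, when the property is absent.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; a real core-site.xml contains many more properties.
SAMPLE_CORE_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.rpc.protection</name>
    <value>authentication,privacy</value>
  </property>
</configuration>
"""


def rpc_protection_levels(core_site_xml):
    """Return the configured hadoop.rpc.protection levels, in order."""
    root = ET.fromstring(core_site_xml)
    for prop in root.findall("property"):
        if prop.findtext("name", default="").strip() == "hadoop.rpc.protection":
            value = prop.findtext("value", default="")
            return [v.strip() for v in value.split(",") if v.strip()]
    # Property not set: Hadoop's documented default is "authentication".
    return ["authentication"]


def uses_rpc_encryption(levels):
    """'privacy' is the level that adds encryption to RPC traffic."""
    return "privacy" in levels


levels = rpc_protection_levels(SAMPLE_CORE_SITE)
print(levels, uses_rpc_encryption(levels))
```

With a real cluster, the string would come from the NameNode's core-site.xml rather than the inline sample. Whether removing privacy resolves the exception is not established by this thread; it worked for one reporter only.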

ASiegeLion commented 4 weeks ago

We changed the configuration as suggested (image), but the exception still occurs (image: 264fb2cdbf60055ae2cfd7ad64c18e3). Executor log: spark-kubernetes-executor.log

Hadoop SecurityAuth.audit log: (image)