apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] SparkSQL hangs indefinitely during Hudi table read operation #10232

Open jonathantransb opened 7 months ago

jonathantransb commented 7 months ago

Describe the problem you faced

I'm attempting to read a Hudi table registered in the Glue Catalog using SparkSQL with the metadata table enabled. However, my job hangs indefinitely at a certain step. Even with DEBUG logging enabled, I'm unable to find any indication of what is causing this. Notably, the problem only occurs with Hudi tables where clean is the latest action in the timeline.
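
For reference, the timeline state can be double-checked from the same spark-shell session before running the query. This is only a sketch using the Hudi meta client, with the table path left as a placeholder:

    import org.apache.hudi.common.table.HoodieTableMetaClient

    // Sketch: print the last completed instant on the table's timeline;
    // in the failing case this is expected to be a clean.
    val metaClient = HoodieTableMetaClient.builder()
      .setConf(spark.sparkContext.hadoopConfiguration)
      .setBasePath("s3://<bucket>/<schema>/<table>")
      .build()

    println(metaClient.getActiveTimeline.filterCompletedInstants.lastInstant)
    // e.g. Option{val=[20231202193157845__clean__COMPLETED__20231202193208000]}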

To Reproduce

Steps to reproduce the behavior:

  1. Create a Hudi table where clean is the latest action in the timeline (a hedged sketch of one way to produce such a table is shown after this list)

  2. Open spark-shell

    spark-shell \
    --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
    --conf "spark.sql.parquet.filterPushdown=true" \
    --conf "spark.sql.parquet.mergeSchema=false" \
    --conf "spark.speculation=false" \
    --conf "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2" \
    --conf "spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem" \
    --conf "spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem" \
    --conf "spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain" \
    --conf "spark.sql.catalogImplementation=hive" \
    --conf "spark.sql.catalog.spark_catalog.type=hive" \
    --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog" \
    --conf "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension" \
    --conf "spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar"
  3. Run spark.sql():

    scala> spark.sql("SET hoodie.metadata.enable=true")
    scala> spark.sql("SELECT * FROM <schema>.<table> LIMIT 50").show()

Expected behavior

Spark job can read the table without hanging

Environment Description

Additional context

I encountered no issues while using Hudi version 0.13.1. However, upon trying the new Hudi 0.14.0 version, I experienced this problem.
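
A possible way to narrow this down (sketch only, not something I have verified) is to re-run the same query with the metadata table disabled on the read path and check whether the hang disappears:

    // Sketch: same query, but with metadata-based file listing disabled,
    // to check whether the hang is specific to the metadata table read path.
    spark.sql("SET hoodie.metadata.enable=false")
    spark.sql("SELECT * FROM <schema>.<table> LIMIT 50").show()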

The driver pod consistently uses up to 1 CPU core, although I'm not sure which processes are keeping it busy.

For tables where commit is the latest action in the timeline, Hudi 0.14.0 can read the table without any hanging issues.

Stacktrace

23/12/03 12:21:23 INFO HiveConf: Found configuration file file:/opt/spark/conf/hive-site.xml
23/12/03 12:21:23 INFO HiveClientImpl: Warehouse location for Hive client (version 2.3.9) is file:/opt/spark/work-dir/spark-warehouse
23/12/03 12:21:23 INFO AWSGlueClientFactory: Using region from ec2 metadata : ap-southeast-1
23/12/03 12:21:24 INFO AWSGlueClientFactory: Using region from ec2 metadata : ap-southeast-1
23/12/03 12:21:26 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
23/12/03 12:21:26 INFO MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
23/12/03 12:21:26 INFO MetricsSystemImpl: s3a-file-system metrics system started
23/12/03 12:21:27 WARN SDKV2Upgrade: Directly referencing AWS SDK V1 credential provider com.amazonaws.auth.DefaultAWSCredentialsProviderChain. AWS SDK V1 credential providers will be removed once S3A is upgraded to SDK V2
23/12/03 12:21:28 WARN DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
23/12/03 12:21:28 WARN DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
23/12/03 12:21:28 INFO DataSourceUtils: Getting table path..
23/12/03 12:21:28 INFO TablePathUtils: Getting table path from path : s3://<bucket>/<schema>/<table>
23/12/03 12:21:28 INFO DefaultSource: Obtained hudi table path: s3://<bucket>/<schema>/<table>
23/12/03 12:21:28 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://<bucket>/<schema>/<table>
23/12/03 12:21:28 INFO HoodieTableConfig: Loading table properties from s3://<bucket>/<schema>/<table>/.hoodie/hoodie.properties
23/12/03 12:21:28 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3://<bucket>/<schema>/<table>
23/12/03 12:21:28 INFO DefaultSource: Is bootstrapped table => false, tableType is: COPY_ON_WRITE, queryType is: snapshot
23/12/03 12:21:28 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20231202193157845__clean__COMPLETED__20231202193208000]}
23/12/03 12:21:28 INFO TableSchemaResolver: Reading schema from s3://<bucket>/<schema>/<table>/c_day=20231130/cce9afd1-46a1-4668-b8b4-0ac697f1ed57-0_3-21-2099_20231202191150326.parquet
23/12/03 12:21:29 INFO S3AInputStream: Switching to Random IO seek policy
23/12/03 12:21:29 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://<bucket>/<schema>/<table>
23/12/03 12:21:29 INFO HoodieTableConfig: Loading table properties from s3://<bucket>/<schema>/<table>/.hoodie/hoodie.properties
23/12/03 12:21:29 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3://<bucket>/<schema>/<table>
23/12/03 12:21:29 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://<bucket>/<schema>/<table>/.hoodie/metadata
23/12/03 12:21:29 INFO HoodieTableConfig: Loading table properties from s3://<bucket>/<schema>/<table>/.hoodie/metadata/.hoodie/hoodie.properties
23/12/03 12:21:29 INFO HoodieTableMetaClient: Finished Loading Table of type MERGE_ON_READ(version=1, baseFileFormat=HFILE) from s3://<bucket>/<schema>/<table>/.hoodie/metadata
23/12/03 12:21:29 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20231202193157845__deltacommit__COMPLETED__20231202193207000]}
23/12/03 12:21:29 INFO AbstractTableFileSystemView: Took 2 ms to read  0 instants, 0 replaced file groups
23/12/03 12:21:30 INFO ClusteringUtils: Found 0 files in pending clustering operations
23/12/03 12:21:30 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20231202193157845__clean__COMPLETED__20231202193208000]}
23/12/03 12:21:30 INFO BaseHoodieTableFileIndex: Refresh table stg_tracking_unified__click, spent: 365 ms
23/12/03 12:21:30 DEBUG HoodieFileIndex: Unable to compose relative partition path prefix from the predicates; falling back to fetching all partitions
23/12/03 12:21:30 INFO HoodieTableMetadataUtil: Loading latest merged file slices for metadata table partition files
23/12/03 12:21:30 INFO AbstractTableFileSystemView: Took 0 ms to read  0 instants, 0 replaced file groups
23/12/03 12:21:30 INFO ClusteringUtils: Found 0 files in pending clustering operations
23/12/03 12:21:30 INFO AbstractTableFileSystemView: Building file system view for partition (files)
23/12/03 12:21:30 DEBUG AbstractTableFileSystemView: #files found in partition (files) =15, Time taken =26
23/12/03 12:21:30 DEBUG HoodieTableFileSystemView: Adding file-groups for partition :files, #FileGroups=1
23/12/03 12:21:30 DEBUG AbstractTableFileSystemView: addFilesToView: NumFiles=15, NumFileGroups=1, FileGroupsCreationTime=11, StoreTimeTaken=1
23/12/03 12:21:30 DEBUG AbstractTableFileSystemView: Time to load partition (files) =40
23/12/03 12:21:30 INFO HoodieBackedTableMetadata: Opened metadata base file from s3://<bucket>/<schema>/<table>/.hoodie/metadata/files/files-0000_0-31-2210_20231201193545608001.hfile at instant 20231201193545608001 in 14 ms
23/12/03 12:21:30 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20231202193157845__clean__COMPLETED__20231202193208000]}

[It's stuck at this point. No further logs are printed after this]
ad1happy2go commented 7 months ago

@jonathantransb Thanks for raising this, and sorry for the delay here. In case you have tried it already: are you facing this issue with OSS Hudi and without the Glue catalog? I will try to check that out and get back to you.
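
For reference, the table can also be read directly by its base path, which keeps the Glue catalog out of the picture entirely. A minimal sketch with the same placeholder path and the metadata table still enabled:

    // Sketch: read the same table by base path so no catalog is involved,
    // keeping the metadata table enabled as in the original repro.
    val df = spark.read.format("hudi")
      .option("hoodie.metadata.enable", "true")
      .load("s3://<bucket>/<schema>/<table>")

    df.limit(50).show()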

jonathantransb commented 7 months ago

@ad1happy2go Thank you for handling this. I haven't tried the settings without the Glue catalog yet.

ad1happy2go commented 5 months ago

@jonathantransb Sorry for the delay here, I missed it. I will work on this and get back to you soon.

ad1happy2go commented 5 months ago

@jonathantransb Are you still facing this issue? Would it be possible to hop on a call to understand this better? I did try jobs with 0.14 but wasn't able to reproduce any such issue.