apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] ETL failure , Caused by: java.io.FileNotFoundException: No such file or directory #4017

Closed: veenaypatil closed this issue 2 years ago

veenaypatil commented 2 years ago


Describe the problem you faced

We are getting the following error in production for one of the end users' ETLs:

Caused by: java.io.FileNotFoundException: No such file or directory: s3a://bucket/cdcv2/data/in_ums/user_umfnd_s3/2cf933ef-fe51-4e41-8b0d-af7fa5ed2d85-0_87-19419-8663185_20211116163235.parquet
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
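
For reference, the cache invalidation the error message suggests can be issued before re-running the query. A minimal sketch in Scala, assuming `spark` is the active SparkSession; the table name is a placeholder, not taken from this issue:

  // Refresh Spark's cached metadata/file listing for the table (name is a placeholder)
  spark.sql("REFRESH TABLE db.table_name")
  // or, when reading by path, refresh the cached file index for the base path
  spark.catalog.refreshByPath("s3a://bucket/cdcv2/data/in_ums/user_umfnd_s3")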

We had faced the same issue earlier and mitigated it by increasing the cleaner's retained commits to 120 in the Spark streaming job that writes to this location. For reference, the streaming job has a batch interval of 10 minutes, batches complete in about 4 minutes on average, and compaction, which is triggered after 4 commits, takes 40-50 minutes, so roughly we have around 8 hours of commits.

The user is running the ETL on Spark 2.x; it is a combination of Spark SQL and Spark Core code.

To Reproduce

Steps to reproduce the behavior:

  1. We consistently get the same error even after retrying the ETL.


Environment Description

Additional context


The above configs are from the older cluster where the ETL ran. All other ETLs running on Spark 3 and Hive 3 are running fine. As mentioned earlier, one of the ETLs had failed on the newer cluster as well, but after we increased the cleaner commits config it has not failed there again.

Stacktrace

Caused by: java.io.FileNotFoundException: No such file or directory: s3a://bucket/cdcv2/data/in_ums/user_umfnd_s3/2cf933ef-fe51-4e41-8b0d-af7fa5ed2d85-0_87-19419-8663185_20211116163235.parquet
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
veenaypatil commented 2 years ago

cc @vinothchandar @xushiyan

vinothchandar commented 2 years ago

@YannByron Thanks for all the great contributions! Do you have any clues here? :)

YannByron commented 2 years ago

@veenaypatil To confirm: you use KEEP_LATEST_COMMITS as the cleaner policy and set CLEANER_COMMITS_RETAINED to 120? Otherwise, can you share all of your cleaner options?

veenaypatil commented 2 years ago

@YannByron that's right, these are the hoodie configs set for the streaming job:

hoodieConfigs:
  hoodie.datasource.write.operation: upsert
  hoodie.datasource.write.table.type: MERGE_ON_READ
  hoodie.datasource.write.partitionpath.field: ""
  hoodie.datasource.write.keygenerator.class: org.apache.hudi.keygen.NonpartitionedKeyGenerator
  hoodie.datasource.hive_sync.partition_extractor_class: org.apache.hudi.hive.NonPartitionedExtractor
  hoodie.parquet.max.file.size: 6110612736
  hoodie.compact.inline: true
  hoodie.compact.inline.max.delta.seconds: 3000
  hoodie.commits.archival.batch: 5
  hoodie.clean.automatic: true
  hoodie.clean.async: true
  hoodie.cleaner.policy: KEEP_LATEST_COMMITS
  hoodie.cleaner.commits.retained: 120
  hoodie.keep.min.commits: 130
  hoodie.keep.max.commits: 131
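
For illustration only, options like these are typically passed to a Structured Streaming write into Hudi roughly as below; a sketch assuming `df` is the incoming streaming DataFrame, with the table name and checkpoint location as placeholders not confirmed in this thread:

  // Sketch: wiring the above Hudi options into a structured streaming sink
  df.writeStream
    .format("hudi")
    .option("hoodie.table.name", "user_umfnd_s3")  // placeholder table name
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
    .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.NonpartitionedKeyGenerator")
    .option("hoodie.cleaner.policy", "KEEP_LATEST_COMMITS")
    .option("hoodie.cleaner.commits.retained", "120")
    .option("checkpointLocation", "s3a://bucket/checkpoints/user_umfnd_s3")  // placeholder
    .start("s3a://bucket/cdcv2/data/in_ums/user_umfnd_s3")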
veenaypatil commented 2 years ago

Once the user migrated the code to Spark 3 the ETL is running fine, so it seems like an issue with Spark 2 caching.
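
If a job has to stay on Spark 2.x, a possible mitigation (not verified in this thread) is the one the error message itself suggests: clear cached relations and recreate the Dataset/DataFrame before the failing stage. A minimal sketch, with a placeholder table name:

  // Hypothetical workaround on Spark 2.x: drop cached plans/file listings, then re-read
  spark.catalog.clearCache()
  val freshDf = spark.table("db.table_name")  // placeholder; recreate the DataFrame used by the ETL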

xushiyan commented 2 years ago

@veenaypatil which Spark 2.x version did you use exactly? Hudi supports 2.4+.

veenaypatil commented 2 years ago

@xushiyan we were on version 2.3.2 on the older cluster; on the new one it is 3.0.2, where it worked. I am closing this issue as the ETL is working after migrating to Spark 3.x.