ingridymartinss opened this issue 1 year ago · Status: Open
@ingridymartinss Thanks for raising this. It looks like the index lookup phase is returning a location which doesn't exist; ideally that shouldn't happen. You may need to rebuild your indexes using HoodieIndexer to make it work for now, but that might prove to be a costly operation since the dataset is so big.
But we need to identify the root cause. Can you share the table properties and let us know which writer you are using (Spark datasource, Deltastreamer, or Spark structured streaming)?
@ad1happy2go thanks for the response! I'm using Scala Spark; it's a batch pipeline.
Note: due to company compliance rules I had to hide the table and column names!
Regarding HoodieIndexer: how can I rebuild my indexes?
@ingridymartinss Sorry for the confusion. Since you are on the simple index, HoodieIndexer can't be used here: the simple index doesn't write any index structures anywhere.
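For context, the index type is just a writer property; a hedged illustration of how it would look if set explicitly (the simple index joins incoming keys against the existing base files on every write, which is why there is no separate index dataset to rebuild):

// Illustrative only; GLOBAL_SIMPLE is the variant that looks up keys across all partitions.
.option("hoodie.index.type", "SIMPLE")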
Can you share the writer configurations and the Hudi timeline from around the time the interruption happened? I also see this is a non-partitioned table. We would like to check whether any cleaner or clustering operation ran just before that.
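In case the timeline can still be captured later, a minimal sketch (the bucket/table path is an illustrative placeholder) of listing the .hoodie timeline files from a spark-shell with the Hadoop FileSystem API:

import org.apache.hadoop.fs.Path

// Placeholder base path; replace with the actual table location on S3.
val basePath = "s3://my-bucket/my-table"
val hoodieDir = new Path(s"$basePath/.hoodie")

// Reuse the Hadoop configuration of the running SparkSession (spark-shell provides `spark`).
val fs = hoodieDir.getFileSystem(spark.sparkContext.hadoopConfiguration)

// Each instant (commit, clean, rollback, replacecommit, ...) is a file under .hoodie,
// so the sorted file names show whether a clean or clustering ran just before the failure.
fs.listStatus(hoodieDir).map(_.getPath.getName).sorted.foreach(println)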
Can you also try with Hudi 0.12.3?
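For reference, a hedged sketch of what that version bump could look like in an sbt build, assuming Spark 3.2 and Scala 2.12 (the exact bundle artifact depends on how Hudi is pulled into your job; on EMR the platform may supply the Hudi jar instead):

// build.sbt: illustrative dependency only.
libraryDependencies += "org.apache.hudi" % "hudi-spark3.2-bundle_2.12" % "0.12.3"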
Hello @ad1happy2go! Yes, my table does not have any partition column. Unfortunately I don't have the Hudi timeline anymore, but we have a class with some default configurations:
hoodie_datasource_write_operation: Option[String] = Some("upsert"),
hoodie_bulkinsert_shuffle_parallelism: Option[String] = None,
hoodie_upsert_shuffle_parallelism: Option[String] = None,
hoodie_datasource_write_partitionpath_field: Option[String] = None,
hoodie_parquet_small_file_limit: String = "629145600",
hoodie_parquet_max_file_size: String = "1073741824",
hoodie_parquet_block_size: String = "629145600",
hoodie_copyonwrite_record_size_estimate: String = "1024",
hoodie_datasource_write_precombine_field: String = "deduplicationColumn",
hoodie_datasource_write_recordkey_field: String = "columnPK",
hoodie_datasource_write_keygenerator_class: String = "org.apache.hudi.keygen.SimpleKeyGenerator",
hoodie_datasource_write_table_name: String = "tableName",
hoodie_datasource_hivesync_table: String = "false",
hoodie_datasource_hivesync_database: String = "false",
hoodie_datasource_hivesync_enable: String = "false",
hoodie_datasource_hivesync_partitionextractorclass: String = "org.apache.hudi.hive.MultiPartKeysValueExtractor",
hoodie_datasource_hivesync_partitionfields: Option[String] = None,
hoodie_datasource_hivesync_jdbcurl: String = s"jdbc:hive2://${hiveUrl}:${port}",
hoodie_datasource_write_hivestylepartitioning: String = "false",
hoodie_datasource_write_row_writer_enable: Option[String] = Some("false"),
hoodie_datasource_hivesync_supporttimestamp: String = "true",
hoodie_clean_async: String = "true",
hoodie_cleaner_commits_retained: String = "36",
hoodie_keep_min_commits: String = "37",
hoodie_keep_max_commits: String = "38",
hoodie_fail_on_timeline_archiving: String = "false",
hoodie_datasource_hivesync_autocreatedatabase: String = "true",
hoodie_bulkinsert_sort_mode: String = "GLOBAL_SORT",
hoodie_datasource_hivesync_mode: String = "hms"
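For context, a minimal sketch of how a config class in this shape is typically translated into Hudi Spark datasource options at write time (the DataFrame and the S3 path below are illustrative placeholders, not our actual job):

import org.apache.spark.sql.SaveMode

// df is the batch DataFrame to upsert; the table path is a placeholder.
df.write
  .format("hudi")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.recordkey.field", "columnPK")
  .option("hoodie.datasource.write.precombine.field", "deduplicationColumn")
  .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.SimpleKeyGenerator")
  .option("hoodie.table.name", "tableName")
  .option("hoodie.clean.async", "true")
  .option("hoodie.cleaner.commits.retained", "36")
  .option("hoodie.keep.min.commits", "37")
  .option("hoodie.keep.max.commits", "38")
  .mode(SaveMode.Append)
  .save("s3://my-bucket/my-table")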
@ingridymartinss Sorry for the delay here. Were you able to resolve this issue? If yes, please share the insights; if not, let us know and I will work on it. Thanks.
We haven't solved it yet. :(
Thanks for the response, @ingridymartinss. We will work on this.
@ingridymartinss Sorry for the delay. I tried to reproduce this error with the configs you provided, but was unable to reproduce it with a 1 TB dataset. I also tried failing the job in between, but never hit any issue. Can you reproduce this on a sample dataset? Can you share some more information that would help me reproduce it?
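If it helps narrow things down, here is a hedged sketch of the kind of self-contained reproduction that would be most useful, assuming the failure mode can be simulated by killing an upsert mid-write on a non-partitioned copy-on-write table (the schema, record count, table name, and paths below are made up for illustration):

import org.apache.spark.sql.SaveMode

// Run in a spark-shell with the Hudi bundle on the classpath.
// Small synthetic dataset mirroring the reported key/precombine column names.
val df = spark.range(0, 1000000L)
  .selectExpr("id as columnPK", "current_timestamp() as deduplicationColumn", "rand() as value")

def upsert(path: String): Unit =
  df.write
    .format("hudi")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.recordkey.field", "columnPK")
    .option("hoodie.datasource.write.precombine.field", "deduplicationColumn")
    .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.SimpleKeyGenerator")
    .option("hoodie.table.name", "repro_table")
    .mode(SaveMode.Append)
    .save(path)

// 1. upsert(basePath) once to bootstrap the table.
// 2. upsert(basePath) again and kill the driver mid-write to mimic the EKS interruption.
// 3. upsert(basePath) a third time and check whether the index lookup error reappears.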
Describe the problem you faced
We have a pipeline with approximately 2 billion records and 95 columns that runs every day. Yesterday, during execution, there was an interruption in the EKS cluster we use to run the pipeline (we don't know at which point of the execution it happened, whether the write was still inflight or already committing). After that, when we tried to run the next execution, we got the error below:
We tried everything: running with low volume, with high volume, changing Spark resources, and we always got the same error. We also tried to create a savepoint, but when performing the restore we got errors as well. Given that, how should we proceed?
Note: I checked the entire log and there are no other errors, just this one repeating several times.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Despite the interruption, I expected the pipeline to start working again.
Environment Description
EMR on EKS
Hudi version : 0.12.2
Spark version : 3.2.1
Hive version : -
Hadoop version : -
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : Running on EKS