[SUPPORT] Error when running a pipeline after an interrupt

ingridymartinss commented 1 year ago

Describe the problem you faced

Currently, we have a pipeline with approximately 2 billion records and 95 columns that runs every day. Yesterday, during execution, the EKS cluster that runs the pipeline suffered an intermittent failure (we don't know at which point of the execution, whether the write was still inflight or already committing). After that, when we tried to run the next execution, we got the error below:

[Screenshot 2023-08-24 at 10 13 32]

We tried everything: running with low and high data volumes and changing Spark resources, and we always got the same error. We also tried to create a savepoint, but restoring from it failed with errors as well. In that case, how should we proceed?

Note: I checked the entire log and there are no other errors, just this one, repeated several times.

To Reproduce

Steps to reproduce the behavior:

1. Start a pipeline with Hudi 0.12.0 running on EKS
2. Stop the pipeline abruptly
3. Error

Expected behavior

Despite the interruption, I expected the pipeline to start working again.

Environment Description

EMR on EKS
Hudi version : 0.12.2
Spark version : 3.2.1
Hive version : -
Hadoop version : -
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : Running on EKS

ad1happy2go commented 1 year ago

@ingridymartinss Thanks for raising this. It looks like the index lookup phase is returning a location that doesn't exist. Ideally, that shouldn't happen. You may need to rebuild your indexes using HoodieIndexer to make it work for now, but this might prove to be a costly operation as the dataset is so big.

We still need to identify the root cause. Can you share the table properties with us and let us know which writer you are using (Spark datasource, DeltaStreamer, or Spark Structured Streaming)?

ingridymartinss commented 1 year ago

@ad1happy2go thanks for the response! I'm using Scala Spark; it's a batch pipeline. [Screenshot 2023-08-24 at 11 44 20]

Note: Due to company compliance rules, I had to hide the table and column names!

Regarding HoodieIndexer: how can I rebuild my indexes?

ad1happy2go commented 1 year ago

@ingridymartinss Sorry for the confusion. As we are using the simple index, HoodieIndexer can't be used. The simple index doesn't write any index data anywhere.

Can you share the writer configurations and the hoodie timeline from the time when the interruption happened? Also, I see this is a non-partitioned table. We would like to check whether any cleaner or clustering run happened just before that.

Can you also try with Hudi 0.12.3?
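For anyone following along, the timeline check requested above can start from a plain listing of the table's `.hoodie` folder. The sketch below is a minimal, hypothetical helper (the `TimelineSketch` object and its functions are not part of Hudi). It assumes the instant file naming used by Hudi 0.x releases, where a commit leaves `<ts>.commit.requested`, `<ts>.inflight`, and finally `<ts>.commit`, and other actions such as `clean` use `<ts>.<action>.requested`, `<ts>.<action>.inflight`, and `<ts>.<action>`. It reports instants that never reached the completed state, which is what an abrupt stop would be expected to leave behind.

```scala
// Hypothetical helper, not a Hudi API: classifies file names from a .hoodie listing.
object TimelineSketch {
  sealed trait State
  case object Requested extends State
  case object Inflight extends State
  case object Completed extends State

  final case class Instant(timestamp: String, action: String, state: State)

  // Instant timestamps are assumed to be purely numeric strings.
  private def isTimestamp(s: String): Boolean = s.nonEmpty && s.forall(_.isDigit)

  // Parse one file name; anything that is not an instant file yields None.
  def parse(fileName: String): Option[Instant] =
    fileName.split('.').toList match {
      case ts :: rest if isTimestamp(ts) =>
        rest match {
          // Assumption: an inflight commit carries no action in its file name.
          case "inflight" :: Nil            => Some(Instant(ts, "commit", Inflight))
          case action :: "requested" :: Nil => Some(Instant(ts, action, Requested))
          case action :: "inflight" :: Nil  => Some(Instant(ts, action, Inflight))
          case action :: Nil                => Some(Instant(ts, action, Completed))
          case _                            => None
        }
      case _ => None
    }

  // Timestamps that have requested/inflight files but no completed file.
  def pending(fileNames: Seq[String]): Seq[String] = {
    val instants = fileNames.flatMap(parse)
    val completed = instants.collect { case Instant(ts, _, Completed) => ts }.toSet
    instants.map(_.timestamp).distinct.filterNot(completed.contains).sorted
  }
}
```

Calling `TimelineSketch.pending` on the file names from an `aws s3 ls` of the `.hoodie` folder gives the candidates to look at first. Any `clean` or `replacecommit` instants just before them are the cleaner and clustering activity asked about above.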

ingridymartinss commented 1 year ago

Hello @ad1happy2go! Yes, my table does not have any partition column. Unfortunately, I don't have the hoodie timeline anymore, but we have a class with some default configurations:

    hoodie_datasource_write_operation: Option[String] = Some("upsert"),
    hoodie_bulkinsert_shuffle_parallelism: Option[String] = None,
    hoodie_upsert_shuffle_parallelism: Option[String] = None,
    hoodie_datasource_write_partitionpath_field: Option[String] = None,
    hoodie_parquet_small_file_limit: String = "629145600",
    hoodie_parquet_max_file_size: String = "1073741824",
    hoodie_parquet_block_size: String = "629145600",
    hoodie_copyonwrite_record_size_estimate: String = "1024",
    hoodie_datasource_write_precombine_field: String = "deduplicationColumn",
    hoodie_datasource_write_recordkey_field: String = "columnPK",
    hoodie_datasource_write_keygenerator_class: String = "org.apache.hudi.keygen.SimpleKeyGenerator",
    hoodie_datasource_write_table_name: String = "tableName",
    hoodie_datasource_hivesync_table: String = "false",
    hoodie_datasource_hivesync_database: String = "false",
    hoodie_datasource_hivesync_enable: String = "false",
    hoodie_datasource_hivesync_partitionextractorclass: String = "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    hoodie_datasource_hivesync_partitionfields: Option[String] = None,
    hoodie_datasource_hivesync_jdbcurl: String = s"jdbc:hive2://${hiveUrl}:${port}",
    hoodie_datasource_write_hivestylepartitioning: String = "false",
    hoodie_datasource_write_row_writer_enable: Option[String] = Some("false"),
    hoodie_datasource_hivesync_supporttimestamp: String = "true",
    hoodie_clean_async: String = "true",
    hoodie_cleaner_commits_retained: String = "36",
    hoodie_keep_min_commits: String = "37",
    hoodie_keep_max_commits: String = "38",
    hoodie_fail_on_timeline_archiving: String = "false",
    hoodie_datasource_hivesync_autocreatedatabase: String = "true",
    hoodie_bulkinsert_sort_mode: String = "GLOBAL_SORT",
    hoodie_datasource_hivesync_mode: String = "hms"
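
For readers comparing these defaults with the Hudi documentation, here is a minimal sketch of how a subset of the fields above might translate into Hudi option keys. The object name, the choice of subset, and the mapping of `hoodie_datasource_write_table_name` to `hoodie.table.name` are assumptions made for illustration. The class that actually builds the options in this pipeline is not shown in the thread. The small check at the end encodes the usual expectation that archival keeps more commits than the cleaner retains.

```scala
// Hypothetical sketch: names and the selected subset are illustrative only.
object HudiWriteOptionsSketch {
  // Hudi option keys for some of the defaults listed above.
  val options: Map[String, String] = Map(
    // Assumed to correspond to hoodie_datasource_write_table_name.
    "hoodie.table.name" -> "tableName",
    "hoodie.datasource.write.operation" -> "upsert",
    "hoodie.datasource.write.recordkey.field" -> "columnPK",
    "hoodie.datasource.write.precombine.field" -> "deduplicationColumn",
    "hoodie.datasource.write.keygenerator.class" ->
      "org.apache.hudi.keygen.SimpleKeyGenerator",
    "hoodie.parquet.small.file.limit" -> "629145600",
    "hoodie.parquet.max.file.size" -> "1073741824",
    "hoodie.clean.async" -> "true",
    "hoodie.cleaner.commits.retained" -> "36",
    "hoodie.keep.min.commits" -> "37",
    "hoodie.keep.max.commits" -> "38"
  )

  // True when cleaner retention < archival minimum < archival maximum.
  def retentionIsConsistent(opts: Map[String, String]): Boolean = {
    val retained = opts("hoodie.cleaner.commits.retained").toInt
    val minKeep = opts("hoodie.keep.min.commits").toInt
    val maxKeep = opts("hoodie.keep.max.commits").toInt
    retained < minKeep && minKeep < maxKeep
  }
}
```

With the Spark datasource writer, a map like this would typically be passed as `df.write.format("hudi").options(HudiWriteOptionsSketch.options).mode("append").save(basePath)`, where `df` and `basePath` are placeholders for the pipeline's own DataFrame and table path.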

ad1happy2go commented 1 year ago

@ingridymartinss Sorry for the delay here. Were you able to resolve this issue? If yes, please share your insights. If not, let us know and I will work on this. Thanks.

ingridymartinss commented 1 year ago

We haven't solved it yet. :(

ad1happy2go commented 1 year ago

Thanks for the response, @ingridymartinss. We will work on this.

ad1happy2go commented 10 months ago

@ingridymartinss Sorry for the delay. I tried to reproduce this error with the configs you provided but was unable to reproduce it with a 1 TB dataset. I also tried failing the job midway but never hit any issue. Can you reproduce this on a sample dataset? Can you share any more information that could help me reproduce it?

apache / hudi

[SUPPORT] Error when running a pipeline after an interrupt #9518