apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Metaserver read/write errors #9814

Open Limess opened 1 year ago

Limess commented 1 year ago

Describe the problem you faced

See this Slack thread; I was told to raise an issue. I don't have much time to debug this, as the upgrade isn't essential right now.

After upgrading Hudi from 0.12.1 to 0.13.1 via an EMR upgrade, I’m seeing a lot of these when using the Spark writer:

23/09/25 16:51:57 INFO RemoteHoodieTableFileSystemView: Sending request : (http://ip-10-0-107-14.eu-west-1.compute.internal:38427/v1/hoodie/view/datafiles/beforeoron/latest/?partition=story_published_partition_date%3D2023-08-26&maxinstant=20230925101228159&basepath=s3%3A%2F%2Fprod-signal-articles-store%2Farticles_hudi_copy_on_write&lastinstantts=20230925142837150&timelinehash=839a7f3760bd309b411eecb46f32635c0eb8d06daac3fba349cb7713a6a698c7)
23/09/25 16:52:36 INFO RetryExec: I/O exception (org.apache.hudi.org.apache.http.NoHttpResponseException) caught when processing request to {}->http://ip-10-0-107-14.eu-west-1.compute.internal:38427/: The target server failed to respond
23/09/25 16:52:36 INFO RetryExec: Retrying request to {}->http://ip-10-0-107-14.eu-west-1.compute.internal:38427/
23/09/25 16:53:06 INFO RetryExec: I/O exception (org.apache.hudi.org.apache.http.NoHttpResponseException) caught when processing request to {}->http://ip-10-0-107-14.eu-west-1.compute.internal:38427/: The target server failed to respond
23/09/25 16:53:06 INFO RetryExec: Retrying request to {}->http://ip-10-0-107-14.eu-west-1.compute.internal:38427/
23/09/25 16:53:36 INFO RetryExec: I/O exception (org.apache.hudi.org.apache.http.NoHttpResponseException) caught when processing request to {}->http://ip-10-0-107-14.eu-west-1.compute.internal:38427/: The target server failed to respond
23/09/25 16:53:36 INFO RetryExec: Retrying request to {}->http://ip-10-0-107-14.eu-west-1.compute.internal:38427/
23/09/25 16:54:07 WARN RetryHelper: Catch Exception for Sending request, will retry after 100 ms.
org.apache.hudi.org.apache.http.NoHttpResponseException: ip-10-0-107-14.eu-west-1.compute.internal:38427 failed to respond

I’ve enabled retries, but they seem to be slowing down various write tasks a lot as they retry and fall back to secondary methods. Why would this be happening? Between these and the seemingly slower bloom filter lookups, jobs are taking 2x longer or more.
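
To clarify what "enabled retries" means here, a minimal sketch of the retry-related flags in play (the two enabled keys are the ones set in our hudi-defaults further down; the commented interval/count keys are assumptions from the Hudi config reference and would need verifying against the exact version):

retry_options = {
    "hoodie.filesystem.operation.retry.enable": "true",    # set in hudi-defaults below
    "hoodie.filesystem.view.remote.retry.enable": "true",   # set in hudi-defaults below
    # Assumed key names; the defaults appear to match the "retry after 100 ms" /
    # "retried 3 times" behaviour in the RetryHelper logs above.
    # "hoodie.filesystem.view.remote.retry.initial_interval_ms": "100",
    # "hoodie.filesystem.view.remote.retry.max_numbers": "3",
}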

I'm unsure if these correspond to the following warnings in the driver logs:

WARN RequestHandler: Bad request response due to client view behind server view. Last known instant from client was 20230925142837150 but server has the following timeline [[20230405172930640__rollback__COMPLETED], [20230405220408317__rollback__COMPLETED], [20230405230726307__rollback__COMPLETED], [20230406004821619__rollback__COMPLETED], [20230406022626456__rollback__COMPLETED], [20230406040217179__rollback__COMPLETED], [20230406053604634__rollback__COMPLETED], [20230406071500195__rollback__COMPLETED], [20230406085932605__rollback__COMPLETED], [20230406091145473__rollback__COMPLETED], [20230904040946183__rollback__COMPLETED], [20230904200935082__rollback__COMPLETED], [20230905102904696__rollback__COMPLETED], [20230920120910043__commit__COMPLETED], [20230920161015352__commit__COMPLETED], [20230920200916636__commit__COMPLETED], [20230921000922099__commit__COMPLETED], [20230921040951133__commit__COMPLETED], [20230921081133533__commit__COMPLETED], [20230921081136531__clean__COMPLETED], [20230921120938905__commit__COMPLETED], [20230921120941970__clean__COMPLETED], [20230921161019209__commit__COMPLETED], [20230921161022485__clean__COMPLETED], [20230921200920596__commit__COMPLETED], [20230921200923858__clean__COMPLETED], [20230922001011936__commit__COMPLETED], [20230922001014953__clean__COMPLETED], [20230922040943645__commit__COMPLETED], [20230922040946795__clean__COMPLETED], [20230922080911829__commit__COMPLETED], [20230922080915209__clean__COMPLETED], [20230922120928185__commit__COMPLETED], [20230922120931568__clean__COMPLETED], [20230922161014635__commit__COMPLETED], [20230922161017634__clean__COMPLETED], [20230922200911764__commit__COMPLETED], [20230922200914501__clean__COMPLETED], [20230923000928118__commit__COMPLETED], [20230923000931194__clean__COMPLETED], [20230923040937860__commit__COMPLETED], [20230923040940748__clean__COMPLETED], [20230923080919659__commit__COMPLETED], [20230923080922740__clean__COMPLETED], [20230923120913393__commit__COMPLETED], [20230923120916656__clean__COMPLETED], [20230923160937358__commit__COMPLETED], [20230923160940858__clean__COMPLETED], [20230923200914761__commit__COMPLETED], [20230923200917719__clean__COMPLETED], [20230924000958223__commit__COMPLETED], [20230924001001271__clean__COMPLETED], [20230924040915658__commit__COMPLETED], [20230924040918676__clean__COMPLETED], [20230924080919687__commit__COMPLETED], [20230924080922913__clean__COMPLETED], [20230924120907571__commit__COMPLETED], [20230924120910946__clean__COMPLETED], [20230924160910339__commit__COMPLETED], [20230924160913410__clean__COMPLETED], [20230924200912759__commit__COMPLETED], [20230924200915964__clean__COMPLETED], [20230925000926377__commit__COMPLETED], [20230925000931547__clean__COMPLETED], [20230925041024449__commit__COMPLETED], [20230925041027798__clean__COMPLETED], [20230925080953746__commit__COMPLETED], [20230925080957003__clean__COMPLETED], [20230925101228159__commit__COMPLETED], [20230925101231993__clean__COMPLETED], [20230925114607821__clean__COMPLETED], [20230925142837150__rollback__COMPLETED], [20230925161210335__rollback__COMPLETED]]
23/09/25 17:12:41 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20230925161210335__rollback__COMPLETED]}

I’m also seeing similar errors on writes:

Caused by: org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file story_published_partition_date=2023-01-06/47d20ede-bbbe-4cd9-91d1-41993c76752a-0_668-25-96261_20230925161205373.parquet.marker.MERGE
ip-10-0-107-14.eu-west-1.compute.internal:38427 failed to respond

I had to roll back the upgrade as it was causing writes to fail (in addition to the successful writes taking 2x as long).

To Reproduce

Unknown

Expected behavior

Performance should not degrade after upgrading.

Environment Description

Additional context

Upgrading from EMR emr-6.9.0 to emr-6.13.0.

This affected both tables we ingest: write times increased 2x for each cluster when writes succeeded, and large writes failed outright.

EMR config:

[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "false"
    }
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.default.parallelism": "6712",
      "spark.driver.cores": "4",
      "spark.driver.extraJavaOptions": "-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35",
      "spark.driver.memory": "25g",
      "spark.driver.memoryOverhead": "3g",
      "spark.dynamicAllocation.enabled": "false",
      "spark.executor.cores": "4",
      "spark.executor.extraJavaOptions": "-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35",
      "spark.executor.instances": "839",
      "spark.executor.memory": "25g",
      "spark.executor.memoryOverhead": "3g",
      "spark.executor.processTreeMetrics.enabled": "true",
      "spark.executorEnv.PEX_INHERIT_PATH": "fallback",
      "spark.kryoserializer.buffer.max": "256m",
      "spark.metrics.namespace": "spark",
      "spark.rdd.compress": "true",
      "spark.scheduler.mode": "FAIR",
      "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
      "spark.shuffle.service.enabled": "true",
      "spark.sql.adaptive.coalescePartitions.enabled": "true",
      "spark.sql.shuffle.partitions": "6712",
      "spark.task.maxFailures": "10",
      "spark.ui.prometheus.enabled": "true",
      "spark.yarn.appMasterEnv.PEX_INHERIT_PATH": "fallback",
      "spark.yarn.max.executor.failures": "100",
      "spark.yarn.maxAppAttempts": "1"
    }
  },
  {
    "Classification": "spark-log4j2",
    "Properties": {
      "logger.hudi.level": "INFO",
      "logger.hudi.name": "org.apache.hudi"
    }
  },
  {
    "Classification": "spark-metrics",
    "Properties": {
      "*.sink.prometheusServlet.class": "org.apache.spark.metrics.sink.PrometheusServlet",
      "*.sink.prometheusServlet.path": "/metrics/prometheus",
      "applications.sink.prometheusServlet.path": "/metrics/applications/prometheus",
      "driver.source.jvm.class": "org.apache.spark.metrics.source.JvmSource",
      "executor.source.jvm.class": "org.apache.spark.metrics.source.JvmSource",
      "master.sink.prometheusServlet.path": "/metrics/master/prometheus",
      "master.source.jvm.class": "org.apache.spark.metrics.source.JvmSource",
      "worker.source.jvm.class": "org.apache.spark.metrics.source.JvmSource"
    }
  },
  {
    "Classification": "capacity-scheduler",
    "Properties": {
      "yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator "
    }
  },
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage": "99.0",
      "yarn.nodemanager.pmem-check-enabled": "false",
      "yarn.nodemanager.vmem-check-enabled": "false"
    }
  },
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  },
  {
    "Classification": "hdfs-site",
    "Properties": {
      "dfs.replication": "2"
    }
  },
  {
    "Classification": "presto-connector-hive",
    "Properties": {
      "hive.metastore.glue.datacatalog.enabled": "true",
      "hive.parquet.use-column-names": "true"
    }
  },
  {
    "Classification": "spark-hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  },
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "./data_platform_spark_jobs.pex"
        }
      }
    ],
    "Properties": {}
  },
  {
    "Classification": "hadoop-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "HADOOP_DATANODE_OPTS": "-javaagent:/etc/prometheus/jmx_prometheus_javaagent.jar=7001:/etc/hadoop/conf/hdfs_jmx_config_datanode.yaml -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=50103",
          "HADOOP_NAMENODE_OPTS": "-javaagent:/etc/prometheus/jmx_prometheus_javaagent.jar=7001:/etc/hadoop/conf/hdfs_jmx_config_namenode.yaml -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=50103"
        }
      }
    ],
    "Properties": {}
  },
  {
    "Classification": "yarn-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "YARN_NODEMANAGER_OPTS": "-javaagent:/etc/prometheus/jmx_prometheus_javaagent.jar=7005:/etc/hadoop/conf/yarn_jmx_config_node_manager.yaml -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=50111",
          "YARN_RESOURCEMANAGER_OPTS": "-javaagent:/etc/prometheus/jmx_prometheus_javaagent.jar=7005:/etc/hadoop/conf/yarn_jmx_config_resource_manager.yaml -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=50111"
        }
      }
    ],
    "Properties": {}
  },
  {
    "Classification": "hudi-defaults",
    "Properties": {
      "hoodie.archive.async": "true",
      "hoodie.bulkinsert.shuffle.parallelism": "6712",
      "hoodie.bulkinsert.sort.mode": "GLOBAL_SORT",
      "hoodie.clean.async": "true",
      "hoodie.cleaner.commits.retained": "1",
      "hoodie.cleaner.policy.failed.writes": "LAZY",
      "hoodie.datasource.hive_sync.support_timestamp": "true",
      "hoodie.delete.shuffle.parallelism": "6712",
      "hoodie.enable.data.skipping": "true",
      "hoodie.filesystem.operation.retry.enable": "true",
      "hoodie.filesystem.view.remote.retry.enable": "true",
      "hoodie.insert.shuffle.parallelism": "6712",
      "hoodie.metadata.index.bloom.filter.enable": "true",
      "hoodie.metadata.index.column.stats.enable": "true",
      "hoodie.metrics.on": "true",
      "hoodie.metrics.reporter.type": "PROMETHEUS",
      "hoodie.parquet.compression.codec": "snappy",
      "hoodie.parquet.max.file.size": "536870912",
      "hoodie.parquet.small.file.limit": "429496729",
      "hoodie.upsert.shuffle.parallelism": "6712",
      "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
      "hoodie.write.lock.dynamodb.billing_mode": "PAY_PER_REQUEST",
      "hoodie.write.lock.dynamodb.endpoint_url": "dynamodb.eu-west-1.amazonaws.com",
      "hoodie.write.lock.dynamodb.region": "eu-west-1",
      "hoodie.write.lock.dynamodb.table": "data-platform-hudi-locks",
      "hoodie.write.lock.provider": "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider"
    }
  }
]

Additional Hudi config:

hoodie.datasource.write.recordkey.field=id
hoodie.datasource.write.precombine.field=version

hoodie.datasource.write.partitionpath.field=story_published_partition_date
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
hoodie.datasource.write.hive_style_partitioning=true

hoodie.avro.schema.validate=true
hoodie.datasource.write.reconcile.schema=false

hoodie.table.name=${TABLE_NAME}

hoodie.datasource.hive_sync.enable=true
hoodie.datasource.hive_sync.database=articles
hoodie.datasource.hive_sync.table=${TABLE_NAME}
hoodie.datasource.hive_sync.partition_fields=story_published_partition_date

hoodie.write.lock.dynamodb.partition_key=${TABLE_NAME}

hoodie.bloom.index.prune.by.ranges=false

hoodie.index.type=BLOOM
hoodie.metadata.enable=true
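
For context, a minimal sketch of how we invoke the Spark writer (not the actual job: the session, input DataFrame and staging path are placeholders, the table name stands in for ${TABLE_NAME}, and the options simply mirror the properties above; the target base path is taken from the request logs):

from pyspark.sql import SparkSession

# Placeholder session and input; the real job builds the DataFrame from upstream sources.
spark = SparkSession.builder.appName("write_hudi_table").getOrCreate()
df = spark.read.parquet("s3://example-staging-bucket/articles/")  # placeholder path

hudi_options = {
    "hoodie.table.name": "articles_hudi_copy_on_write",  # placeholder for ${TABLE_NAME}
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "version",
    "hoodie.datasource.write.partitionpath.field": "story_published_partition_date",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.SimpleKeyGenerator",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.index.type": "BLOOM",
    "hoodie.metadata.enable": "true",
}

(
    df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://prod-signal-articles-store/articles_hudi_copy_on_write")
)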

[Screenshot: Spark UI from the EMR persistent application logs]

The hours number above seems to have become nonsense (this is from the persistent Spark logs on EMR).

[Screenshot 2023-10-02 09:35:16: write_hudi_table - Details for Job 16]
[Screenshot 2023-10-02 09:35:29: write_hudi_table - Details for Stage 25 (Attempt 0)]

Limess commented 3 months ago

After upgrading to 0.14.1 (EMR 7.1.0) this is still occurring. It didn't happen until we enabled the metadata table, at which point the timeline server becomes a bottleneck and a source of failures.

This only occurs during our "backfill" job, where we rewrite most of the table; our incremental loads on small clusters don't exhibit this issue.

Any recommendations to mitigate this? The cluster is pretty large, using the config below:

  1. Is the driver too small for the cluster size?
  2. Any way we can verify that using logs/metrics?
  3. Would altering hoodie.embed.timeline.server.threads help? It seems this is fixed at 200 threads regardless of anything else, despite the documentation suggesting it's variable (see the sketch after this list).
  4. Would enabling hoodie.embed.timeline.server.async help?
  5. Anything else we can tweak?
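
To frame questions 3 and 4, a hedged sketch of the timeline-server knobs I would consider tuning (values are illustrative, not recommendations; the timeout key name is an assumption from the Hudi config reference and should be verified for 0.14.x):

timeline_server_options = {
    "hoodie.embed.timeline.server": "true",
    "hoodie.embed.timeline.server.threads": "400",  # currently appears fixed at 200 regardless of cluster size
    "hoodie.embed.timeline.server.async": "true",
    "hoodie.filesystem.view.remote.timeout.secs": "600",  # assumed key name
}

# These would be merged into the writer options alongside the table config, e.g.
# df.write.format("hudi").options(**{**hudi_options, **timeline_server_options}) ...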

From what I can tell based on OS-level metrics, the driver node (r6g.4xlarge, driver + 3 executors) is barely doing anything CPU-wise; it seems to be using < 10% CPU.

[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "false"
    }
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.default.parallelism": "3352",
      "spark.driver.cores": "4",
      "spark.driver.extraJavaOptions": "-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35",
      "spark.driver.memory": "25g",
      "spark.driver.memoryOverhead": "3g",
      "spark.dynamicAllocation.enabled": "false",
      "spark.executor.cores": "4",
      "spark.executor.extraJavaOptions": "-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35",
      "spark.executor.instances": "419",
      "spark.executor.maxNumFailures": "100",
      "spark.executor.memory": "25g",
      "spark.executor.memoryOverhead": "3g",
      "spark.executor.processTreeMetrics.enabled": "true",
      "spark.executorEnv.PEX_INHERIT_PATH": "fallback",
      "spark.hadoop.fs.s3.connection.maximum": "1000",
      "spark.hadoop.fs.s3a.connection.maximum": "1000",
      "spark.kryoserializer.buffer.max": "256m",
      "spark.metrics.namespace": "spark",
      "spark.rdd.compress": "true",
      "spark.scheduler.mode": "FAIR",
      "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
      "spark.shuffle.service.enabled": "true",
      "spark.sql.adaptive.coalescePartitions.enabled": "true",
      "spark.sql.shuffle.partitions": "3352",
      "spark.task.maxFailures": "10",
      "spark.ui.prometheus.enabled": "true",
      "spark.yarn.appMasterEnv.PEX_INHERIT_PATH": "fallback",
      "spark.yarn.maxAppAttempts": "1"
    }
  },
  {
    "Classification": "spark-log4j2",
    "Properties": {
      "logger.hudi.level": "INFO",
      "logger.hudi.name": "org.apache.hudi"
    }
  },
  {
    "Classification": "spark-metrics",
    "Properties": {
      "*.sink.prometheusServlet.class": "org.apache.spark.metrics.sink.PrometheusServlet",
      "*.sink.prometheusServlet.path": "/metrics/prometheus",
      "applications.sink.prometheusServlet.path": "/metrics/applications/prometheus",
      "driver.source.jvm.class": "org.apache.spark.metrics.source.JvmSource",
      "executor.source.jvm.class": "org.apache.spark.metrics.source.JvmSource",
      "master.sink.prometheusServlet.path": "/metrics/master/prometheus",
      "master.source.jvm.class": "org.apache.spark.metrics.source.JvmSource",
      "worker.source.jvm.class": "org.apache.spark.metrics.source.JvmSource"
    }
  },
  {
    "Classification": "capacity-scheduler",
    "Properties": {
      "yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator "
    }
  },
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage": "99.0",
      "yarn.nodemanager.pmem-check-enabled": "false",
      "yarn.nodemanager.vmem-check-enabled": "false"
    }
  },
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.maxConnections": "1000"
    }
  },
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  },
  {
    "Classification": "hdfs-site",
    "Properties": {
      "dfs.replication": "2"
    }
  },
  {
    "Classification": "presto-connector-hive",
    "Properties": {
      "hive.metastore.glue.datacatalog.enabled": "true",
      "hive.parquet.use-column-names": "true"
    }
  },
  {
    "Classification": "spark-hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  },
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "./data_platform_spark_jobs.pex"
        }
      }
    ],
    "Properties": {}
  },
  {
    "Classification": "hadoop-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "HADOOP_DATANODE_OPTS": "-javaagent:/etc/prometheus/jmx_prometheus_javaagent.jar=7001:/etc/hadoop/conf/hdfs_jmx_config_datanode.yaml -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=50103",
          "HADOOP_NAMENODE_OPTS": "-javaagent:/etc/prometheus/jmx_prometheus_javaagent.jar=7001:/etc/hadoop/conf/hdfs_jmx_config_namenode.yaml -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=50103"
        }
      }
    ],
    "Properties": {}
  },
  {
    "Classification": "yarn-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "YARN_NODEMANAGER_OPTS": "-javaagent:/etc/prometheus/jmx_prometheus_javaagent.jar=7005:/etc/hadoop/conf/yarn_jmx_config_node_manager.yaml -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=50111",
          "YARN_RESOURCEMANAGER_OPTS": "-javaagent:/etc/prometheus/jmx_prometheus_javaagent.jar=7005:/etc/hadoop/conf/yarn_jmx_config_resource_manager.yaml -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=50111"
        }
      }
    ],
    "Properties": {}
  },
  {
    "Classification": "hudi-defaults",
    "Properties": {
      "hoodie.archive.async": "true",
      "hoodie.bulkinsert.sort.mode": "GLOBAL_SORT",
      "hoodie.clean.async": "true",
      "hoodie.cleaner.commits.retained": "1",
      "hoodie.cleaner.policy.failed.writes": "LAZY",
      "hoodie.datasource.hive_sync.support_timestamp": "true",
      "hoodie.datasource.meta.sync.glue.metadata_file_listing": "true",
      "hoodie.enable.data.skipping": "true",
      "hoodie.filesystem.view.remote.retry.enable": "true",
      "hoodie.keep.max.commits": "15",
      "hoodie.keep.min.commits": "10",
      "hoodie.metadata.index.bloom.filter.enable": "true",
      "hoodie.metadata.index.column.stats.enable": "true",
      "hoodie.metrics.on": "true",
      "hoodie.metrics.reporter.type": "PROMETHEUS",
      "hoodie.parquet.compression.codec": "snappy",
      "hoodie.parquet.max.file.size": "536870912",
      "hoodie.parquet.small.file.limit": "429496729",
      "hoodie.write.concurrency.early.conflict.detection.enable": "true",
      "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
      "hoodie.write.lock.dynamodb.billing_mode": "PAY_PER_REQUEST",
      "hoodie.write.lock.dynamodb.endpoint_url": "dynamodb.eu-west-1.amazonaws.com",
      "hoodie.write.lock.dynamodb.region": "eu-west-1",
      "hoodie.write.lock.dynamodb.table": "data-platform-hudi-locks",
      "hoodie.write.lock.provider": "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider"
    }
  }
]

with some additional dataset-specific config:

hoodie.datasource.write.recordkey.field=id
hoodie.datasource.write.precombine.field=version

hoodie.datasource.write.partitionpath.field=story_published_partition_date
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
hoodie.datasource.write.hive_style_partitioning=true

hoodie.avro.schema.validate=true
hoodie.datasource.write.reconcile.schema=false

hoodie.table.name=${TABLE_NAME}

hoodie.datasource.hive_sync.enable=true
hoodie.datasource.hive_sync.database=articles
hoodie.datasource.hive_sync.table=${TABLE_NAME}
hoodie.datasource.hive_sync.partition_fields=story_published_partition_date

hoodie.write.lock.dynamodb.partition_key=${TABLE_NAME}

# as the record key is random, don't try to prune by ranges
hoodie.bloom.index.prune.by.ranges=false

hoodie.index.type=RECORD_INDEX
hoodie.metadata.enable=true
hoodie.metadata.record.index.enable=true

Limess commented 3 months ago

Example of logs on 0.14.1

24/07/08 03:56:35 INFO RetryExec: I/O exception (org.apache.hudi.org.apache.http.NoHttpResponseException) caught when processing request to {}->http://ip-10-0-100-87.eu-west-1.compute.internal:39431: The target server failed to respond
24/07/08 03:56:35 INFO RetryExec: Retrying request to {}->http://ip-10-0-100-87.eu-west-1.compute.internal:39431
24/07/08 03:56:35 INFO RetryExec: I/O exception (org.apache.hudi.org.apache.http.NoHttpResponseException) caught when processing request to {}->http://ip-10-0-100-87.eu-west-1.compute.internal:39431: The target server failed to respond
24/07/08 03:56:35 INFO RetryExec: Retrying request to {}->http://ip-10-0-100-87.eu-west-1.compute.internal:39431
24/07/08 03:56:35 INFO RetryExec: I/O exception (org.apache.hudi.org.apache.http.NoHttpResponseException) caught when processing request to {}->http://ip-10-0-100-87.eu-west-1.compute.internal:39431: The target server failed to respond
24/07/08 03:56:35 INFO RetryExec: Retrying request to {}->http://ip-10-0-100-87.eu-west-1.compute.internal:39431
24/07/08 03:56:55 INFO RetryExec: I/O exception (org.apache.hudi.org.apache.http.NoHttpResponseException) caught when processing request to {}->http://ip-10-0-100-87.eu-west-1.compute.internal:39431: The target server failed to respond
24/07/08 03:56:55 INFO RetryExec: Retrying request to {}->http://ip-10-0-100-87.eu-west-1.compute.internal:39431
24/07/08 03:57:34 INFO RetryExec: I/O exception (org.apache.hudi.org.apache.http.NoHttpResponseException) caught when processing request to {}->http://ip-10-0-100-87.eu-west-1.compute.internal:39431: The target server failed to respond
24/07/08 03:57:34 INFO RetryExec: I/O exception (org.apache.hudi.org.apache.http.NoHttpResponseException) caught when processing request to {}->http://ip-10-0-100-87.eu-west-1.compute.internal:39431: The target server failed to respond
24/07/08 03:57:34 INFO RetryExec: I/O exception (org.apache.hudi.org.apache.http.NoHttpResponseException) caught when processing request to {}->http://ip-10-0-100-87.eu-west-1.compute.internal:39431: The target server failed to respond
24/07/08 03:57:34 INFO RetryExec: Retrying request to {}->http://ip-10-0-100-87.eu-west-1.compute.internal:39431
24/07/08 03:57:34 INFO RetryExec: Retrying request to {}->http://ip-10-0-100-87.eu-west-1.compute.internal:39431
24/07/08 03:57:34 INFO RetryExec: Retrying request to {}->http://ip-10-0-100-87.eu-west-1.compute.internal:39431
24/07/08 04:01:55 WARN RetryHelper: Catch Exception for Sending request, will retry after 219 ms.
java.net.SocketTimeoutException: Read timed out
    at sun.nio.ch.NioSocketImpl.timedRead(NioSocketImpl.java:288) ~[?:?]
    at sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:314) ~[?:?]
    at sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:355) ~[?:?]
    at sun.nio.ch.NioSocketImpl$1.read(NioSocketImpl.java:808) ~[?:?]
    at java.net.Socket$SocketInputStream.read(Socket.java:966) ~[?:?]
    at org.apache.hudi.org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:280) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:157) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.client.fluent.Request.internalExecute(Request.java:173) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.client.fluent.Request.execute(Request.java:177) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.get(RemoteHoodieTableFileSystemView.java:629) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.lambda$executeRequest$a89da1c0$1(RemoteHoodieTableFileSystemView.java:207) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.common.util.RetryHelper.start(RetryHelper.java:84) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.executeRequest(RemoteHoodieTableFileSystemView.java:207) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getLatestBaseFile(RemoteHoodieTableFileSystemView.java:618) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.common.table.view.PriorityBasedFileSystemView.execute(PriorityBasedFileSystemView.java:100) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.common.table.view.PriorityBasedFileSystemView.getLatestBaseFile(PriorityBasedFileSystemView.java:157) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:362) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:335) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:257) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:907) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:907) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:378) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.storage.BlockManager.$anonfun$getOrElseUpdate$1(BlockManager.scala:1372) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1618) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1528) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1592) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1389) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.storage.BlockManager.getOrElseUpdateRDDBlock(BlockManager.scala:1343) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:376) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:326) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.scheduler.Task.run(Task.scala:143) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:629) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64) [spark-common-utils_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61) [spark-common-utils_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:95) [spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:632) [spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
    at java.lang.Thread.run(Thread.java:840) [?:?]
24/07/08 04:02:34 WARN RetryHelper: Catch Exception for Sending request, will retry after 254 ms.
java.net.SocketTimeoutException: Read timed out
    at sun.nio.ch.NioSocketImpl.timedRead(NioSocketImpl.java:288) ~[?:?]
    at sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:314) ~[?:?]
    at sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:355) ~[?:?]
    at sun.nio.ch.NioSocketImpl$1.read(NioSocketImpl.java:808) ~[?:?]
    at java.net.Socket$SocketInputStream.read(Socket.java:966) ~[?:?]
    at org.apache.hudi.org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:280) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:157) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.client.fluent.Request.internalExecute(Request.java:173) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.client.fluent.Request.execute(Request.java:177) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.get(RemoteHoodieTableFileSystemView.java:629) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.lambda$executeRequest$a89da1c0$1(RemoteHoodieTableFileSystemView.java:207) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.common.util.RetryHelper.start(RetryHelper.java:84) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.executeRequest(RemoteHoodieTableFileSystemView.java:207) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getLatestBaseFile(RemoteHoodieTableFileSystemView.java:618) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.common.table.view.PriorityBasedFileSystemView.execute(PriorityBasedFileSystemView.java:100) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.common.table.view.PriorityBasedFileSystemView.getLatestBaseFile(PriorityBasedFileSystemView.java:157) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:362) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:335) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:257) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:907) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:907) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:378) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.storage.BlockManager.$anonfun$getOrElseUpdate$1(BlockManager.scala:1372) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1618) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1528) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1592) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1389) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.storage.BlockManager.getOrElseUpdateRDDBlock(BlockManager.scala:1343) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:376) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:326) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.scheduler.Task.run(Task.scala:143) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:629) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64) [spark-common-utils_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61) [spark-common-utils_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:95) [spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:632) [spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
    at java.lang.Thread.run(Thread.java:840) [?:?]
24/07/08 04:02:34 ERROR RetryHelper: Still failed to Sending request after retried 3 times.
java.net.SocketTimeoutException: Read timed out
    at sun.nio.ch.NioSocketImpl.timedRead(NioSocketImpl.java:288) ~[?:?]
    at sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:314) ~[?:?]
    at sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:355) ~[?:?]
    at sun.nio.ch.NioSocketImpl$1.read(NioSocketImpl.java:808) ~[?:?]
    at java.net.Socket$SocketInputStream.read(Socket.java:966) ~[?:?]
    at org.apache.hudi.org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:280) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:157) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.client.fluent.Request.internalExecute(Request.java:173) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.client.fluent.Request.execute(Request.java:177) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.get(RemoteHoodieTableFileSystemView.java:629) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.lambda$executeRequest$a89da1c0$1(RemoteHoodieTableFileSystemView.java:207) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.common.util.RetryHelper.start(RetryHelper.java:84) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.executeRequest(RemoteHoodieTableFileSystemView.java:207) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getLatestBaseFile(RemoteHoodieTableFileSystemView.java:618) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.common.table.view.PriorityBasedFileSystemView.execute(PriorityBasedFileSystemView.java:100) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.common.table.view.PriorityBasedFileSystemView.getLatestBaseFile(PriorityBasedFileSystemView.java:157) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:362) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:335) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:257) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:907) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:907) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:378) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.storage.BlockManager.$anonfun$getOrElseUpdate$1(BlockManager.scala:1372) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1618) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1528) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1592) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1389) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.storage.BlockManager.getOrElseUpdateRDDBlock(BlockManager.scala:1343) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:376) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:326) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.scheduler.Task.run(Task.scala:143) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:629) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64) [spark-common-utils_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61) [spark-common-utils_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:95) [spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:632) [spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
    at java.lang.Thread.run(Thread.java:840) [?:?]
24/07/08 04:02:34 ERROR PriorityBasedFileSystemView: Got error running preferred function. Trying secondary
org.apache.hudi.exception.HoodieRemoteException: Read timed out
    at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getLatestBaseFile(RemoteHoodieTableFileSystemView.java:622) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.common.table.view.PriorityBasedFileSystemView.execute(PriorityBasedFileSystemView.java:100) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.common.table.view.PriorityBasedFileSystemView.getLatestBaseFile(PriorityBasedFileSystemView.java:157) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:362) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:335) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:257) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:907) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:907) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:378) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.storage.BlockManager.$anonfun$getOrElseUpdate$1(BlockManager.scala:1372) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1618) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1528) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1592) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1389) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.storage.BlockManager.getOrElseUpdateRDDBlock(BlockManager.scala:1343) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:376) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:326) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.scheduler.Task.run(Task.scala:143) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:629) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64) [spark-common-utils_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61) [spark-common-utils_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:95) [spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:632) [spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
    at java.lang.Thread.run(Thread.java:840) [?:?]
Caused by: java.net.SocketTimeoutException: Read timed out
    at sun.nio.ch.NioSocketImpl.timedRead(NioSocketImpl.java:288) ~[?:?]
    at sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:314) ~[?:?]
    at sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:355) ~[?:?]
    at sun.nio.ch.NioSocketImpl$1.read(NioSocketImpl.java:808) ~[?:?]
    at java.net.Socket$SocketInputStream.read(Socket.java:966) ~[?:?]
    at org.apache.hudi.org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:280) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:157) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.client.fluent.Request.internalExecute(Request.java:173) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.client.fluent.Request.execute(Request.java:177) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.get(RemoteHoodieTableFileSystemView.java:629) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.lambda$executeRequest$a89da1c0$1(RemoteHoodieTableFileSystemView.java:207) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.common.util.RetryHelper.start(RetryHelper.java:84) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.executeRequest(RemoteHoodieTableFileSystemView.java:207) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getLatestBaseFile(RemoteHoodieTableFileSystemView.java:618) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    ... 37 more
Limess commented 3 months ago

I tried halving the cluster size and am still seeing the same errors and retries at a high volume (possibly somewhat lower, but still enough to cause failures).

It seems that the retries work in some cases for earlier stages (or at least fall back to direct access after failing), but they aren't applied later when writing marker files, leading to job failures (the retry settings in question are sketched after the trace below):

24/07/08 12:52:28 ERROR Executor: Exception in task 66.0 in stage 23.0 (TID 63746)
org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition :66
    at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:342) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:257) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:907) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:907) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:378) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.storage.BlockManager.$anonfun$getOrElseUpdate$1(BlockManager.scala:1372) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1618) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1528) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1592) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1389) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.storage.BlockManager.getOrElseUpdateRDDBlock(BlockManager.scala:1343) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:376) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:326) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.scheduler.Task.run(Task.scala:143) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:629) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64) ~[spark-common-utils_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61) ~[spark-common-utils_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:95) ~[spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:632) [spark-core_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
    at java.lang.Thread.run(Thread.java:840) [?:?]
Caused by: org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file story_published_partition_date=2023-09-20/58c52f71-df6b-4f33-9013-bb2f344ef19a-0_66-23-63746_20240708120437027.parquet.marker.MERGE
ip-10-0-107-246.eu-west-1.compute.internal:42607 failed to respond
    at org.apache.hudi.table.marker.TimelineServerBasedWriteMarkers.executeCreateMarkerRequest(TimelineServerBasedWriteMarkers.java:187) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.table.marker.TimelineServerBasedWriteMarkers.createWithEarlyConflictDetection(TimelineServerBasedWriteMarkers.java:160) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.table.marker.WriteMarkers.create(WriteMarkers.java:93) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.io.HoodieWriteHandle.createMarkerFile(HoodieWriteHandle.java:144) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.io.HoodieMergeHandle.init(HoodieMergeHandle.java:198) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.io.HoodieMergeHandle.<init>(HoodieMergeHandle.java:134) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.io.HoodieMergeHandle.<init>(HoodieMergeHandle.java:125) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.io.HoodieMergeHandleFactory.create(HoodieMergeHandleFactory.java:68) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.getUpdateHandle(BaseSparkCommitActionExecutor.java:400) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:368) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:335) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    ... 33 more
Caused by: org.apache.hudi.org.apache.http.NoHttpResponseException: ip-10-0-107-246.eu-west-1.compute.internal:42607 failed to respond
    at org.apache.hudi.org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:141) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:157) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.client.fluent.Request.internalExecute(Request.java:173) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.org.apache.http.client.fluent.Request.execute(Request.java:177) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.table.marker.TimelineServerBasedWriteMarkers.executeRequestToTimelineServer(TimelineServerBasedWriteMarkers.java:233) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.table.marker.TimelineServerBasedWriteMarkers.executeCreateMarkerRequest(TimelineServerBasedWriteMarkers.java:184) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.table.marker.TimelineServerBasedWriteMarkers.createWithEarlyConflictDetection(TimelineServerBasedWriteMarkers.java:160) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.table.marker.WriteMarkers.create(WriteMarkers.java:93) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.io.HoodieWriteHandle.createMarkerFile(HoodieWriteHandle.java:144) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.io.HoodieMergeHandle.init(HoodieMergeHandle.java:198) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.io.HoodieMergeHandle.<init>(HoodieMergeHandle.java:134) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.io.HoodieMergeHandle.<init>(HoodieMergeHandle.java:125) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.io.HoodieMergeHandleFactory.create(HoodieMergeHandleFactory.java:68) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.getUpdateHandle(BaseSparkCommitActionExecutor.java:400) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:368) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:335) ~[hudi-spark3.5-bundle_2.12-0.14.1-amzn-0.jar:0.14.1-amzn-0]
    ... 33 more
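For reference, the retries mentioned above are the remote file-system-view retry options, which is consistent with the traces: `RetryHelper` shows up in the `RemoteHoodieTableFileSystemView` path but not in the marker-creation path. Below is a minimal sketch of passing them to the Spark writer; the config names are taken from the Hudi docs, the values are illustrative, and `df`/`base_path` are placeholders.

```python
# Sketch only: remote file-system-view retry/timeout settings for the Hudi
# Spark writer. Values are illustrative; df and base_path are assumed to exist.
fsview_retry_options = {
    "hoodie.filesystem.view.remote.retry.enable": "true",
    "hoodie.filesystem.view.remote.retry.max.numbers": "5",
    "hoodie.filesystem.view.remote.retry.initial.interval.ms": "100",
    "hoodie.filesystem.view.remote.retry.max.interval.ms": "2000",
    "hoodie.filesystem.view.remote.timeout.secs": "600",
}

(
    df.write.format("hudi")
    .options(**fsview_retry_options)
    # ...plus the usual table name / record key / precombine / partition options...
    .mode("append")
    .save(base_path)
)
```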
Limess commented 3 months ago

I tried increasing hoodie.embed.timeline.server.threads to 500 and setting hoodie.embed.timeline.server.async to true (sketched below), but had the same issue.
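(For clarity, a sketch of how those two settings would be passed as writer options; the names are as written above and the values are the ones tried here.)

```python
# Embedded timeline server tuning attempted above (illustrative sketch only).
timeline_server_options = {
    "hoodie.embed.timeline.server.threads": "500",  # server worker threads
    "hoodie.embed.timeline.server.async": "true",   # asynchronous request handling
}
```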

danny0405 commented 3 months ago

It looks like your cluster network is a bottleneck. The default timeline server is an HTTP-based web server with a local filesystem view as a fallback. Did you try disabling the remote server entirely and using the local view instead?

Limess commented 3 months ago

The network doesn't seem to be saturated (these are r6g.4xlarge instances).

> It looks like your cluster network is a bottleneck. The default timeline server is an HTTP-based web server with a local filesystem view as a fallback. Did you try disabling the remote server entirely and using the local view instead?

How would I go about doing that? Is that "hoodie.embed.timeline.server": "false"?

Limess commented 3 months ago

Setting "hoodie.embed.timeline.server": "false" seems to have fixed the performance issues/failures.

I'd still like some guidance on the downsides of this setting and, if the downsides are significant, what we can do to resolve the original issue.
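For reference, a minimal sketch of the workaround as writer options (`df`/`base_path` are placeholders):

```python
# Workaround that resolved the failures in this thread: turn off the embedded
# timeline server so executors stop calling back to the driver-side HTTP endpoint.
(
    df.write.format("hudi")
    .option("hoodie.embed.timeline.server", "false")
    .mode("append")
    .save(base_path)
)
```

My current understanding of the potential downside is that, without the embedded timeline server, each executor builds its own file-system view (from the metadata table or by listing storage) rather than querying a shared cached view on the driver, which could mean more S3 requests and more memory per task on large tables. Confirmation of whether that is the main concern would be appreciated.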

danny0405 commented 3 months ago

cc @yihua for visibility.