apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Meta sync error when trying to write to s3 bucket #8777

Closed devanshguptatrepp closed 1 year ago

devanshguptatrepp commented 1 year ago

Describe the problem you faced

When trying to write data to Hudi, my Spark application fails with the following error:

java.lang.Exception: Could not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool
    at com.trepp.zone.ZoneExecutionHelper.upsert(ZoneExecutionHelper.scala:101)
    at com.trepp.zone.Presentation.$anonfun$writeHudiObject$1(Presentation.scala:92)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at scala.util.Try$.apply(Try.scala:213)
    at com.trepp.zone.Presentation.writeHudiObject(Presentation.scala:81)

I am reading data from an Amazon S3 bucket and applying some transformations before writing the data to Hudi. I am using hudi-spark-bundle_2.12-0.11.0.jar (the Scala 2.12 build) available on Maven Central.
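
For context, the write call looks roughly like this. This is only a sketch: the source path and the SparkSession setup are placeholders, and the Hudi option values are taken from the config properties listed further below.

// Minimal sketch of the failing write path (illustrative only).
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("hudi-upsert-sketch").getOrCreate()

// Hypothetical source path; the real job reads and transforms data before this step.
val df = spark.read.parquet("s3://bucketpath/raw/clo/goldenSetHoldings/")

df.write.format("hudi")
  .option("hoodie.table.name", "clogoldenSetHoldings")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
  .option("hoodie.datasource.write.recordkey.field", "instrumentid,asofperiod")
  .option("hoodie.datasource.write.precombine.field", "dl_change_seq")
  .option("hoodie.datasource.write.partitionpath.field", "year,month")
  .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.ComplexKeyGenerator")
  .mode(SaveMode.Overwrite)
  .save("s3://trepp-developmentservices-lake/presentationZone/clo/clogoldenSetHoldings")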

Expected behavior

The application should successfully write the data to the Hudi table location on S3.

Environment Description

Scala version : 2.12.15

Hudi version : 0.11.0

Spark version : 3.2.1

Hadoop version : 3.2.1

Storage (HDFS/S3/GCS..) : S3

Running on Docker? (yes/no) : no

Running on AWS EMR version: 6.7.0

Additional context

Spark submit commands:

  1. [Failing Script]: spark-submit --deploy-mode client --jars s3a://bucketpath/hudi/etl/hudi-spark-bundle_2.12-0.11.0.jar --driver-memory 6g --executor-memory 6g --executor-cores 4 --class com.trepp.TreppClient --master yarn --conf spark.files.maxPartitionBytes=268435456 --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 --conf spark.dynamicAllocation.enabled=true s3://bucketpath/etl/2.0-SNAPSHOT/etl-2.0-SNAPSHOT-jar-with-dependencies.jar -z presentation -a dataload -b bucketpath -k config/presentationzone/clo/goldenSetHoldings.json -w overwrite
  2. [Passing Script]: spark-submit --deploy-mode client --jars s3a://bucketpath/hudi/etl/hudi-spark-bundle_2.12-0.11.0.jar --driver-memory 6g --executor-memory 6g --executor-cores 4 --class com.trepp.TreppClient --master yarn --conf spark.files.maxPartitionBytes=268435456 --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 --conf spark.dynamicAllocation.enabled=true s3://bucketpath/etl/2.0-SNAPSHOT/etl-2.0-SNAPSHOT-jar-with-dependencies.jar -z presentation -a dataload -b bucketpath -k config/presentationzone/clo/accountbalances.json -w overwrite

Stacktrace

java.lang.Exception: Could not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool
    at com.trepp.zone.ZoneExecutionHelper.upsert(ZoneExecutionHelper.scala:122)
    at com.trepp.zone.Presentation.$anonfun$writeHudiObject$1(Presentation.scala:92)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at scala.util.Try$.apply(Try.scala:213)
    at com.trepp.zone.Presentation.writeHudiObject(Presentation.scala:81)
    at com.trepp.process.Executor.$anonfun$writeObject$2(Executor.scala:136)
    at com.trepp.process.Executor.$anonfun$writeObject$2$adapted(Executor.scala:133)
    at scala.collection.immutable.List.foreach(List.scala:431)
    at com.trepp.process.Executor.$anonfun$writeObject$1(Executor.scala:133)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at scala.util.Try$.apply(Try.scala:213)
    at com.trepp.process.Executor.writeObject(Executor.scala:133)
    at com.trepp.process.Executor$$anon$2.$anonfun$accept$2(Executor.scala:118)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at scala.util.Try$.apply(Try.scala:213)
    at com.trepp.process.Executor$$anon$2.accept(Executor.scala:113)
    at com.trepp.process.Executor$$anon$2.accept(Executor.scala:111)
    at java.util.TreeMap.forEach(TreeMap.java:1005)
    at com.trepp.process.Executor.executeQuery(Executor.scala:111)
    at com.trepp.dataload.EtlImpl.$anonfun$executeProcess$3(EtlImpl.scala:43)
    at scala.util.Try$.apply(Try.scala:213)
    at com.trepp.dataload.EtlImpl.$anonfun$executeProcess$1(EtlImpl.scala:37)
    at com.trepp.dataload.EtlImpl.$anonfun$executeProcess$1$adapted(EtlImpl.scala:23)
    at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
    at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
    at com.trepp.dataload.EtlImpl.executeProcess(EtlImpl.scala:23)
    at com.trepp.TreppClient$.$anonfun$main$1(TreppClient.scala:46)
    at scala.util.Try$.apply(Try.scala:213)
    at com.trepp.TreppClient$.main(TreppClient.scala:40)
    at com.trepp.TreppClient.main(TreppClient.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1000)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1089)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1098)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Here is the list of all the config properties that get applied:

"classname" -> "org.apache.hudi"
"hoodie.metadata.enable" -> "true"
"hoodie.datasource.write.streaming.ignore.failed.batch" -> "true"
"hoodie.populate.meta.fields" -> "true"
"hoodie.table.metadata.partitions" -> "files"
"hoodie.datasource.hive_sync.schema_string_length_thresh" -> "4000"
"hoodie.datasource.hive_sync.use_jdbc" -> "false"
"hoodie.write.lock.provider" -> "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider"
"hoodie.meta.sync.metadata_file_listing" -> "true"
"hoodie.cleaner.commits.retained" -> "1"
"hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.ComplexKeyGenerator"
"hoodie.timeline.layout.version" -> "1"
"hoodie.datasource.hive_sync.create_managed_table" -> "false"
"hoodie.table.checksum" -> "3012481716"
"hoodie.datasource.write.precombine.field" -> "dl_change_seq"
"hoodie.table.base.file.format" -> "PARQUET"
"hoodie.cleaner.policy.failed.writes" -> "EAGER"
"hoodie.table.timeline.timezone" -> "LOCAL"
"hoodie.datasource.write.recordkey.field" -> "instrumentid,asofperiod"
"hoodie.datasource.write.drop.partition.columns" -> "false"
"hoodie.datasource.meta.sync.base.path" -> "s3://trepp-developmentservices-lake/presentationZone/clo/clogoldenSetHoldings"
"hoodie.datasource.hive_sync.sync_as_datasource" -> "true"
"hoodie.datasource.hive_sync.password" -> "hive"
"hoodie.datasource.hive_sync.username" -> "hive"
"hoodie.clustering.async.enabled" -> "false"
"hoodie.payload.ordering.field" -> "dl_change_seq"
"hoodie.cleaner.policy" -> "KEEP_LATEST_COMMITS"
"hoodie.datasource.write.row.writer.enable" -> "true"
"hoodie.archivelog.folder" -> "archived"
"hoodie.datasource.hive_sync.base_file_format" -> "PARQUET"
"hoodie.write.lock.zookeeper.url" -> "ip-10-73-99-147.ec2.internal"
"hoodie.datasource.hive_sync.database" -> "presentation_dev"
"hoodie.datasource.hive_sync.table" -> "clogoldenSetHoldings"
"hoodie.datasource.write.commitmeta.key.prefix" -> "_"
"hoodie.table.version" -> "4"
"hoodie.table.type" -> "COPY_ON_WRITE"
"hoodie.datasource.meta.sync.enable" -> "false"
"hoodie.datasource.hive_sync.partition_fields" -> "year,month"
"hoodie.table.recordkey.fields" -> "instrumentid,asofperiod"
"hoodie.datasource.hive_sync.metastore.uris" -> "thrift://localhost:9083"
"hoodie.partition.metafile.use.base.format" -> "false"
"hoodie.datasource.write.operation" -> "upsert"
"hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor"
"hoodie.datasource.hive_sync.mode" -> "hms"
"hoodie.datasource.write.streaming.retry.interval.ms" -> "2000"
"hoodie.write.lock.zookeeper.port" -> "2181"
"hoodie.index.type" -> "BLOOM"
"hoodie.datasource.write.partitionpath.urlencode" -> "false"
"hoodie.datasource.write.table.type" -> "COPY_ON_WRITE"
"hoodie.table.partition.fields" -> "year,month"
"hoodie.write.concurrency.mode" -> "single_writer"
"hoodie.meta_sync.spark.version" -> "3.2.1-amzn-0"
"hoodie.database.name" -> ""
"hoodie.table.name" -> "clogoldenSetHoldings"
"hoodie.datasource.hive_sync.jdbcurl" -> "jdbc:hive2://localhost:10000"
"path" -> "s3://trepp-developmentservices-lake/presentationZone/clo/clogoldenSetHoldings"
"hoodie.meta.sync.client.tool.class" -> "org.apache.hudi.hive.HiveSyncTool"
"hoodie.datasource.write.reconcile.schema" -> "false"
"hoodie.clustering.inline" -> "false"
"hoodie.datasource.hive_sync.enable" -> "true"
"hoodie.datasource.write.streaming.retry.count" -> "3"
"hoodie.upsert.shuffle.parallelism" -> "64"
"hoodie.table.keygenerator.class" -> "org.apache.hudi.keygen.ComplexKeyGenerator"
"hoodie.datasource.compaction.async.enable" -> "true"
"hoodie.datasource.write.insert.drop.duplicates" -> "false"
"hoodie.write.lock.zookeeper.base_path" -> "/hudi"
"hoodie.table.precombine.field" -> "dl_change_seq"
"hoodie.datasource.write.partitionpath.field" -> "year,month"
"hoodie.datasource.write.payload.class" -> "org.apache.hudi.common.model.OverwriteWithLatestAvroPayload"
"hoodie.datasource.write.hive_style_partitioning" -> "true"
"hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled" -> "false"

ad1happy2go commented 1 year ago

@devanshguptatrepp Can you please provide the entire stack trace of the error?

devanshguptatrepp commented 1 year ago

SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/usr/lib/tez/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] 23/05/22 13:07:37 INFO TreppClient$: creating spark context and parameter loadingcom.trepp.TreppClient$ 23/05/22 13:07:37 INFO ApplicationFactory: Application has been found with name dataloadfor classcom.trepp.dataload.EtlImpl@36bed37a 23/05/22 13:07:38 INFO HiveConf: Found configuration file file:/etc/spark/conf.dist/hive-site.xml 23/05/22 13:07:38 INFO SparkContext: Running Spark version 3.2.1-amzn-0 23/05/22 13:07:38 INFO ResourceUtils: ============================================================== 23/05/22 13:07:38 INFO ResourceUtils: No custom resources configured for spark.driver. 23/05/22 13:07:38 INFO ResourceUtils: ============================================================== 23/05/22 13:07:38 INFO SparkContext: Submitted application: com.trepp.TreppClient 23/05/22 13:07:38 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 4, script: , vendor: , memory -> name: memory, amount: 6144, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0) 23/05/22 13:07:38 INFO ResourceProfile: Limiting resource is cpus at 4 tasks per executor 23/05/22 13:07:38 INFO ResourceProfileManager: Added ResourceProfile id: 0 23/05/22 13:07:38 INFO SecurityManager: Changing view acls to: hadoop 23/05/22 13:07:38 INFO SecurityManager: Changing modify acls to: hadoop 23/05/22 13:07:38 INFO SecurityManager: Changing view acls groups to: 23/05/22 13:07:38 INFO SecurityManager: Changing modify acls groups to: 23/05/22 13:07:38 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set() 23/05/22 13:07:39 INFO Utils: Successfully started service 'sparkDriver' on port 40847. 23/05/22 13:07:39 INFO SparkEnv: Registering MapOutputTracker 23/05/22 13:07:39 INFO SparkEnv: Registering BlockManagerMaster 23/05/22 13:07:39 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information 23/05/22 13:07:39 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up 23/05/22 13:07:39 INFO SparkEnv: Registering BlockManagerMasterHeartbeat 23/05/22 13:07:39 INFO DiskBlockManager: Created local directory at /mnt/tmp/blockmgr-0dbb342c-968b-492d-8427-5fecf7411ac7 23/05/22 13:07:39 INFO MemoryStore: MemoryStore started with capacity 3.0 GiB 23/05/22 13:07:39 INFO SparkEnv: Registering OutputCommitCoordinator 23/05/22 13:07:39 INFO SubResultCacheManager: Sub-result caches are disabled. 
23/05/22 13:07:39 INFO log: Logging initialized @19094ms to org.sparkproject.jetty.util.log.Slf4jLog 23/05/22 13:07:39 INFO Server: jetty-9.4.43.v20210629; built: 2021-06-30T11:07:22.254Z; git: 526006ecfa3af7f1a27ef3a288e2bef7ea9dd7e8; jvm 1.8.0_372-b07 23/05/22 13:07:39 INFO Server: Started @19207ms 23/05/22 13:07:39 INFO AbstractConnector: Started ServerConnector@53f7a906{HTTP/1.1, (http/1.1)}{0.0.0.0:4040} 23/05/22 13:07:39 INFO Utils: Successfully started service 'SparkUI' on port 4040. 23/05/22 13:07:39 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@184751f3{/jobs,null,AVAILABLE,@Spark} 23/05/22 13:07:39 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@2cd3fc29{/jobs/json,null,AVAILABLE,@Spark} 23/05/22 13:07:39 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3513d214{/jobs/job,null,AVAILABLE,@Spark} 23/05/22 13:07:39 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@46b5f061{/jobs/job/json,null,AVAILABLE,@Spark} 23/05/22 13:07:39 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@108b121f{/stages,null,AVAILABLE,@Spark} 23/05/22 13:07:39 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@2ff498b0{/stages/json,null,AVAILABLE,@Spark} 23/05/22 13:07:39 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4300e240{/stages/stage,null,AVAILABLE,@Spark} 23/05/22 13:07:39 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@37a67cf{/stages/stage/json,null,AVAILABLE,@Spark} 23/05/22 13:07:39 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@5908e6d6{/stages/pool,null,AVAILABLE,@Spark} 23/05/22 13:07:39 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@2a6fb62f{/stages/pool/json,null,AVAILABLE,@Spark} 23/05/22 13:07:39 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7b44bfb8{/storage,null,AVAILABLE,@Spark} 23/05/22 13:07:39 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@98637a2{/storage/json,null,AVAILABLE,@Spark} 23/05/22 13:07:39 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@141aba65{/storage/rdd,null,AVAILABLE,@Spark} 23/05/22 13:07:39 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@b55f5b7{/storage/rdd/json,null,AVAILABLE,@Spark} 23/05/22 13:07:39 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@6b2ef50e{/environment,null,AVAILABLE,@Spark} 23/05/22 13:07:39 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4b5ad306{/environment/json,null,AVAILABLE,@Spark} 23/05/22 13:07:39 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@48a46b0f{/executors,null,AVAILABLE,@Spark} 23/05/22 13:07:39 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@9f9146d{/executors/json,null,AVAILABLE,@Spark} 23/05/22 13:07:39 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@45e7bb79{/executors/threadDump,null,AVAILABLE,@Spark} 23/05/22 13:07:39 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@21c75084{/executors/threadDump/json,null,AVAILABLE,@Spark} 23/05/22 13:07:39 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@75527e36{/static,null,AVAILABLE,@Spark} 23/05/22 13:07:39 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@be6d228{/,null,AVAILABLE,@Spark} 23/05/22 13:07:39 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7eee6c13{/api,null,AVAILABLE,@Spark} 23/05/22 13:07:39 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@2ae5bd34{/jobs/job/kill,null,AVAILABLE,@Spark} 23/05/22 13:07:39 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3a16984c{/stages/stage/kill,null,AVAILABLE,@Spark} 
23/05/22 13:07:39 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://ip-10-73-103-194.ec2.internal:4040 23/05/22 13:07:39 INFO SparkContext: Added JAR s3://treppsamplebucket/mdm/etl-2.0-SNAPSHOT-jar-with-dependencies.jar at s3://treppsamplebucket/mdm/etl-2.0-SNAPSHOT-jar-with-dependencies.jar with timestamp 1684760858447 23/05/22 13:07:39 WARN FairSchedulableBuilder: Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in fairscheduler.xml or set spark.scheduler.allocation.file to a file that contains the configuration. 23/05/22 13:07:39 INFO FairSchedulableBuilder: Created default pool: default, schedulingMode: FIFO, minShare: 0, weight: 1 23/05/22 13:07:40 INFO Utils: Using 50 preallocated executors (minExecutors: 0). Set spark.dynamicAllocation.preallocateExecutors tofalsedisable executor preallocation. 23/05/22 13:07:40 INFO RMProxy: Connecting to ResourceManager at ip-10-73-103-194.ec2.internal/10.73.103.194:8032 23/05/22 13:07:40 INFO Client: Requesting a new application from cluster with 6 NodeManagers 23/05/22 13:07:40 INFO Configuration: resource-types.xml not found 23/05/22 13:07:40 INFO ResourceUtils: Unable to find 'resource-types.xml'. 23/05/22 13:07:40 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (11712 MB per container) 23/05/22 13:07:40 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead 23/05/22 13:07:40 INFO Client: Setting up container launch context for our AM 23/05/22 13:07:40 INFO Client: Setting up the launch environment for our AM container 23/05/22 13:07:40 INFO Client: Preparing resources for our AM container 23/05/22 13:07:40 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME. 
23/05/22 13:07:56 INFO Client: Uploading resource file:/mnt/tmp/spark-a2bd79dd-2460-4f21-aa8a-9e30bd7e24dc/__spark_libs__7560258025190807014.zip -> hdfs://ip-10-73-103-194.ec2.internal:8020/user/hadoop/.sparkStaging/application_1684760745521_0001/__spark_libs__7560258025190807014.zip 23/05/22 13:07:57 INFO Client: Uploading resource s3a://trepp-developmentservices-lake-workspace/binaries/hudi/etl/hudi-spark-bundle_2.12-0.11.0.jar -> hdfs://ip-10-73-103-194.ec2.internal:8020/user/hadoop/.sparkStaging/application_1684760745521_0001/hudi-spark-bundle_2.12-0.11.0.jar 23/05/22 13:07:58 INFO Client: Uploading resource file:/etc/spark/conf.dist/hive-site.xml -> hdfs://ip-10-73-103-194.ec2.internal:8020/user/hadoop/.sparkStaging/application_1684760745521_0001/hive-site.xml 23/05/22 13:07:58 INFO Client: Uploading resource file:/etc/hudi/conf.dist/hudi-defaults.conf -> hdfs://ip-10-73-103-194.ec2.internal:8020/user/hadoop/.sparkStaging/application_1684760745521_0001/hudi-defaults.conf 23/05/22 13:07:58 INFO Client: Uploading resource file:/mnt/tmp/spark-a2bd79dd-2460-4f21-aa8a-9e30bd7e24dc/__spark_conf__4894426805724937859.zip -> hdfs://ip-10-73-103-194.ec2.internal:8020/user/hadoop/.sparkStaging/application_1684760745521_0001/__spark_conf__.zip 23/05/22 13:07:58 INFO SecurityManager: Changing view acls to: hadoop 23/05/22 13:07:58 INFO SecurityManager: Changing modify acls to: hadoop 23/05/22 13:07:58 INFO SecurityManager: Changing view acls groups to: 23/05/22 13:07:58 INFO SecurityManager: Changing modify acls groups to: 23/05/22 13:07:58 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set() 23/05/22 13:07:58 INFO Client: Submitting application application_1684760745521_0001 to ResourceManager 23/05/22 13:07:59 INFO YarnClientImpl: Submitted application application_1684760745521_0001 23/05/22 13:08:00 INFO Client: Application report for application_1684760745521_0001 (state: ACCEPTED) 23/05/22 13:08:00 INFO Client: client token: N/A diagnostics: AM container is launched, waiting for AM container to Register with RM ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: default start time: 1684760878895 final status: UNDEFINED tracking URL: http://ip-10-73-103-194.ec2.internal:20888/proxy/application_1684760745521_0001/ user: hadoop 23/05/22 13:08:01 INFO Client: Application report for application_1684760745521_0001 (state: ACCEPTED) 23/05/22 13:08:02 INFO Client: Application report for application_1684760745521_0001 (state: ACCEPTED) 23/05/22 13:08:03 INFO Client: Application report for application_1684760745521_0001 (state: ACCEPTED) 23/05/22 13:08:04 INFO Client: Application report for application_1684760745521_0001 (state: ACCEPTED) 23/05/22 13:08:05 INFO YarnClientSchedulerBackend: Add WebUI Filter. 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> ip-10-73-103-194.ec2.internal, PROXY_URI_BASES -> http://ip-10-73-103-194.ec2.internal:20888/proxy/application_1684760745521_0001), /proxy/application_1684760745521_0001 23/05/22 13:08:05 INFO Client: Application report for application_1684760745521_0001 (state: RUNNING) 23/05/22 13:08:05 INFO Client: client token: N/A diagnostics: N/A ApplicationMaster host: 10.73.102.168 ApplicationMaster RPC port: -1 queue: default start time: 1684760878895 final status: UNDEFINED tracking URL: http://ip-10-73-103-194.ec2.internal:20888/proxy/application_1684760745521_0001/ user: hadoop 23/05/22 13:08:05 INFO YarnClientSchedulerBackend: Application application_1684760745521_0001 has started running. 23/05/22 13:08:05 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 39825. 23/05/22 13:08:05 INFO NettyBlockTransferService: Server created on ip-10-73-103-194.ec2.internal:39825 23/05/22 13:08:05 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy 23/05/22 13:08:05 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, ip-10-73-103-194.ec2.internal, 39825, None) 23/05/22 13:08:05 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-73-103-194.ec2.internal:39825 with 3.0 GiB RAM, BlockManagerId(driver, ip-10-73-103-194.ec2.internal, 39825, None) 23/05/22 13:08:05 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, ip-10-73-103-194.ec2.internal, 39825, None) 23/05/22 13:08:05 INFO BlockManager: external shuffle service port = 7337 23/05/22 13:08:05 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, ip-10-73-103-194.ec2.internal, 39825, None) 23/05/22 13:08:05 INFO ServerInfo: Adding filter to /metrics/json: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter 23/05/22 13:08:05 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@29592929{/metrics/json,null,AVAILABLE,@Spark} 23/05/22 13:08:05 INFO SingleEventLogFileWriter: Logging events to hdfs:/var/log/spark/apps/application_1684760745521_0001.inprogress 23/05/22 13:08:05 INFO Utils: Using 50 preallocated executors (minExecutors: 0). Set spark.dynamicAllocation.preallocateExecutors tofalsedisable executor preallocation. 23/05/22 13:08:05 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered! 23/05/22 13:08:05 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0 23/05/22 13:08:05 INFO SparkSessionWrapper: Spark Session created sucessfully for environment 23/05/22 13:08:36 WARN HoodieSparkSqlWriter$: hoodie table at s3://trepp-developmentservices-lake/presentationZone/clo/clogoldenSetHoldings already exists. Deleting existing data & overwriting with new data. 
23/05/22 13:08:45 WARN HoodieBackedTableMetadata: Metadata table was not found at path s3://trepp-developmentservices-lake/presentationZone/clo/clogoldenSetHoldings/.hoodie/metadata
java.lang.Exception: Could not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool
    at com.trepp.zone.ZoneExecutionHelper.upsert(ZoneExecutionHelper.scala:122)
    at com.trepp.zone.Presentation.$anonfun$writeHudiObject$1(Presentation.scala:92)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at scala.util.Try$.apply(Try.scala:213)
    at com.trepp.zone.Presentation.writeHudiObject(Presentation.scala:81)
    at com.trepp.process.Executor.$anonfun$writeObject$2(Executor.scala:136)
    at com.trepp.process.Executor.$anonfun$writeObject$2$adapted(Executor.scala:133)
    at scala.collection.immutable.List.foreach(List.scala:431)
    at com.trepp.process.Executor.$anonfun$writeObject$1(Executor.scala:133)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at scala.util.Try$.apply(Try.scala:213)
    at com.trepp.process.Executor.writeObject(Executor.scala:133)
    at com.trepp.process.Executor$$anon$2.$anonfun$accept$2(Executor.scala:118)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at scala.util.Try$.apply(Try.scala:213)
    at com.trepp.process.Executor$$anon$2.accept(Executor.scala:113)
    at com.trepp.process.Executor$$anon$2.accept(Executor.scala:111)
    at java.util.TreeMap.forEach(TreeMap.java:1005)
    at com.trepp.process.Executor.executeQuery(Executor.scala:111)
    at com.trepp.dataload.EtlImpl.$anonfun$executeProcess$3(EtlImpl.scala:43)
    at scala.util.Try$.apply(Try.scala:213)
    at com.trepp.dataload.EtlImpl.$anonfun$executeProcess$1(EtlImpl.scala:37)
    at com.trepp.dataload.EtlImpl.$anonfun$executeProcess$1$adapted(EtlImpl.scala:23)
    at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
    at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
    at com.trepp.dataload.EtlImpl.executeProcess(EtlImpl.scala:23)
    at com.trepp.TreppClient$.$anonfun$main$1(TreppClient.scala:46)
    at scala.util.Try$.apply(Try.scala:213)
    at com.trepp.TreppClient$.main(TreppClient.scala:40)
    at com.trepp.TreppClient.main(TreppClient.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1000)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1089)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1098)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
23/05/22 13:09:35 ERROR Presentation: Failed in writing data to location s3://trepp-developmentservices-lake/presentationZone/clo/()

devanshguptatrepp commented 1 year ago

This turned out to be an unrelated issue: the table name was causing the above error.
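
For anyone hitting the same error: the comment above does not say exactly what was wrong with the old name, so the snippet below is only a rough illustration of re-running the write with a simpler, Hive-friendly table name (lowercase letters and underscores); the name and path shown are example values, not the actual fix.

// Hedged illustration of changing the table name used for both the Hudi table and the Hive sync.
df.write.format("hudi")
  .option("hoodie.table.name", "clo_golden_set_holdings")              // example replacement name
  .option("hoodie.datasource.hive_sync.table", "clo_golden_set_holdings")
  .mode(org.apache.spark.sql.SaveMode.Overwrite)
  .save("s3://trepp-developmentservices-lake/presentationZone/clo/clo_golden_set_holdings")  // example path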