Closed: soumilshah1995 closed this issue 1 year ago.
@soumilshah1995 Can you please share the full stacktrace? Also, I don't see this required config being set https://hudi.apache.org/docs/configurations#hoodiewritelockdynamodbendpoint_url
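For reference, the lock-related options need to reach the writer along with the rest of the Hudi options. A minimal PySpark sketch of what that could look like (the table name, partition key, and billing mode below are placeholder assumptions, not the reporter's actual setup):

hudi_lock_options = {
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    "hoodie.cleaner.policy.failed.writes": "LAZY",
    "hoodie.write.lock.provider": "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
    "hoodie.write.lock.dynamodb.table": "hudi-lock-table",  # placeholder
    "hoodie.write.lock.dynamodb.partition_key": "tablename",  # placeholder
    "hoodie.write.lock.dynamodb.region": "us-west-2",
    "hoodie.write.lock.dynamodb.endpoint_url": "dynamodb.us-west-2.amazonaws.com",  # the config referenced above
    "hoodie.write.lock.dynamodb.billing_mode": "PAY_PER_REQUEST",  # placeholder
}

# Merged into the existing write options, e.g.:
# df.write.format("hudi").options(**hudi_options, **hudi_lock_options).mode("overwrite").save(final_base_path)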
Bro, I don't have logs, but I tried a lot and it doesn't function as described; you can test it yourself if necessary. By the way, I am a YouTuber and I really love Apache Hudi. Great job on that, and thank you for making it open source!
@soumilshah1995: can you share the driver logs so we can see the full stacktrace? We want to see what exception is thrown.
@codope let's try to reproduce it with the code snippet. Noticed that the reported version is 0.9.0. We can check whether things work fine with 0.12.x; if yes, we can share the working setup with @soumilshah1995 here.
@nsivabalan confirmed that it was not reproducible with 0.12.1 in our Jenkins test infra.
Did you guys try it, and did it work for you? I am very curious. I am attaching more details shortly :D (logs and info).
@codope and @nsivabalan
here are more details
Preparing ...
Wed Dec 7 15:58:22 UTC 2022
/usr/bin/java -cp /tmp:/opt/amazon/conf:/opt/amazon/glue-manifest.jar com.amazonaws.services.glue.PrepareLaunch --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.minExecutors=1 --conf spark.dynamicAllocation.maxExecutors=2 --conf spark.executor.memory=10g --conf spark.executor.cores=8 --conf spark.driver.memory=10g --conf spark.default.parallelism=24 --conf spark.sql.shuffle.partitions=24 --conf spark.network.timeout=600 --enable-glue-datacatalog true --job-bookmark-option job-bookmark-enable --TempDir s3://aws-glue-assets-043916019468-us-west-2/temporary/ --extra-jars s3://glue-learn-begineers/jar/spark-avro_2.12-3.0.1.jar,s3://glue-learn-begineers/jar/hudi-spark3-bundle_2.12-0.9.0.jar --JOB_ID j_8cd1407036ab38dae74c05008aba629c7beab90ad673660ec0184b6fb1bcef49 --enable-metrics true --enable-spark-ui true --spark-event-logs-path s3://aws-glue-assets-043916019468-us-west-2/sparkHistoryLogs/ --enable-job-insights true --JOB_RUN_ID jr_1643f6c9a18daf0cabf2bcd3fe47356d3656835dda8716cf48e2c6c4e2af75df --additional-python-modules faker==11.3.0 --enable-continuous-cloudwatch-log true --scriptLocation s3://aws-glue-assets-043916019468-us-west-2/scripts/hudi-test.py --job-language python --JOB_NAME hudi-test
1670428703170
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/amazon/spark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/amazon/lib/Log4j-slf4j-2.x.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/amazon/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Glue analyzer rules s3://aws-glue-jes-prod-us-west-2-assets/etl/analysis-rules/glue3/
INFO 2022-12-07 15:58:25,461 1 com.amazonaws.services.glue.utils.AWSClientUtils$ [main] AWSClientUtils: create aws log client with conf: proxy host null, proxy port -1
INFO 2022-12-07 15:58:25,689 229 com.amazonaws.services.glue.utils.AWSClientUtils$ [main] AWSClientUtils: getGlueClient. proxy host: null , port: -1
GlueTelemetry: Current region us-west-2
GlueTelemetry: Glue Endpoint https://glue.us-west-2.amazonaws.com
GlueTelemetry: Prod env...false
GlueTelemetry: is disabled
Glue ETL Marketplace - Start ETL connector activation process...
Glue ETL Marketplace - no connections are attached to job hudi-test, no need to download custom connector jar for it.
Glue ETL Marketplace - Retrieved no ETL connector jars, this may be due to no marketplace/custom connection attached to the job or failure of downloading them, please scroll back to the previous logs to find out the root cause. Container setup continues.
Glue ETL Marketplace - ETL connector activation process finished, container setup continues...
Download bucket: glue-learn-begineers key: jar/spark-avro_2.12-3.0.1.jar with usingProxy: false
Download bucket: glue-learn-begineers key: jar/hudi-spark3-bundle_2.12-0.9.0.jar with usingProxy: false
Download bucket: aws-glue-assets-043916019468-us-west-2 key: scripts/hudi-test.py with usingProxy: false
INFO 2022-12-07 15:58:27,947 2487 com.amazonaws.services.glue.PythonModuleInstaller [main] pip3 install --user faker==11.3.0
INFO 2022-12-07 15:58:35,156 9696 com.amazonaws.services.glue.PythonModuleInstaller [main] Collecting faker==11.3.0 Downloading Faker-11.3.0-py3-none-any.whl (1.2 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 26.4 MB/s eta 0:00:00Requirement already satisfied: typing-extensions>=3.10.0.2 in /home/spark/.local/lib/python3.7/site-packages (from faker==11.3.0) (4.4.0)Collecting text-unidecode==1.3 Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.2/78.2 kB 17.4 MB/s eta 0:00:00Requirement already satisfied: python-dateutil>=2.4 in /home/spark/.local/lib/python3.7/site-packages (from faker==11.3.0) (2.8.2)Requirement already satisfied: six>=1.5 in /home/spark/.local/lib/python3.7/site-packages (from python-dateutil>=2.4->faker==11.3.0) (1.16.0)Installing collected packages: text-unidecode, fakerSuccessfully installed faker-11.3.0 text-unidecode-1.3
INFO 2022-12-07 15:58:35,156 9696 com.amazonaws.services.glue.PythonModuleInstaller [main] WARNING: The script faker is installed in '/home/spark/.local/bin' which is not on PATH. Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Launching ...
Wed Dec 7 15:58:35 UTC 2022
/usr/bin/java -cp /tmp:/opt/amazon/conf:/opt/amazon/glue-manifest.jar:/tmp/* -Dlog4j.configuration=log4j -server -Xmx10g -XX:+UseG1GC -XX:MaxHeapFreeRatio=70 -XX:InitiatingHeapOccupancyPercent=45 -XX:OnOutOfMemoryError=kill -9 %p -XX:+UseCompressedOops -Djavax.net.ssl.trustStore=/opt/amazon/certs/ExternalAndAWSTrustStore.jks -Djavax.net.ssl.trustStoreType=JKS -Djavax.net.ssl.trustStorePassword=amazon -DRDS_ROOT_CERT_PATH=/opt/amazon/certs/rds-combined-ca-bundle.pem -DREDSHIFT_ROOT_CERT_PATH=/opt/amazon/certs/redshift-ssl-ca-cert.pem -DRDS_TRUSTSTORE_URL=file:/opt/amazon/certs/RDSTrustStore.jks -Dspark.network.timeout=600 -Dspark.metrics.conf.*.source.jvm.class=org.apache.spark.metrics.source.JvmSource -Dspark.eventLog.enabled=true -Dspark.driver.cores=4 -Dspark.metrics.conf.*.source.system.class=org.apache.spark.metrics.source.SystemMetricsSource -Dspark.glue.endpoint=https://glue-jes.us-west-2.amazonaws.com -Dspark.default.parallelism=8 -Dspark.sql.parquet.output.committer.class=com.amazon.emr.committer.EmrOptimizedSparkSqlParquetOutputCommitter -Dspark.cloudwatch.logging.conf.jobRunId=jr_1643f6c9a18daf0cabf2bcd3fe47356d3656835dda8716cf48e2c6c4e2af75df -Dspark.glue.enable-job-insights=true -Dspark.glueExceptionAnalysisEventLog.dir=/tmp/glue-exception-analysis-logs/ -Dspark.hadoop.mapred.output.direct.EmrFileSystem=true -Dspark.hadoop.aws.glue.endpoint=https://glue.us-west-2.amazonaws.com -Dspark.glueJobInsights.enabled=true -Dspark.glue.USE_PROXY=false -Dspark.eventLog.dir=/tmp/spark-event-logs/ -Dspark.extraListeners=com.amazonaws.services.glueexceptionanalysis.GlueExceptionAnalysisListener -Dspark.hadoop.fs.s3.impl=com.amazon.ws.emr.hadoop.fs.EmrFileSystem -Dspark.authenticate.secret=<HIDDEN> -Dspark.glue.additional-python-modules=faker==11.3.0 -Dspark.metrics.conf.*.sink.GlueCloudwatch.jobRunId=jr_1643f6c9a18daf0cabf2bcd3fe47356d3656835dda8716cf48e2c6c4e2af75df -Dspark.hadoop.mapred.output.direct.NativeS3FileSystem=true -Dspark.pyspark.python=/usr/bin/python3 -Dspark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory -Dspark.cloudwatch.logging.ui.showConsoleProgress=true -Dspark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 -Dspark.hadoop.glue.michiganCredentialsProviderProxy=com.amazonaws.services.glue.remote.LakeformationCredentialsProvider -Dspark.metrics.conf.driver.source.aggregate.class=org.apache.spark.metrics.source.AggregateMetricsSource -Dspark.glue.enable-continuous-cloudwatch-log=true -Dspark.glue.JOB_RUN_ID=jr_1643f6c9a18daf0cabf2bcd3fe47356d3656835dda8716cf48e2c6c4e2af75df -Dspark.sql.catalogImplementation=hive -Dspark.dynamicAllocation.maxExecutors=2 -Dspark.ui.enabled=false -Dspark.driver.extraClassPath=/tmp:/opt/amazon/conf:/opt/amazon/glue-manifest.jar -Dspark.sql.shuffle.partitions=8 -Dspark.authenticate=true -Dspark.glue.GLUE_TASK_GROUP_ID=eb5a0158-1b23-4c96-b077-73337a6fd062 -Dspark.dynamicAllocation.enabled=false -Dspark.shuffle.service.enabled=false -Dspark.hadoop.mapred.output.committer.class=org.apache.hadoop.mapred.DirectOutputCommitter -Dspark.executor.extraClassPath=/tmp:/opt/amazon/conf:/opt/amazon/glue-manifest.jar -Dspark.glue.GLUE_VERSION=3.0 -Dspark.executor.cores=4 -Dspark.hadoop.lakeformation.credentials.url=http://localhost:9998/lakeformationcredentials -Dspark.app.name=nativespark-hudi-test-jr_1643f6c9a18daf0cabf2bcd3fe47356d3656835dda8716cf48e2c6c4e2af75df -Dspark.executor.instances=2 -Dspark.metrics.conf.*.sink.GlueCloudwatch.jobName=hudi-test -Dspark.rpc.askTimeout=600 
-Dspark.sql.parquet.fs.optimized.committer.optimization-enabled=true -Dspark.metrics.conf.*.source.s3.class=org.apache.spark.metrics.source.S3FileSystemSource -Dspark.driver.host=172.36.33.155 -Dspark.metrics.conf.*.sink.GlueCloudwatch.namespace=Glue -Dspark.pyFiles= -Dspark.executor.memory=10g -Dspark.metrics.conf.*.sink.GlueCloudwatch.class=org.apache.spark.metrics.sink.GlueCloudwatchSink -Dspark.driver.memory=10g -Dspark.hadoop.fs.s3.buffer.dir=/tmp/hadoop-spark/s3 -Dspark.glue.GLUE_COMMAND_CRITERIA=glueetl -Dspark.master=jes -Dspark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false -Dspark.unsafe.sorter.spill.read.ahead.enabled=false -Dspark.hadoop.parquet.enable.summary-metadata=false -Dspark.glueAppInsightsLog.dir=/tmp/glue-app-insights-logs/ -Dspark.glue.jobLanguage=python -Dspark.glue.extra-jars=s3://glue-learn-begineers/jar/spark-avro_2.12-3.0.1.jar,s3://glue-learn-begineers/jar/hudi-spark3-bundle_2.12-0.9.0.jar -Dspark.glue.JOB_NAME=hudi-test -Dspark.script.location=s3://aws-glue-assets-043916019468-us-west-2/scripts/hudi-test.py -Dspark.files.overwrite=true com.amazonaws.services.glue.ProcessLauncher --launch-class org.apache.spark.deploy.PythonRunner /opt/amazon/bin/runscript.py /tmp/hudi-test.py true --job-bookmark-option job-bookmark-enable --JOB_ID j_8cd1407036ab38dae74c05008aba629c7beab90ad673660ec0184b6fb1bcef49 true --spark-event-logs-path s3://aws-glue-assets-043916019468-us-west-2/sparkHistoryLogs/ --JOB_RUN_ID jr_1643f6c9a18daf0cabf2bcd3fe47356d3656835dda8716cf48e2c6c4e2af75df --JOB_NAME hudi-test --TempDir s3://aws-glue-assets-043916019468-us-west-2/temporary/
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/amazon/spark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/amazon/lib/Log4j-slf4j-2.x.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/amazon/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Continuous Logging: Creating cloudwatch appender.
Continuous Logging: Creating cloudwatch appender.
Continuous Logging: Creating cloudwatch appender.
Continuous Logging: Creating cloudwatch appender.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
SLF4J: A number (14) of logging calls during the initialization phase have been intercepted and are
SLF4J: now being replayed. These are subject to the filtering rules of the underlying logging system.
SLF4J: See also http://www.slf4j.org/codes.html#replay
2022-12-07 15:58:38,910 INFO [main] util.PlatformInfo (PlatformInfo.java:getJobFlowId(56)): Unable to read clusterId from http://localhost:8321/configuration, trying extra instance data file: /var/lib/instance-controller/extraInstanceData.json
2022-12-07 15:58:38,961 INFO [main] util.PlatformInfo (PlatformInfo.java:getJobFlowId(63)): Unable to read clusterId from /var/lib/instance-controller/extraInstanceData.json, trying EMR job-flow data file: /var/lib/info/job-flow.json
2022-12-07 15:58:38,961 INFO [main] util.PlatformInfo (PlatformInfo.java:getJobFlowId(71)): Unable to read clusterId from /var/lib/info/job-flow.json, out of places to look
2022-12-07 15:58:40,305 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): SLF4J: Class path contains multiple SLF4J bindings.
2022-12-07 15:58:40,305 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): SLF4J: Found binding in [jar:file:/opt/amazon/spark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
2022-12-07 15:58:40,305 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): SLF4J: Found binding in [jar:file:/opt/amazon/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
2022-12-07 15:58:40,306 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): SLF4J: Found binding in [jar:file:/opt/amazon/lib/Log4j-slf4j-2.x.jar!/org/slf4j/impl/StaticLoggerBinder.class]
2022-12-07 15:58:40,306 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
2022-12-07 15:58:40,320 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2022-12-07 15:58:40,667 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
2022-12-07 15:58:40,667 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): log4j:WARN Please initialize the log4j system properly.
2022-12-07 15:58:40,667 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
2022-12-07 15:58:41,386 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): ==== Netty Server Started ====
2022-12-07 15:58:41,389 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): input rulesFilePath=/tmp/glue_app_analyzer_rules
2022-12-07 15:58:41,814 INFO [main] glue.SafeLogging (Logging.scala:logInfo(57)): Initializing logging subsystem
2022-12-07 15:58:43,864 INFO [Thread-12] spark.SparkContext (Logging.scala:logInfo(57)): Running Spark version 3.1.1-amzn-0
2022-12-07 15:58:43,911 INFO [Thread-12] resource.ResourceUtils (Logging.scala:logInfo(57)): ==============================================================
2022-12-07 15:58:43,912 INFO [Thread-12] resource.ResourceUtils (Logging.scala:logInfo(57)): No custom resources configured for spark.driver.
2022-12-07 15:58:43,913 INFO [Thread-12] resource.ResourceUtils (Logging.scala:logInfo(57)): ==============================================================
2022-12-07 15:58:43,913 INFO [Thread-12] spark.SparkContext (Logging.scala:logInfo(57)): Submitted application: nativespark-hudi-test-jr_1643f6c9a18daf0cabf2bcd3fe47356d3656835dda8716cf48e2c6c4e2af75df
2022-12-07 15:58:43,935 INFO [Thread-12] resource.ResourceProfile (Logging.scala:logInfo(57)): Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 4, script: , vendor: , memory -> name: memory, amount: 10240, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
2022-12-07 15:58:43,953 INFO [Thread-12] resource.ResourceProfile (Logging.scala:logInfo(57)): Limiting resource is cpus at 4 tasks per executor
2022-12-07 15:58:43,956 INFO [Thread-12] resource.ResourceProfileManager (Logging.scala:logInfo(57)): Added ResourceProfile id: 0
2022-12-07 15:58:44,032 INFO [Thread-12] spark.SecurityManager (Logging.scala:logInfo(57)): Changing view acls to: spark
2022-12-07 15:58:44,033 INFO [Thread-12] spark.SecurityManager (Logging.scala:logInfo(57)): Changing modify acls to: spark
2022-12-07 15:58:44,034 INFO [Thread-12] spark.SecurityManager (Logging.scala:logInfo(57)): Changing view acls groups to:
2022-12-07 15:58:44,034 INFO [Thread-12] spark.SecurityManager (Logging.scala:logInfo(57)): Changing modify acls groups to:
2022-12-07 15:58:44,035 INFO [Thread-12] spark.SecurityManager (Logging.scala:logInfo(57)): SecurityManager: authentication enabled; ui acls disabled; users with view permissions: Set(spark); groups with view permissions: Set(); users with modify permissions: Set(spark); groups with modify permissions: Set()
2022-12-07 15:58:44,361 INFO [Thread-12] util.Utils (Logging.scala:logInfo(57)): Successfully started service 'sparkDriver' on port 34581.
2022-12-07 15:58:44,409 INFO [Thread-12] spark.SparkEnv (Logging.scala:logInfo(57)): Registering MapOutputTracker
2022-12-07 15:58:44,446 INFO [Thread-12] spark.SparkEnv (Logging.scala:logInfo(57)): Registering BlockManagerMaster
2022-12-07 15:58:44,480 INFO [Thread-12] storage.BlockManagerMasterEndpoint (Logging.scala:logInfo(57)): Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2022-12-07 15:58:44,481 INFO [Thread-12] storage.BlockManagerMasterEndpoint (Logging.scala:logInfo(57)): BlockManagerMasterEndpoint up
2022-12-07 15:58:44,486 INFO [Thread-12] spark.SparkEnv (Logging.scala:logInfo(57)): Registering BlockManagerMasterHeartbeat
2022-12-07 15:58:44,508 INFO [Thread-12] storage.DiskBlockManager (Logging.scala:logInfo(57)): Created local directory at /tmp/blockmgr-28d77124-2a0a-4c4f-ae8f-d15612884b2a
2022-12-07 15:58:44,543 INFO [Thread-12] memory.MemoryStore (Logging.scala:logInfo(57)): MemoryStore started with capacity 5.8 GiB
2022-12-07 15:58:44,566 INFO [Thread-12] spark.SparkEnv (Logging.scala:logInfo(57)): Registering OutputCommitCoordinator
2022-12-07 15:58:44,715 INFO [Thread-12] scheduler.JESSchedulerBackend$JESAsSchedulerBackendEndpoint (Logging.scala:logInfo(57)): JESAsSchedulerBackendEndpoint
2022-12-07 15:58:44,716 INFO [Thread-12] scheduler.JESSchedulerBackend (Logging.scala:logInfo(57)): JESSchedulerBackend
2022-12-07 15:58:44,723 INFO [Thread-12] scheduler.JESSchedulerBackend (JESClusterManager.scala:<init>(210)): JESClusterManager: Initializing JES client with proxy: host: null, port: -1
2022-12-07 15:58:44,980 INFO [allocator] glue.TaskGroupInterface (Logging.scala:logInfo(57)): creating executor task for executor 1; clientToken gr_eb5a0158-1b23-4c96-b077-73337a6fd062_e_1_a_spark-application-1670428724710
2022-12-07 15:58:44,986 INFO [Thread-12] util.Utils (Logging.scala:logInfo(57)): Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 35509.
2022-12-07 15:58:44,986 INFO [Thread-12] netty.NettyBlockTransferService (NettyBlockTransferService.scala:init(81)): Server created on 172.36.33.155:35509
2022-12-07 15:58:44,988 INFO [Thread-12] storage.BlockManager (Logging.scala:logInfo(57)): Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
2022-12-07 15:58:44,999 INFO [Thread-12] storage.BlockManagerMaster (Logging.scala:logInfo(57)): Registering BlockManager BlockManagerId(driver, 172.36.33.155, 35509, None)
2022-12-07 15:58:45,004 INFO [dispatcher-BlockManagerMaster] storage.BlockManagerMasterEndpoint (Logging.scala:logInfo(57)): Registering block manager 172.36.33.155:35509 with 5.8 GiB RAM, BlockManagerId(driver, 172.36.33.155, 35509, None)
2022-12-07 15:58:45,009 INFO [Thread-12] storage.BlockManagerMaster (Logging.scala:logInfo(57)): Registered BlockManager BlockManagerId(driver, 172.36.33.155, 35509, None)
2022-12-07 15:58:45,010 INFO [Thread-12] storage.BlockManager (Logging.scala:logInfo(57)): Initialized BlockManager: BlockManagerId(driver, 172.36.33.155, 35509, None)
2022-12-07 15:58:45,076 INFO [Thread-12] sink.GlueCloudwatchSink (GlueCloudwatchSink.scala:<init>(53)): GlueCloudwatchSink: get cloudwatch client using proxy: host null, port -1
2022-12-07 15:58:45,160 INFO [Thread-12] sink.GlueCloudwatchSink (GlueCloudwatchSink.scala:logInfo(22)): CloudwatchSink: Obtained credentials from the Instance Profile
2022-12-07 15:58:45,227 INFO [Thread-12] sink.GlueCloudwatchSink (GlueCloudwatchSink.scala:logInfo(22)): CloudwatchSink: jobName: hudi-test jobRunId: jr_1643f6c9a18daf0cabf2bcd3fe47356d3656835dda8716cf48e2c6c4e2af75df
2022-12-07 15:58:45,399 INFO [Thread-12] util.log (Log.java:initialized(169)): Logging initialized @10171ms to org.sparkproject.jetty.util.log.Slf4jLog
2022-12-07 15:58:45,523 INFO [Thread-12] history.SingleEventLogFileWriter (Logging.scala:logInfo(57)): Logging events to file:/tmp/spark-event-logs/spark-application-1670428724710.inprogress
2022-12-07 15:58:45,791 INFO [Thread-12] glueexceptionanalysis.EventLogFileWriter (FileWriter.scala:start(51)): Started file writer for com.amazonaws.services.glueexceptionanalysis.EventLogFileWriter
2022-12-07 15:58:45,807 INFO [allocator] glue.TaskGroupInterface (Logging.scala:logInfo(57)): createChildTask API response code 200
2022-12-07 15:58:45,809 INFO [allocator] glue.ExecutorTaskManagement (Logging.scala:logInfo(57)): executor task g-523d741ce5530b03650113bbb25bb74c68b29f8e created for executor 1
2022-12-07 15:58:45,810 INFO [allocator] glue.TaskGroupInterface (Logging.scala:logInfo(57)): creating executor task for executor 2; clientToken gr_eb5a0158-1b23-4c96-b077-73337a6fd062_e_2_a_spark-application-1670428724710
2022-12-07 15:58:45,828 INFO [Thread-12] glueexceptionanalysis.EventLogSocketWriter (SocketWriter.scala:start(34)): Socket client for com.amazonaws.services.glueexceptionanalysis.EventLogSocketWriter started correctly
2022-12-07 15:58:45,866 INFO [Thread-12] spark.SparkContext (Logging.scala:logInfo(57)): Registered listener com.amazonaws.services.glueexceptionanalysis.GlueExceptionAnalysisListener
2022-12-07 15:58:45,922 INFO [Thread-12] scheduler.JESSchedulerBackend (Logging.scala:logInfo(57)): SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
2022-12-07 15:58:46,002 INFO [allocator] glue.TaskGroupInterface (Logging.scala:logInfo(57)): createChildTask API response code 200
2022-12-07 15:58:46,009 INFO [allocator] glue.ExecutorTaskManagement (Logging.scala:logInfo(57)): executor task g-5659a8df9288c783a41f61997de7901665d58b93 created for executor 2
2022-12-07 15:58:47,010 INFO [Thread-12] internal.SharedState (Logging.scala:logInfo(57)): Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/tmp/spark-warehouse').
2022-12-07 15:58:47,011 INFO [Thread-12] internal.SharedState (Logging.scala:logInfo(57)): Warehouse path is 'file:/tmp/spark-warehouse'.
2022-12-07 15:58:48,223 INFO [Thread-12] glue.GlueContext (GlueContext.scala:<init>(120)): GlueMetrics configured and enabled
2022-12-07 15:58:54,783 WARN [Thread-12] impl.MetricsConfig (MetricsConfig.java:loadFirst(136)): Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
2022-12-07 15:58:54,894 INFO [Thread-12] impl.MetricsSystemImpl (MetricsSystemImpl.java:startTimer(374)): Scheduled Metric snapshot period at 10 second(s).
2022-12-07 15:58:54,895 INFO [Thread-12] impl.MetricsSystemImpl (MetricsSystemImpl.java:start(191)): s3a-file-system metrics system started
2022-12-07 15:58:55,829 INFO [Thread-12] table.HoodieTableMetaClient (HoodieTableMetaClient.java:initTableAndGetMetaClient(344)): Initializing s3a://glue-learn-begineers/tmp/users as hoodie table s3a://glue-learn-begineers/tmp/users
2022-12-07 15:59:03,960 INFO [dispatcher-CoarseGrainedScheduler] scheduler.JESSchedulerBackend$JESAsSchedulerBackendEndpoint (Logging.scala:logInfo(57)): Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.34.11.224:53240) with ID 2, ResourceProfileId 0
2022-12-07 15:59:03,963 INFO [spark-listener-group-shared] scheduler.ExecutorEventListener (Logging.scala:logInfo(57)): Got executor added event for 2 @ 1670428743962
2022-12-07 15:59:03,964 INFO [spark-listener-group-shared] glue.ExecutorTaskManagement (Logging.scala:logInfo(57)): connected executor 2
2022-12-07 15:59:04,637 INFO [dispatcher-BlockManagerMaster] storage.BlockManagerMasterEndpoint (Logging.scala:logInfo(57)): Registering block manager 172.34.11.224:42111 with 5.8 GiB RAM, BlockManagerId(2, 172.34.11.224, 42111, None)
2022-12-07 15:59:05,498 INFO [Thread-12] table.HoodieTableMetaClient (HoodieTableMetaClient.java:<init>(105)): Loading HoodieTableMetaClient from s3a://glue-learn-begineers/tmp/users
2022-12-07 15:59:05,942 INFO [Thread-12] table.HoodieTableConfig (HoodieTableConfig.java:<init>(170)): Loading table properties from s3a://glue-learn-begineers/tmp/users/.hoodie/hoodie.properties
2022-12-07 15:59:06,103 INFO [Thread-12] table.HoodieTableMetaClient (HoodieTableMetaClient.java:<init>(125)): Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://glue-learn-begineers/tmp/users
2022-12-07 15:59:06,103 INFO [Thread-12] table.HoodieTableMetaClient (HoodieTableMetaClient.java:initTableAndGetMetaClient(381)): Finished initializing Table of type COPY_ON_WRITE from s3a://glue-learn-begineers/tmp/users
2022-12-07 15:59:06,383 INFO [Thread-12] hudi.HoodieSparkSqlWriter$ (HoodieSparkSqlWriter.scala:write(221)): Registered avro schema : { "type": "record", "name": "users_record", "namespace": "hoodie.users", "fields": [ { "name": "emp_id", "type": [ "null", "long" ], "default": null }, { "name": "employee_name", "type": [ "null", "string" ], "default": null }, { "name": "department", "type": [ "null", "string" ], "default": null }, { "name": "state", "type": [ "null", "string" ], "default": null }, { "name": "salary", "type": [ "null", "long" ], "default": null }, { "name": "age", "type": [ "null", "long" ], "default": null }, { "name": "bonus", "type": [ "null", "long" ], "default": null }, { "name": "ts", "type": [ "null", "long" ], "default": null } ] }
2022-12-07 15:59:06,985 INFO [Thread-12] codegen.CodeGenerator (Logging.scala:logInfo(57)): Code generated in 323.430584 ms
2022-12-07 15:59:07,239 INFO [Thread-12] embedded.EmbeddedTimelineService (EmbeddedTimelineServerHelper.java:startTimelineService(67)): Starting Timeline service !!
2022-12-07 15:59:07,241 INFO [Thread-12] embedded.EmbeddedTimelineService (EmbeddedTimelineService.java:setHostAddr(100)): Overriding hostIp to (172.36.33.155) found in spark-conf. It was null
2022-12-07 15:59:07,260 INFO [Thread-12] view.FileSystemViewManager (FileSystemViewManager.java:createViewManager(232)): Creating View Manager with storage type :MEMORY
2022-12-07 15:59:07,262 INFO [Thread-12] view.FileSystemViewManager (FileSystemViewManager.java:createViewManager(244)): Creating in-memory based Table View
2022-12-07 15:59:07,287 INFO [Thread-12] util.log (Log.java:initialized(193)): Logging initialized @32060ms to org.apache.hudi.org.apache.jetty.util.log.Slf4jLog
2022-12-07 15:59:07,569 INFO [Thread-12] javalin.Javalin (Javalin.java:start(134)): __ __ _ / /____ _ _ __ ____ _ / /(_)____ __ / // __ `/| | / // __ `// // // __ \ / /_/ // /_/ / | |/ // /_/ // // // / / / \____/ \__,_/ |___/ \__,_//_//_//_/ /_/ https://javalin.io/documentation
2022-12-07 15:59:07,571 INFO [Thread-12] javalin.Javalin (Javalin.java:start(139)): Starting Javalin ...
2022-12-07 15:59:07,731 INFO [dispatcher-CoarseGrainedScheduler] scheduler.JESSchedulerBackend$JESAsSchedulerBackendEndpoint (Logging.scala:logInfo(57)): Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.34.179.199:52900) with ID 1, ResourceProfileId 0
2022-12-07 15:59:07,732 INFO [spark-listener-group-shared] scheduler.ExecutorEventListener (Logging.scala:logInfo(57)): Got executor added event for 1 @ 1670428747732
2022-12-07 15:59:07,733 INFO [spark-listener-group-shared] glue.ExecutorTaskManagement (Logging.scala:logInfo(57)): connected executor 1
2022-12-07 15:59:07,843 INFO [Thread-12] javalin.Javalin (JettyServerUtil.kt:initialize(113)): Listening on http://localhost:45551/
2022-12-07 15:59:07,843 INFO [Thread-12] javalin.Javalin (Javalin.java:start(149)): Javalin started in 280ms \o/
2022-12-07 15:59:07,843 INFO [Thread-12] service.TimelineService (TimelineService.java:startService(280)): Starting Timeline server on port :45551
2022-12-07 15:59:07,843 INFO [Thread-12] embedded.EmbeddedTimelineService (EmbeddedTimelineService.java:startServer(95)): Started embedded timeline server at 172.36.33.155:45551
2022-12-07 15:59:07,859 INFO [Thread-12] hudi.HoodieSparkSqlWriter$ (HoodieSparkSqlWriter.scala:isAsyncCompactionEnabled(675)): Config.inlineCompactionEnabled ? false
2022-12-07 15:59:07,859 INFO [Thread-12] hudi.HoodieSparkSqlWriter$ (HoodieSparkSqlWriter.scala:isAsyncClusteringEnabled(686)): Config.asyncClusteringEnabled ? false
2022-12-07 15:59:07,860 INFO [Thread-12] table.HoodieTableMetaClient (HoodieTableMetaClient.java:<init>(105)): Loading HoodieTableMetaClient from s3a://glue-learn-begineers/tmp/users
2022-12-07 15:59:08,079 INFO [Thread-12] table.HoodieTableConfig (HoodieTableConfig.java:<init>(170)): Loading table properties from s3a://glue-learn-begineers/tmp/users/.hoodie/hoodie.properties
2022-12-07 15:59:08,226 INFO [Thread-12] table.HoodieTableMetaClient (HoodieTableMetaClient.java:<init>(125)): Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://glue-learn-begineers/tmp/users
2022-12-07 15:59:08,226 INFO [Thread-12] table.HoodieTableMetaClient (HoodieTableMetaClient.java:<init>(128)): Loading Active commit timeline for s3a://glue-learn-begineers/tmp/users
2022-12-07 15:59:08,542 INFO [Thread-12] timeline.HoodieActiveTimeline (HoodieActiveTimeline.java:<init>(115)): Loaded instants upto : Optional.empty
2022-12-07 15:59:08,550 INFO [Thread-12] client.AbstractHoodieWriteClient (AbstractHoodieWriteClient.java:startCommit(714)): Generate a new instant time: 20221207155854 action: commit
2022-12-07 15:59:08,551 INFO [Thread-12] heartbeat.HoodieHeartbeatClient (HoodieHeartbeatClient.java:start(166)): Received request to start heartbeat for instant time 20221207155854
2022-12-07 15:59:08,751 INFO [dispatcher-BlockManagerMaster] storage.BlockManagerMasterEndpoint (Logging.scala:logInfo(57)): Registering block manager 172.34.179.199:34373 with 5.8 GiB RAM, BlockManagerId(1, 172.34.179.199, 34373, None)
2022-12-07 15:59:09,002 INFO [Thread-12] timeline.HoodieActiveTimeline (HoodieActiveTimeline.java:createNewInstant(144)): Creating a new instant [==>20221207155854__commit__REQUESTED]
2022-12-07 15:59:09,648 INFO [Thread-12] table.HoodieTableMetaClient (HoodieTableMetaClient.java:<init>(105)): Loading HoodieTableMetaClient from s3a://glue-learn-begineers/tmp/users
2022-12-07 15:59:09,745 INFO [pool-5-thread-1] glue.LogPusher (Logging.scala:logInfo(57)): uploading /tmp/spark-event-logs/ to s3://aws-glue-assets-043916019468-us-west-2/sparkHistoryLogs/
2022-12-07 15:59:10,081 INFO [Thread-12] table.HoodieTableConfig (HoodieTableConfig.java:<init>(170)): Loading table properties from s3a://glue-learn-begineers/tmp/users/.hoodie/hoodie.properties
2022-12-07 15:59:10,239 INFO [Thread-12] table.HoodieTableMetaClient (HoodieTableMetaClient.java:<init>(125)): Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://glue-learn-begineers/tmp/users
2022-12-07 15:59:10,239 INFO [Thread-12] table.HoodieTableMetaClient (HoodieTableMetaClient.java:<init>(128)): Loading Active commit timeline for s3a://glue-learn-begineers/tmp/users
2022-12-07 15:59:10,296 INFO [pool-5-thread-1] s3n.MultipartUploadOutputStream (MultipartUploadOutputStream.java:close(418)): close closed:false s3://aws-glue-assets-043916019468-us-west-2/sparkHistoryLogs/spark-application-1670428724710.inprogress
2022-12-07 15:59:10,540 INFO [Thread-12] timeline.HoodieActiveTimeline (HoodieActiveTimeline.java:<init>(115)): Loaded instants upto : Option{val=[==>20221207155854__commit__REQUESTED]}
2022-12-07 15:59:10,560 INFO [Thread-12] view.FileSystemViewManager (FileSystemViewManager.java:createViewManager(232)): Creating View Manager with storage type :REMOTE_FIRST
2022-12-07 15:59:10,565 INFO [Thread-12] view.FileSystemViewManager (FileSystemViewManager.java:createViewManager(252)): Creating remote first table view
2022-12-07 15:59:10,575 INFO [Thread-12] transaction.TransactionManager (TransactionManager.java:beginTransaction(61)): Latest completed transaction instant Optional.empty
2022-12-07 15:59:10,575 INFO [Thread-12] transaction.TransactionManager (TransactionManager.java:beginTransaction(63)): Transaction starting with transaction owner Option{val=[==>20221207155854__commit__INFLIGHT]}
2022-12-07 15:59:10,576 INFO [Thread-12] lock.LockManager (LockManager.java:getLockProvider(96)): LockProvider org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider
2022-12-07 15:59:10,812 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(73)): Error from Python:
Traceback (most recent call last):
  File "/tmp/hudi-test.py", line 114, in <module>
    df.write.format("hudi").options(**hudi_options).mode("overwrite").save(final_base_path)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 1109, in save
    self._jwrite.save(path)
  File "/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o136.save.
: org.apache.hudi.exception.HoodieException: Unable to load class
    at org.apache.hudi.common.util.ReflectionUtils.getClass(ReflectionUtils.java:58)
    at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:90)
    at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:101)
    at org.apache.hudi.client.transaction.lock.LockManager.getLockProvider(LockManager.java:97)
    at org.apache.hudi.client.transaction.lock.LockManager.lock(LockManager.java:61)
    at org.apache.hudi.client.transaction.TransactionManager.beginTransaction(TransactionManager.java:64)
    at org.apache.hudi.client.AbstractHoodieWriteClient.preWrite(AbstractHoodieWriteClient.java:402)
    at org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:156)
    at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:214)
    at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:265)
    at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:164)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:134)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:133)
    at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
    at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)
    at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassNotFoundException: org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider
    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.hudi.common.util.ReflectionUtils.getClass(ReflectionUtils.java:54)
    ... 48 more
2022-12-07 15:59:10,836 ERROR [main] glueexceptionanalysis.GlueExceptionAnalysisListener (Logging.scala:logError(9)): [Glue Exception Analysis] { "Event": "GlueETLJobExceptionEvent", "Timestamp": 1670428750832, "Failure Reason": "Traceback (most recent call last):\n File \"/tmp/hudi-test.py\", line 114, in <module>\n df.write.format(\"hudi\").options(**hudi_options).mode(\"overwrite\").save(final_base_path)\n File \"/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py\", line 1109, in save\n self._jwrite.save(path)\n File \"/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py\", line 1305, in __call__\n answer, self.gateway_client, self.target_id, self.name)\n File \"/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py\", line 111, in deco\n return f(*a, **kw)\n File \"/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py\", line 328, in get_return_value\n format(target_id, \".\", name), value)\npy4j.protocol.Py4JJavaError: An error occurred while calling o136.save.\n: org.apache.hudi.exception.HoodieException: Unable to load class\n\tat org.apache.hudi.common.util.ReflectionUtils.getClass(ReflectionUtils.java:58)\n\tat org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:90)\n\tat org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:101)\n\tat org.apache.hudi.client.transaction.lock.LockManager.getLockProvider(LockManager.java:97)\n\tat org.apache.hudi.client.transaction.lock.LockManager.lock(LockManager.java:61)\n\tat org.apache.hudi.client.transaction.TransactionManager.beginTransaction(TransactionManager.java:64)\n\tat org.apache.hudi.client.AbstractHoodieWriteClient.preWrite(AbstractHoodieWriteClient.java:402)\n\tat org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:156)\n\tat org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:214)\n\tat org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:265)\n\tat org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:164)\n\tat org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)\n\tat org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)\n\tat org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)\n\tat org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)\n\tat org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185)\n\tat org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)\n\tat org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)\n\tat org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220)\n\tat org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181)\n\tat org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:134)\n\tat org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:133)\n\tat org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)\n\tat org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)\n\tat org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)\n\tat org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)\n\tat 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)\n\tat org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)\n\tat org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)\n\tat org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)\n\tat org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)\n\tat org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)\n\tat org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)\n\tat org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)\n\tat org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)\n\tat org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)\n\tat org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)\n\tat org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:498)\n\tat py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\tat py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\tat py4j.Gateway.invoke(Gateway.java:282)\n\tat py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\tat py4j.commands.CallCommand.execute(CallCommand.java:79)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:238)\n\tat java.lang.Thread.run(Thread.java:750)\nCaused by: java.lang.ClassNotFoundException: org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider\n\tat java.net.URLClassLoader.findClass(URLClassLoader.java:387)\n\tat java.lang.ClassLoader.loadClass(ClassLoader.java:418)\n\tat sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)\n\tat java.lang.ClassLoader.loadClass(ClassLoader.java:351)\n\tat java.lang.Class.forName0(Native Method)\n\tat java.lang.Class.forName(Class.java:348)\n\tat org.apache.hudi.common.util.ReflectionUtils.getClass(ReflectionUtils.java:54)\n\t... 48 more\n", "Stack Trace": [ { "Declaring Class": "get_return_value", "Method Name": "format(target_id, \".\", name), value)", "File Name": "/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", "Line Number": 328 }, { "Declaring Class": "deco", "Method Name": "return f(*a, **kw)", "File Name": "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", "Line Number": 111 }, { "Declaring Class": "__call__", "Method Name": "answer, self.gateway_client, self.target_id, self.name)", "File Name": "/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", "Line Number": 1305 }, { "Declaring Class": "save", "Method Name": "self._jwrite.save(path)", "File Name": "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", "Line Number": 1109 }, { "Declaring Class": "<module>", "Method Name": "df.write.format(\"hudi\").options(**hudi_options).mode(\"overwrite\").save(final_base_path)", "File Name": "/tmp/hudi-test.py", "Line Number": 114 } ], "Last Executed Line number": 114, "script": "hudi-test.py" }
2022-12-07 15:59:10,935 ERROR [main] glueexceptionanalysis.GlueExceptionAnalysisListener (Logging.scala:logError(9)): [Glue Exception Analysis] Last Executed Line number from script hudi-test.py: 114
2022-12-07 15:59:10,938 INFO [main] glue.ResourceManagerSocketWriter (SocketWriter.scala:start(34)): Socket client for com.amazonaws.services.glue.ResourceManagerSocketWriter started correctly
2022-12-07 15:59:10,946 INFO [main] glue.ProcessLauncher (Logging.scala:logInfo(57)): postprocessing
2022-12-07 15:59:10,947 INFO [main] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): Analyzer: Error stopping analyzer process null
2022-12-07 15:59:10,949 INFO [main] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): Analyzer: Error terminating analyzer process null
2022-12-07 15:59:10,950 INFO [main] glue.LogPusher (Logging.scala:logInfo(57)): stopping
Continuous Logging: Shutting down cloudwatch appender.
Continuous Logging: Shutting down cloudwatch appender.
Continuous Logging: Shutting down cloudwatch appender.
Continuous Logging: Shutting down cloudwatch appender.
2022-12-07 15:59:10,967 INFO [shutdown-hook-0] spark.SparkContext (Logging.scala:logInfo(57)): Invoking stop() from shutdown hook
2022-12-07 15:59:10,973 INFO [spark-listener-group-shared] glueexceptionanalysis.EventLogFileWriter (FileWriter.scala:stop(70)): Logs, events processed and insights are written to file /tmp/glue-exception-analysis-logs/spark-application-1670428724710
2022-12-07 15:59:10,980 INFO [shutdown-hook-0] scheduler.JESSchedulerBackend (Logging.scala:logInfo(57)): Shutting down all executors
2022-12-07 15:59:10,986 INFO [dispatcher-CoarseGrainedScheduler] scheduler.JESSchedulerBackend$JESAsSchedulerBackendEndpoint (Logging.scala:logInfo(57)): Asking each executor to shut down
2022-12-07 15:59:11,010 INFO [dispatcher-event-loop-0] spark.MapOutputTrackerMasterEndpoint (Logging.scala:logInfo(57)): MapOutputTrackerMasterEndpoint stopped!
2022-12-07 15:59:11,029 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): GlueJobEventsInputHandler Interrupted Last event of a job received. Stop event handling.
2022-12-07 15:59:11,030 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): Stopping - analyzer netty server
2022-12-07 15:59:11,037 INFO [shutdown-hook-0] memory.MemoryStore (Logging.scala:logInfo(57)): MemoryStore cleared
2022-12-07 15:59:11,037 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): Stopped - analyzer netty server
2022-12-07 15:59:11,037 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): Processing remaining events in queue before termination
2022-12-07 15:59:11,037 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): Closing rules engine
2022-12-07 15:59:11,037 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): Stopping analyzer log writer
2022-12-07 15:59:11,037 INFO [shutdown-hook-0] storage.BlockManager (Logging.scala:logInfo(57)): BlockManager stopped
2022-12-07 15:59:11,049 INFO [shutdown-hook-0] storage.BlockManagerMaster (Logging.scala:logInfo(57)): BlockManagerMaster stopped
2022-12-07 15:59:11,053 INFO [dispatcher-event-loop-1] scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint (Logging.scala:logInfo(57)): OutputCommitCoordinator stopped!
2022-12-07 15:59:11,061 INFO [shutdown-hook-0] spark.SparkContext (Logging.scala:logInfo(57)): Successfully stopped SparkContext
2022-12-07 15:59:11,061 INFO [shutdown-hook-0] glue.LogPusher (Logging.scala:logInfo(57)): uploading /tmp/spark-event-logs/ to s3://aws-glue-assets-043916019468-us-west-2/sparkHistoryLogs/
2022-12-07 15:59:11,111 INFO [shutdown-hook-0] s3n.MultipartUploadOutputStream (MultipartUploadOutputStream.java:close(418)): close closed:false s3://aws-glue-assets-043916019468-us-west-2/sparkHistoryLogs/spark-application-1670428724710
2022-12-07 15:59:11,142 INFO [shutdown-hook-0] util.ShutdownHookManager (Logging.scala:logInfo(57)): Shutdown hook called
2022-12-07 15:59:11,143 INFO [shutdown-hook-0] util.ShutdownHookManager (Logging.scala:logInfo(57)): Deleting directory /tmp/spark-4c587f79-b77e-4212-bf54-95b93594deea
2022-12-07 15:59:11,147 INFO [shutdown-hook-0] util.ShutdownHookManager (Logging.scala:logInfo(57)): Deleting directory /tmp/spark-4c587f79-b77e-4212-bf54-95b93594deea/pyspark-677e56c8-2549-4513-a1d9-a96b36052360
2022-12-07 15:59:11,151 INFO [shutdown-hook-0] impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(210)): Stopping s3a-file-system metrics system...
2022-12-07 15:59:11,152 INFO [shutdown-hook-0] impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(216)): s3a-file-system metrics system stopped.
2022-12-07 15:59:11,152 INFO [shutdown-hook-0] impl.MetricsSystemImpl (MetricsSystemImpl.java:shutdown(607)): s3a-file-system metrics system shutdown complete.
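The decisive line in the trace above is the final "Caused by: java.lang.ClassNotFoundException: org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider": the lock provider class is simply not on the job's classpath. The only Hudi jar passed via --extra-jars is hudi-spark3-bundle_2.12-0.9.0.jar, and judging by the package name the provider lives in the separate hudi-aws module, which only newer releases appear to package; that would also be consistent with the issue not reproducing on 0.12.1 above. As a quick local sanity check (a sketch, not part of the Glue job; the jar path is an assumption), one can verify whether a given bundle packages the class:

# Check a locally downloaded bundle jar for the lock provider class entry.
import zipfile

BUNDLE_JAR = "hudi-spark3-bundle_2.12-0.9.0.jar"  # assumed local copy of the jar
LOCK_PROVIDER_ENTRY = "org/apache/hudi/aws/transaction/lock/DynamoDBBasedLockProvider.class"

with zipfile.ZipFile(BUNDLE_JAR) as jar:
    found = LOCK_PROVIDER_ENTRY in jar.namelist()

print(f"{LOCK_PROVIDER_ENTRY} present in {BUNDLE_JAR}: {found}")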
Message No older events at this moment. Retry Preparing ... Wed Dec 7 15:58:22 UTC 2022 /usr/bin/java -cp /tmp:/opt/amazon/conf:/opt/amazon/glue-manifest.jar com.amazonaws.services.glue.PrepareLaunch --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.minExecutors=1 --conf spark.dynamicAllocation.maxExecutors=2 --conf spark.executor.memory=10g --conf spark.executor.cores=8 --conf spark.driver.memory=10g --conf spark.default.parallelism=24 --conf spark.sql.shuffle.partitions=24 --conf spark.network.timeout=600 --enable-glue-datacatalog true --job-bookmark-option job-bookmark-enable --TempDir s3://aws-glue-assets-043916019468-us-west-2/temporary/ --extra-jars s3://glue-learn-begineers/jar/spark-avro_2.12-3.0.1.jar,s3://glue-learn-begineers/jar/hudi-spark3-bundle_2.12-0.9.0.jar --JOB_ID j_8cd1407036ab38dae74c05008aba629c7beab90ad673660ec0184b6fb1bcef49 --enable-metrics true --enable-spark-ui true --spark-event-logs-path s3://aws-glue-assets-043916019468-us-west-2/sparkHistoryLogs/ --enable-job-insights true --JOB_RUN_ID jr_1643f6c9a18daf0cabf2bcd3fe47356d3656835dda8716cf48e2c6c4e2af75df --additional-python-modules faker==11.3.0 --enable-continuous-cloudwatch-log true --scriptLocation s3://aws-glue-assets-043916019468-us-west-2/scripts/hudi-test.py --job-language python --JOB_NAME hudi-test 1670428703170 SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/opt/amazon/spark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/opt/amazon/lib/Log4j-slf4j-2.x.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/opt/amazon/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] Glue analyzer rules s3://aws-glue-jes-prod-us-west-2-assets/etl/analysis-rules/glue3/
INFO 2022-12-07 15:58:25,461 1 com.amazonaws.services.glue.utils.AWSClientUtils$ [main] AWSClientUtils: create aws log client with conf: proxy host null, proxy port -1
INFO 2022-12-07 15:58:25,689 229 com.amazonaws.services.glue.utils.AWSClientUtils$ [main] AWSClientUtils: getGlueClient. proxy host: null , port: -1
GlueTelemetry: Current region us-west-2
GlueTelemetry: Glue Endpoint https://glue.us-west-2.amazonaws.com
GlueTelemetry: Prod env...false
GlueTelemetry: is disabled
Glue ETL Marketplace - Start ETL connector activation process...
Glue ETL Marketplace - no connections are attached to job hudi-test, no need to download custom connector jar for it.
Glue ETL Marketplace - Retrieved no ETL connector jars, this may be due to no marketplace/custom connection attached to the job or failure of downloading them, please scroll back to the previous logs to find out the root cause. Container setup continues.
Glue ETL Marketplace - ETL connector activation process finished, container setup continues...
Download bucket: glue-learn-begineers key: jar/spark-avro_2.12-3.0.1.jar with usingProxy: false
Download bucket: glue-learn-begineers key: jar/hudi-spark3-bundle_2.12-0.9.0.jar with usingProxy: false
Download bucket: aws-glue-assets-043916019468-us-west-2 key: scripts/hudi-test.py with usingProxy: false
INFO 2022-12-07 15:58:27,947 2487 com.amazonaws.services.glue.PythonModuleInstaller [main] pip3 install --user faker==11.3.0
INFO 2022-12-07 15:58:35,156 9696 com.amazonaws.services.glue.PythonModuleInstaller [main] Collecting faker==11.3.0 Downloading Faker-11.3.0-py3-none-any.whl (1.2 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 26.4 MB/s eta 0:00:00Requirement already satisfied: typing-extensions>=3.10.0.2 in /home/spark/.local/lib/python3.7/site-packages (from faker==11.3.0) (4.4.0)Collecting text-unidecode==1.3 Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.2/78.2 kB 17.4 MB/s eta 0:00:00Requirement already satisfied: python-dateutil>=2.4 in /home/spark/.local/lib/python3.7/site-packages (from faker==11.3.0) (2.8.2)Requirement already satisfied: six>=1.5 in /home/spark/.local/lib/python3.7/site-packages (from python-dateutil>=2.4->faker==11.3.0) (1.16.0)Installing collected packages: text-unidecode, fakerSuccessfully installed faker-11.3.0 text-unidecode-1.3
INFO 2022-12-07 15:58:35,156 9696 com.amazonaws.services.glue.PythonModuleInstaller [main] WARNING: The script faker is installed in '/home/spark/.local/bin' which is not on PATH. Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Launching ...
Wed Dec 7 15:58:35 UTC 2022
/usr/bin/java -cp /tmp:/opt/amazon/conf:/opt/amazon/glue-manifest.jar:/tmp/ -Dlog4j.configuration=log4j -server -Xmx10g -XX:+UseG1GC -XX:MaxHeapFreeRatio=70 -XX:InitiatingHeapOccupancyPercent=45 -XX:OnOutOfMemoryError=kill -9 %p -XX:+UseCompressedOops -Djavax.net.ssl.trustStore=/opt/amazon/certs/ExternalAndAWSTrustStore.jks -Djavax.net.ssl.trustStoreType=JKS -Djavax.net.ssl.trustStorePassword=amazon -DRDS_ROOT_CERT_PATH=/opt/amazon/certs/rds-combined-ca-bundle.pem -DREDSHIFT_ROOT_CERT_PATH=/opt/amazon/certs/redshift-ssl-ca-cert.pem -DRDS_TRUSTSTORE_URL=file:/opt/amazon/certs/RDSTrustStore.jks -Dspark.network.timeout=600 -Dspark.metrics.conf..source.jvm.class=org.apache.spark.metrics.source.JvmSource -Dspark.eventLog.enabled=true -Dspark.driver.cores=4 -Dspark.metrics.conf..source.system.class=org.apache.spark.metrics.source.SystemMetricsSource -Dspark.glue.endpoint=https://glue-jes.us-west-2.amazonaws.com -Dspark.default.parallelism=8 -Dspark.sql.parquet.output.committer.class=com.amazon.emr.committer.EmrOptimizedSparkSqlParquetOutputCommitter -Dspark.cloudwatch.logging.conf.jobRunId=jr_1643f6c9a18daf0cabf2bcd3fe47356d3656835dda8716cf48e2c6c4e2af75df -Dspark.glue.enable-job-insights=true -Dspark.glueExceptionAnalysisEventLog.dir=/tmp/glue-exception-analysis-logs/ -Dspark.hadoop.mapred.output.direct.EmrFileSystem=true -Dspark.hadoop.aws.glue.endpoint=https://glue.us-west-2.amazonaws.com -Dspark.glueJobInsights.enabled=true -Dspark.glue.USE_PROXY=false -Dspark.eventLog.dir=/tmp/spark-event-logs/ -Dspark.extraListeners=com.amazonaws.services.glueexceptionanalysis.GlueExceptionAnalysisListener -Dspark.hadoop.fs.s3.impl=com.amazon.ws.emr.hadoop.fs.EmrFileSystem -Dspark.authenticate.secret=
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Continuous Logging: Creating cloudwatch appender.
Continuous Logging: Creating cloudwatch appender.
Continuous Logging: Creating cloudwatch appender.
Continuous Logging: Creating cloudwatch appender.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
SLF4J: A number (14) of logging calls during the initialization phase have been intercepted and are
SLF4J: now being replayed. These are subject to the filtering rules of the underlying logging system.
SLF4J: See also http://www.slf4j.org/codes.html#replay
2022-12-07 15:58:38,910 INFO [main] util.PlatformInfo (PlatformInfo.java:getJobFlowId(56)): Unable to read clusterId from http://localhost:8321/configuration, trying extra instance data file: /var/lib/instance-controller/extraInstanceData.json
2022-12-07 15:58:38,961 INFO [main] util.PlatformInfo (PlatformInfo.java:getJobFlowId(63)): Unable to read clusterId from /var/lib/instance-controller/extraInstanceData.json, trying EMR job-flow data file: /var/lib/info/job-flow.json
2022-12-07 15:58:38,961 INFO [main] util.PlatformInfo (PlatformInfo.java:getJobFlowId(71)): Unable to read clusterId from /var/lib/info/job-flow.json, out of places to look
2022-12-07 15:58:40,305 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): SLF4J: Class path contains multiple SLF4J bindings.
2022-12-07 15:58:40,305 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): SLF4J: Found binding in [jar:file:/opt/amazon/spark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
2022-12-07 15:58:40,305 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): SLF4J: Found binding in [jar:file:/opt/amazon/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
2022-12-07 15:58:40,306 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): SLF4J: Found binding in [jar:file:/opt/amazon/lib/Log4j-slf4j-2.x.jar!/org/slf4j/impl/StaticLoggerBinder.class]
2022-12-07 15:58:40,306 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
2022-12-07 15:58:40,320 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2022-12-07 15:58:40,667 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
2022-12-07 15:58:40,667 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): log4j:WARN Please initialize the log4j system properly.
2022-12-07 15:58:40,667 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
2022-12-07 15:58:41,386 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): ==== Netty Server Started ====
2022-12-07 15:58:41,389 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): input rulesFilePath=/tmp/glue_app_analyzer_rules
2022-12-07 15:58:41,814 INFO [main] glue.SafeLogging (Logging.scala:logInfo(57)): Initializing logging subsystem
2022-12-07 15:58:43,864 INFO [Thread-12] spark.SparkContext (Logging.scala:logInfo(57)): Running Spark version 3.1.1-amzn-0
2022-12-07 15:58:43,911 INFO [Thread-12] resource.ResourceUtils (Logging.scala:logInfo(57)): ==============================================================
2022-12-07 15:58:43,912 INFO [Thread-12] resource.ResourceUtils (Logging.scala:logInfo(57)): No custom resources configured for spark.driver.
2022-12-07 15:58:43,913 INFO [Thread-12] resource.ResourceUtils (Logging.scala:logInfo(57)): ==============================================================
2022-12-07 15:58:43,913 INFO [Thread-12] spark.SparkContext (Logging.scala:logInfo(57)): Submitted application: nativespark-hudi-test-jr_1643f6c9a18daf0cabf2bcd3fe47356d3656835dda8716cf48e2c6c4e2af75df
2022-12-07 15:58:43,935 INFO [Thread-12] resource.ResourceProfile (Logging.scala:logInfo(57)): Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 4, script: , vendor: , memory -> name: memory, amount: 10240, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
2022-12-07 15:58:43,953 INFO [Thread-12] resource.ResourceProfile (Logging.scala:logInfo(57)): Limiting resource is cpus at 4 tasks per executor
2022-12-07 15:58:43,956 INFO [Thread-12] resource.ResourceProfileManager (Logging.scala:logInfo(57)): Added ResourceProfile id: 0
2022-12-07 15:58:44,032 INFO [Thread-12] spark.SecurityManager (Logging.scala:logInfo(57)): Changing view acls to: spark
2022-12-07 15:58:44,033 INFO [Thread-12] spark.SecurityManager (Logging.scala:logInfo(57)): Changing modify acls to: spark
2022-12-07 15:58:44,034 INFO [Thread-12] spark.SecurityManager (Logging.scala:logInfo(57)): Changing view acls groups to:
2022-12-07 15:58:44,034 INFO [Thread-12] spark.SecurityManager (Logging.scala:logInfo(57)): Changing modify acls groups to:
2022-12-07 15:58:44,035 INFO [Thread-12] spark.SecurityManager (Logging.scala:logInfo(57)): SecurityManager: authentication enabled; ui acls disabled; users with view permissions: Set(spark); groups with view permissions: Set(); users with modify permissions: Set(spark); groups with modify permissions: Set()
2022-12-07 15:58:44,361 INFO [Thread-12] util.Utils (Logging.scala:logInfo(57)): Successfully started service 'sparkDriver' on port 34581.
2022-12-07 15:58:44,409 INFO [Thread-12] spark.SparkEnv (Logging.scala:logInfo(57)): Registering MapOutputTracker
2022-12-07 15:58:44,446 INFO [Thread-12] spark.SparkEnv (Logging.scala:logInfo(57)): Registering BlockManagerMaster
2022-12-07 15:58:44,480 INFO [Thread-12] storage.BlockManagerMasterEndpoint (Logging.scala:logInfo(57)): Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2022-12-07 15:58:44,481 INFO [Thread-12] storage.BlockManagerMasterEndpoint (Logging.scala:logInfo(57)): BlockManagerMasterEndpoint up
2022-12-07 15:58:44,486 INFO [Thread-12] spark.SparkEnv (Logging.scala:logInfo(57)): Registering BlockManagerMasterHeartbeat
2022-12-07 15:58:44,508 INFO [Thread-12] storage.DiskBlockManager (Logging.scala:logInfo(57)): Created local directory at /tmp/blockmgr-28d77124-2a0a-4c4f-ae8f-d15612884b2a
2022-12-07 15:58:44,543 INFO [Thread-12] memory.MemoryStore (Logging.scala:logInfo(57)): MemoryStore started with capacity 5.8 GiB
2022-12-07 15:58:44,566 INFO [Thread-12] spark.SparkEnv (Logging.scala:logInfo(57)): Registering OutputCommitCoordinator
2022-12-07 15:58:44,715 INFO [Thread-12] scheduler.JESSchedulerBackend$JESAsSchedulerBackendEndpoint (Logging.scala:logInfo(57)): JESAsSchedulerBackendEndpoint
2022-12-07 15:58:44,716 INFO [Thread-12] scheduler.JESSchedulerBackend (Logging.scala:logInfo(57)): JESSchedulerBackend
2022-12-07 15:58:44,723 INFO [Thread-12] scheduler.JESSchedulerBackend (JESClusterManager.scala:
2022-12-07 15:59:06,985 INFO [Thread-12] codegen.CodeGenerator (Logging.scala:logInfo(57)): Code generated in 323.430584 ms
2022-12-07 15:59:07,239 INFO [Thread-12] embedded.EmbeddedTimelineService (EmbeddedTimelineServerHelper.java:startTimelineService(67)): Starting Timeline service !!
2022-12-07 15:59:07,241 INFO [Thread-12] embedded.EmbeddedTimelineService (EmbeddedTimelineService.java:setHostAddr(100)): Overriding hostIp to (172.36.33.155) found in spark-conf. It was null
2022-12-07 15:59:07,260 INFO [Thread-12] view.FileSystemViewManager (FileSystemViewManager.java:createViewManager(232)): Creating View Manager with storage type :MEMORY
2022-12-07 15:59:07,262 INFO [Thread-12] view.FileSystemViewManager (FileSystemViewManager.java:createViewManager(244)): Creating in-memory based Table View
2022-12-07 15:59:07,287 INFO [Thread-12] util.log (Log.java:initialized(193)): Logging initialized @32060ms to org.apache.hudi.org.apache.jetty.util.log.Slf4jLog
2022-12-07 15:59:07,569 INFO [Thread-12] javalin.Javalin (Javalin.java:start(134)):
[Javalin ASCII art startup banner]
https://javalin.io/documentation
2022-12-07 15:59:07,571 INFO [Thread-12] javalin.Javalin (Javalin.java:start(139)): Starting Javalin ...
2022-12-07 15:59:07,731 INFO [dispatcher-CoarseGrainedScheduler] scheduler.JESSchedulerBackend$JESAsSchedulerBackendEndpoint (Logging.scala:logInfo(57)): Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.34.179.199:52900) with ID 1, ResourceProfileId 0
2022-12-07 15:59:07,732 INFO [spark-listener-group-shared] scheduler.ExecutorEventListener (Logging.scala:logInfo(57)): Got executor added event for 1 @ 1670428747732
2022-12-07 15:59:07,733 INFO [spark-listener-group-shared] glue.ExecutorTaskManagement (Logging.scala:logInfo(57)): connected executor 1
2022-12-07 15:59:07,843 INFO [Thread-12] javalin.Javalin (JettyServerUtil.kt:initialize(113)): Listening on http://localhost:45551/
2022-12-07 15:59:07,843 INFO [Thread-12] javalin.Javalin (Javalin.java:start(149)): Javalin started in 280ms \o/
2022-12-07 15:59:07,843 INFO [Thread-12] service.TimelineService (TimelineService.java:startService(280)): Starting Timeline server on port :45551
2022-12-07 15:59:07,843 INFO [Thread-12] embedded.EmbeddedTimelineService (EmbeddedTimelineService.java:startServer(95)): Started embedded timeline server at 172.36.33.155:45551
2022-12-07 15:59:07,859 INFO [Thread-12] hudi.HoodieSparkSqlWriter$ (HoodieSparkSqlWriter.scala:isAsyncCompactionEnabled(675)): Config.inlineCompactionEnabled ? false
2022-12-07 15:59:07,859 INFO [Thread-12] hudi.HoodieSparkSqlWriter$ (HoodieSparkSqlWriter.scala:isAsyncClusteringEnabled(686)): Config.asyncClusteringEnabled ? false
2022-12-07 15:59:07,860 INFO [Thread-12] table.HoodieTableMetaClient (HoodieTableMetaClient.java:
2022-12-07 15:59:10,836 ERROR [main] glueexceptionanalysis.GlueExceptionAnalysisListener (Logging.scala:logError(9)): [Glue Exception Analysis]
{
"Event": "GlueETLJobExceptionEvent",
"Timestamp": 1670428750832,
"Failure Reason": "Traceback (most recent call last):\n File \"/tmp/hudi-test.py\", line 114, in
2022-12-07 15:59:10,935 ERROR [main] glueexceptionanalysis.GlueExceptionAnalysisListener (Logging.scala:logError(9)): [Glue Exception Analysis] Last Executed Line number from script hudi-test.py: 114
2022-12-07 15:59:10,938 INFO [main] glue.ResourceManagerSocketWriter (SocketWriter.scala:start(34)): Socket client for com.amazonaws.services.glue.ResourceManagerSocketWriter started correctly
2022-12-07 15:59:10,946 INFO [main] glue.ProcessLauncher (Logging.scala:logInfo(57)): postprocessing
2022-12-07 15:59:10,947 INFO [main] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): Analyzer: Error stopping analyzer process null
2022-12-07 15:59:10,949 INFO [main] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): Analyzer: Error terminating analyzer process null
2022-12-07 15:59:10,950 INFO [main] glue.LogPusher (Logging.scala:logInfo(57)): stopping
Continuous Logging: Shutting down cloudwatch appender.
Continuous Logging: Shutting down cloudwatch appender.
Continuous Logging: Shutting down cloudwatch appender.
Continuous Logging: Shutting down cloudwatch appender.
2022-12-07 15:59:10,967 INFO [shutdown-hook-0] spark.SparkContext (Logging.scala:logInfo(57)): Invoking stop() from shutdown hook
2022-12-07 15:59:10,973 INFO [spark-listener-group-shared] glueexceptionanalysis.EventLogFileWriter (FileWriter.scala:stop(70)): Logs, events processed and insights are written to file /tmp/glue-exception-analysis-logs/spark-application-1670428724710
2022-12-07 15:59:10,980 INFO [shutdown-hook-0] scheduler.JESSchedulerBackend (Logging.scala:logInfo(57)): Shutting down all executors
2022-12-07 15:59:10,986 INFO [dispatcher-CoarseGrainedScheduler] scheduler.JESSchedulerBackend$JESAsSchedulerBackendEndpoint (Logging.scala:logInfo(57)): Asking each executor to shut down
2022-12-07 15:59:11,010 INFO [dispatcher-event-loop-0] spark.MapOutputTrackerMasterEndpoint (Logging.scala:logInfo(57)): MapOutputTrackerMasterEndpoint stopped!
2022-12-07 15:59:11,029 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): GlueJobEventsInputHandler Interrupted Last event of a job received. Stop event handling.
2022-12-07 15:59:11,030 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): Stopping - analyzer netty server
2022-12-07 15:59:11,037 INFO [shutdown-hook-0] memory.MemoryStore (Logging.scala:logInfo(57)): MemoryStore cleared
2022-12-07 15:59:11,037 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): Stopped - analyzer netty server
2022-12-07 15:59:11,037 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): Processing remaining events in queue before termination
2022-12-07 15:59:11,037 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): Closing rules engine
2022-12-07 15:59:11,037 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): Stopping analyzer log writer
2022-12-07 15:59:11,037 INFO [shutdown-hook-0] storage.BlockManager (Logging.scala:logInfo(57)): BlockManager stopped
2022-12-07 15:59:11,049 INFO [shutdown-hook-0] storage.BlockManagerMaster (Logging.scala:logInfo(57)): BlockManagerMaster stopped
2022-12-07 15:59:11,053 INFO [dispatcher-event-loop-1] scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint (Logging.scala:logInfo(57)): OutputCommitCoordinator stopped!
2022-12-07 15:59:11,061 INFO [shutdown-hook-0] spark.SparkContext (Logging.scala:logInfo(57)): Successfully stopped SparkContext
2022-12-07 15:59:11,061 INFO [shutdown-hook-0] glue.LogPusher (Logging.scala:logInfo(57)): uploading /tmp/spark-event-logs/ to s3://aws-glue-assets-043916019468-us-west-2/sparkHistoryLogs/
2022-12-07 15:59:11,111 INFO [shutdown-hook-0] s3n.MultipartUploadOutputStream (MultipartUploadOutputStream.java:close(418)): close closed:false s3://aws-glue-assets-043916019468-us-west-2/sparkHistoryLogs/spark-application-1670428724710
2022-12-07 15:59:11,142 INFO [shutdown-hook-0] util.ShutdownHookManager (Logging.scala:logInfo(57)): Shutdown hook called
2022-12-07 15:59:11,143 INFO [shutdown-hook-0] util.ShutdownHookManager (Logging.scala:logInfo(57)): Deleting directory /tmp/spark-4c587f79-b77e-4212-bf54-95b93594deea
2022-12-07 15:59:11,147 INFO [shutdown-hook-0] util.ShutdownHookManager (Logging.scala:logInfo(57)): Deleting directory /tmp/spark-4c587f79-b77e-4212-bf54-95b93594deea/pyspark-677e56c8-2549-4513-a1d9-a96b36052360
2022-12-07 15:59:11,151 INFO [shutdown-hook-0] impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(210)): Stopping s3a-file-system metrics system...
2022-12-07 15:59:11,152 INFO [shutdown-hook-0] impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(216)): s3a-file-system metrics system stopped.
2022-12-07 15:59:11,152 INFO [shutdown-hook-0] impl.MetricsSystemImpl (MetricsSystemImpl.java:shutdown(607)): s3a-file-system metrics system shutdown complete.
I have added more details for you guys 🗡️
Link https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-hudi.html
Glue 4.0 now supports Hudi out of the box, but it fails when trying it with DynamoDB. I tried with Glue 3.0 as well, using a custom jar file; neither worked.
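For reference, per the AWS doc linked above, native Hudi support in Glue 4.0 is switched on through the --datalake-formats job parameter; a minimal sketch of the relevant job parameters (the serializer setting is the one that page calls out):

--datalake-formats hudi
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer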
Any updates?
@soumilshah1995 What you're hitting is a ClassNotFoundException: org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider. If you use the right Hudi bundle, this class should be present. In OSS, to make use of DynamoDBBasedLockProvider, you need to have hudi-aws-bundle on the classpath as well. This looks like a default Glue setup issue, and I would encourage you to raise an AWS support ticket. We have been running continuous tests with different lock providers for multi-writer scenarios successfully.
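A minimal sketch of what that looks like as a Glue job parameter, assuming the OSS bundles are staged in an S3 bucket you own (the bucket path and jar versions below are placeholders):

--extra-jars s3://your-bucket/jars/hudi-spark3-bundle_2.12-<version>.jar,s3://your-bucket/jars/hudi-aws-bundle-<version>.jar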
Hi, not sure if you are aware, but Glue 4.0 does not require you to specify a classpath or jar files; Hudi is now natively supported in Glue 4.0. Feel free to give it a try if needed; I have provided detailed steps with instructions to replicate the error. I would appreciate the help if someone could try these and let me know exactly what needs to be done @nsivabalan
closing this ticket :D
Anyone who is curious, here is the code that works:
try:
    import os
    import sys
    import uuid
    import boto3
    import pyspark
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, asc, desc
    from awsglue.utils import getResolvedOptions
    from awsglue.dynamicframe import DynamicFrame
    from awsglue.context import GlueContext
    from faker import Faker

    print("All modules are loaded .....")
except Exception as e:
    print("Some modules are missing {} ".format(e))
# ----------------------------------------------------------------------------------------
# Settings
# -----------------------------------------------------------------------------------------
database_name1 = "hudidb"
table_name = "hudi_table"
base_s3_path = "s3a://glue-learn-begineers"
final_base_path = "{base_s3_path}/{table_name}".format(
    base_s3_path=base_s3_path, table_name=table_name
)
curr_session = boto3.session.Session()
curr_region = curr_session.region_name
# ----------------------------------------------------------------------------------------------------
global faker
faker = Faker()
class DataGenerator(object):

    @staticmethod
    def get_data():
        return [
            (
                x,
                faker.name(),
                faker.random_element(elements=('IT', 'HR', 'Sales', 'Marketing')),
                faker.random_element(elements=('CA', 'NY', 'TX', 'FL', 'IL', 'RJ')),
                faker.random_int(min=10000, max=150000),
                faker.random_int(min=18, max=60),
                faker.random_int(min=0, max=100000),
                faker.unix_time()
            ) for x in range(5)
        ]
def create_spark_session():
    spark = SparkSession \
        .builder \
        .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
        .config('spark.sql.hive.convertMetastoreParquet', 'false') \
        .config('spark.sql.legacy.pathOptionBehavior.enabled', 'true') \
        .getOrCreate()
    return spark
spark = create_spark_session()
sc = spark.sparkContext
glueContext = GlueContext(sc)
"""
CHOOSE ONE
"hoodie.datasource.write.storage.type": "MERGE_ON_READ",
"hoodie.datasource.write.storage.type": "COPY_ON_WRITE",
"""
hudi_options = {
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.storage.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.recordkey.field': 'emp_id',
    'hoodie.datasource.write.table.name': table_name,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'state',

    # Hive / Glue catalog sync
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.mode': 'hms',
    'hoodie.datasource.hive_sync.sync_as_datasource': 'false',
    'hoodie.datasource.hive_sync.database': database_name1,
    'hoodie.datasource.hive_sync.table': table_name,
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.write.hive_style_partitioning': 'true',

    # DynamoDB-based lock provider for optimistic concurrency control
    'hoodie.write.concurrency.mode': 'optimistic_concurrency_control',
    'hoodie.cleaner.policy.failed.writes': 'LAZY',
    'hoodie.write.lock.provider': 'org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider',
    'hoodie.write.lock.dynamodb.table': 'hudi-blog-lock-table',
    'hoodie.write.lock.dynamodb.partition_key': 'tablename',
    'hoodie.write.lock.dynamodb.region': '{0}'.format(curr_region),
    'hoodie.write.lock.dynamodb.endpoint_url': 'dynamodb.{0}.amazonaws.com'.format(curr_region),
    'hoodie.write.lock.dynamodb.billing_mode': 'PAY_PER_REQUEST',
    'hoodie.bulkinsert.shuffle.parallelism': 2000
}
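# (Hedged aside, not part of the original script.) With optimistic concurrency
# control and LAZY failed-writes cleaning configured above, Hudi takes a lock
# per commit in the DynamoDB table named by 'hoodie.write.lock.dynamodb.table',
# creating it on first use with the configured billing mode. A quick boto3
# sanity check that the lock table is reachable:
try:
    dynamodb = boto3.client('dynamodb', region_name=curr_region)
    table_status = dynamodb.describe_table(TableName='hudi-blog-lock-table')['Table']['TableStatus']
    print("Lock table status: {}".format(table_status))
except Exception as e:
    print("Lock table not reachable yet (Hudi creates it on first commit): {}".format(e))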
# ====================================================
"""Create Spark Data Frame """
# ====================================================
data = DataGenerator.get_data()
columns = ["emp_id", "employee_name", "department", "state", "salary", "age", "bonus", "ts"]
df = spark.createDataFrame(data=data, schema=columns)
df.write.format("hudi").options(**hudi_options).mode("overwrite").save(final_base_path)
# ====================================================
"""APPEND """
# ====================================================
impleDataUpd = [
(6, "This is APPEND", "Sales", "RJ", 81000, 30, 23000, 827307999),
(7, "This is APPEND", "Engineering", "RJ", 79000, 53, 15000, 1627694678),
]
columns = ["emp_id", "employee_name", "department", "state", "salary", "age", "bonus", "ts"]
usr_up_df = spark.createDataFrame(data=impleDataUpd, schema=columns)
usr_up_df.write.format("hudi").options(**hudi_options).mode("append").save(final_base_path)
# ====================================================
"""UPDATE """
# ====================================================
impleDataUpd = [
(3, "this is update on data lake", "Sales", "RJ", 81000, 30, 23000, 827307999),
]
columns = ["emp_id", "employee_name", "department", "state", "salary", "age", "bonus", "ts"]
usr_up_df = spark.createDataFrame(data=impleDataUpd, schema=columns)
usr_up_df.write.format("hudi").options(**hudi_options).mode("append").save(final_base_path)
# ====================================================
"""SOFT DELETE """
# ====================================================
# from pyspark.sql.functions import lit
# from functools import reduce
#
#
# print("\n")
# soft_delete_ds = spark.sql("SELECT * FROM hudidb.hudi_table_rt where emp_id='4' ")
# print(soft_delete_ds.show())
# print("\n")
#
# # prepare the soft deletes by ensuring the appropriate fields are nullified
# meta_columns = ["_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key","_hoodie_partition_path", "_hoodie_file_name"]
# excluded_columns = meta_columns + ["ts", "emp_id", "partitionpath"]
# nullify_columns = list(filter(lambda field: field[0] not in excluded_columns, list(map(lambda field: (field.name, field.dataType), soft_delete_ds.schema.fields))))
#
# soft_delete_df = reduce(lambda df, col: df.withColumn(col[0], lit(None).cast(col[1])),
# nullify_columns, reduce(lambda df,col: df.drop(col[0]), meta_columns, soft_delete_ds))
#
#
# soft_delete_df.write.format("hudi").options(**hudi_options).mode("append").save(final_base_path)
#
#
#
# # ====================================================
# """HARD DELETE """
# # ====================================================
#
# ds = spark.sql("SELECT * FROM hudidb.hudi_table_rt where emp_id='2' ")
#
# hudi_hard_delete_options = {
# 'hoodie.table.name': table_name,
# 'hoodie.datasource.write.recordkey.field': 'emp_id',
# 'hoodie.datasource.write.table.name': table_name,
# 'hoodie.datasource.write.operation': 'delete',
# 'hoodie.datasource.write.precombine.field': 'ts',
# 'hoodie.upsert.shuffle.parallelism': 2,
# 'hoodie.insert.shuffle.parallelism': 2
# }
#
# ds.write.format("hudi").options(**hudi_hard_delete_options).mode("append").save(final_base_path)
Sorry, did we find the root cause of why it was failing earlier and how it later got resolved? Did a newer version of Hudi work, or were any config changes needed?
Tips before filing an issue
Hudi DynamoDB Concurrency Control
Have you gone through our FAQs?
Yes, I have been through all the blog posts and articles
https://hudi.apache.org/docs/next/concurrency_control/
https://hudi.apache.org/docs/configurations/#DynamoDB-based-Locks-Configurations
Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.
Yes
Describe the problem you faced
The code works great on its own. I am trying to integrate with DynamoDB for concurrency control, but Hudi throws an error; here is the code, attached below.
try:
    import os
    import sys
    import uuid
    import boto3
    import pyspark
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, asc, desc
    from awsglue.utils import getResolvedOptions
    from awsglue.dynamicframe import DynamicFrame
    from awsglue.context import GlueContext
    from faker import Faker

    print("All modules are loaded .....")
except Exception as e:
    print("Some modules are missing {} ".format(e))

os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

args = getResolvedOptions(sys.argv, ['base_s3_path', 'table_name'])
base_s3_path = args['base_s3_path']
table_name = args['table_name']

final_base_path = "{base_s3_path}/tmp/{table_name}".format(
    base_s3_path=base_s3_path, table_name=table_name
)
target_s3_path = "{base_s3_path}/tmp/hudi_{table_name}_target".format(
    base_s3_path=base_s3_path, table_name=table_name
)
database_name1 = "mydb"
curr_region = 'us-east-1'

global faker
faker = Faker()


class DataGenerator(object):

    @staticmethod
    def get_data():
        # Same generator as in the working snippet above
        return [
            (
                x,
                faker.name(),
                faker.random_element(elements=('IT', 'HR', 'Sales', 'Marketing')),
                faker.random_element(elements=('CA', 'NY', 'TX', 'FL', 'IL', 'RJ')),
                faker.random_int(min=10000, max=150000),
                faker.random_int(min=18, max=60),
                faker.random_int(min=0, max=100000),
                faker.unix_time()
            ) for x in range(5)
        ]


def create_spark_session():
    spark = SparkSession \
        .builder \
        .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
        .getOrCreate()
    return spark


spark = create_spark_session()
sc = spark.sparkContext
glueContext = GlueContext(sc)

hudi_options = {
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.recordkey.field': 'emp_id',
    'hoodie.datasource.write.partitionpath.field': 'state',
    'hoodie.datasource.write.table.name': table_name,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'ts',
}
# ====================================================
"""Create Spark Data Frame """
# ====================================================
data = DataGenerator.get_data()
columns = ["emp_id", "employee_name", "department", "state", "salary", "age", "bonus", "ts"]
df = spark.createDataFrame(data=data, schema=columns)
df.write.format("hudi").options(**hudi_options).mode("overwrite").save(final_base_path)
# ====================================================
"""APPEND """
# ====================================================
impleDataUpd = [
(3, "xxx", "Sales", "RJ", 81000, 30, 23000, 827307999),
(7, "x change", "Engineering", "RJ", 79000, 53, 15000, 1627694678),
]
#
columns = ["emp_id", "employee_name", "department", "state", "salary", "age", "bonus", "ts"]
usr_up_df = spark.createDataFrame(data=impleDataUpd, schema=columns)
usr_up_df.write.format("hudi").options(**hudi_options).mode("append").save(final_base_path)
# ====================================================
final_read_df = spark.read.format("hudi").load(final_base_path)
final_read_df.createOrReplaceTempView("hudi_users_view")
glueContext.purge_s3_path(target_s3_path,{"retentionPeriod": 0, "excludeStorageClasses": ["STANDARD_IA"]} )
# #
spark.sql(f"CREATE DATABASE IF NOT EXISTS hudi_demo")
spark.sql(f"DROP TABLE IF EXISTS hudi_demo.hudi_users")
spark.sql(f"CREATE TABLE IF NOT EXISTS hudi_demo.hudi_users USING PARQUET LOCATION '{target_s3_path}' as (SELECT * from hudi_users_view)")