apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] DynamoDB Lock provider throws error while writing #7286

Closed · soumilshah1995 closed this issue 1 year ago

soumilshah1995 commented 1 year ago

HUDI DynamoDB Concurrency Control

**Describe the problem you faced**

```python
try:
    # import guard from the original script; the exact imports were truncated
    # in the report and are inferred here from how they are used below
    import sys
    import os
    from faker import Faker
    from pyspark.sql import SparkSession
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
except Exception as e:
    print("Some modules are missing {} ".format(e))

os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

args = getResolvedOptions(sys.argv, ['base_s3_path', 'table_name'])

base_s3_path = args['base_s3_path']
table_name = args['table_name']

final_base_path = "{base_s3_path}/tmp/{table_name}".format(
    base_s3_path=base_s3_path, table_name=table_name
)
target_s3_path = "{base_s3_path}/tmp/hudi_{table_name}_target".format(
    base_s3_path=base_s3_path, table_name=table_name
)
database_name1 = "mydb"
curr_region = 'us-east-1'

global faker
faker = Faker()


class DataGenerator(object):

    @staticmethod
    def get_data():
        return [
            (
                x,
                faker.name(),
                faker.random_element(elements=('IT', 'HR', 'Sales', 'Marketing')),
                faker.random_element(elements=('CA', 'NY', 'TX', 'FL', 'IL', 'RJ')),
                faker.random_int(min=10000, max=150000),
                faker.random_int(min=18, max=60),
                faker.random_int(min=0, max=100000),
                faker.unix_time()
            ) for x in range(10)
        ]


def create_spark_session():
    spark = SparkSession \
        .builder \
        .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
        .getOrCreate()
    return spark


spark = create_spark_session()
sc = spark.sparkContext
glueContext = GlueContext(sc)
```

```python
hudi_options = {
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.recordkey.field': 'emp_id',
    'hoodie.datasource.write.partitionpath.field': 'state',
    'hoodie.datasource.write.table.name': table_name,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'ts',

    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.mode': 'hms',
    'hoodie.datasource.hive_sync.sync_as_datasource': 'false',
    'hoodie.datasource.hive_sync.database': database_name1,
    'hoodie.datasource.hive_sync.table': table_name,
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.hive_sync.partition_fields': 'state',
    'hoodie.datasource.write.hive_style_partitioning': 'true',

    # optimistic concurrency control via the DynamoDB lock provider
    'hoodie.write.concurrency.mode': 'optimistic_concurrency_control',
    'hoodie.cleaner.policy.failed.writes': 'LAZY',
    'hoodie.write.lock.provider': 'org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider',
    'hoodie.write.lock.dynamodb.table': 'hudi-blog-lock-table',
    'hoodie.write.lock.dynamodb.partition_key': 'tablename',
    'hoodie.write.lock.dynamodb.region': '{0}'.format(curr_region),
    'hoodie.aws.access.key': 'XXXXXXXXXXXXXXXX',
    'hoodie.aws.secret.key': 'XXXXXXXXXXXXXXXXX',
}
```
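Not part of the original script, but for context: a minimal boto3 sketch of pre-creating the lock table named above, so the job role only needs read/write access to it. It assumes, per the Hudi docs, that the lock table uses a String partition key attribute named `key` (Hudi can also create the table itself if it does not exist).

```python
# Hypothetical helper, not from the issue: pre-create the DynamoDB lock table.
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")
dynamodb.create_table(
    TableName="hudi-blog-lock-table",
    # assumption per Hudi docs: the hash key attribute must be named "key"
    KeySchema=[{"AttributeName": "key", "KeyType": "HASH"}],
    AttributeDefinitions=[{"AttributeName": "key", "AttributeType": "S"}],
    BillingMode="PAY_PER_REQUEST",
)
# wait until the table is ready before the Hudi writer tries to lock on it
dynamodb.get_waiter("table_exists").wait(TableName="hudi-blog-lock-table")
```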

```python
# ====================================================
# Create Spark Data Frame
# ====================================================
data = DataGenerator.get_data()
columns = ["emp_id", "employee_name", "department", "state", "salary", "age", "bonus", "ts"]
df = spark.createDataFrame(data=data, schema=columns)
df.write.format("hudi").options(**hudi_options).mode("overwrite").save(final_base_path)

# ====================================================
# APPEND
# ====================================================
impleDataUpd = [
    (3, "xxx", "Sales", "RJ", 81000, 30, 23000, 827307999),
    (7, "x change", "Engineering", "RJ", 79000, 53, 15000, 1627694678),
]

columns = ["emp_id", "employee_name", "department", "state", "salary", "age", "bonus", "ts"]
usr_up_df = spark.createDataFrame(data=impleDataUpd, schema=columns)
usr_up_df.write.format("hudi").options(**hudi_options).mode("append").save(final_base_path)

# ====================================================
# Read back, then copy into a plain Parquet table
# ====================================================
final_read_df = spark.read.format("hudi").load(final_base_path)
final_read_df.createOrReplaceTempView("hudi_users_view")

glueContext.purge_s3_path(target_s3_path, {"retentionPeriod": 0, "excludeStorageClasses": ["STANDARD_IA"]})

spark.sql("CREATE DATABASE IF NOT EXISTS hudi_demo")
spark.sql("DROP TABLE IF EXISTS hudi_demo.hudi_users")
spark.sql(f"CREATE TABLE IF NOT EXISTS hudi_demo.hudi_users USING PARQUET LOCATION '{target_s3_path}' AS (SELECT * FROM hudi_users_view)")
```


![image](https://user-images.githubusercontent.com/39345855/203575607-b7d9f884-66ae-4df3-a361-d62a44d1070b.png)

![image](https://user-images.githubusercontent.com/39345855/203575719-346c9fc0-d85c-40b3-a903-02566f899068.png)

### Please note: the error arises only when the DynamoDB lock provider is added; otherwise all the settings work as expected
The JAR files I am using can be found here:

https://github.com/soumilshah1995/getting-started-with-pyspark-and-apache-hudi-glue-/tree/main/Video%202/jars

* hudi-spark3-bundle_2.12-0.9.0.jar 
* spark-avro_2.12-3.0.1.jar

Glue version: 3.0
codope commented 1 year ago

@soumilshah1995 Can you please share the full stacktrace? Also, I don't see this required config being set https://hudi.apache.org/docs/configurations#hoodiewritelockdynamodbendpoint_url
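For reference, that config would slot into the reporter's `hudi_options` like so (a sketch; the us-east-1 endpoint value below is an assumed example):

```python
# Assumed example value; see the linked Hudi config docs for the exact setting.
hudi_options['hoodie.write.lock.dynamodb.endpoint_url'] = 'https://dynamodb.us-east-1.amazonaws.com'
```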

soumilshah1995 commented 1 year ago

> @soumilshah1995 Can you please share the full stacktrace? Also, I don't see this required config being set https://hudi.apache.org/docs/configurations#hoodiewritelockdynamodbendpoint_url

Bro, I don't have logs, but I tried a lot and it doesn't function as described; you can test it yourself if necessary. By the way, I am a YouTuber and I really love Apache Hudi. Great job on that, and thank you for making it open source.

nsivabalan commented 1 year ago

@soumilshah1995: can you share the driver logs where you see the full stack trace? We want to see what exception is thrown.

xushiyan commented 1 year ago

@codope let's try to reproduce it with the code snippet. Noticed that the reported version is 0.9.0; we can see if things work fine with 0.12.x. If yes, we can share the working setup with @soumilshah1995 here.
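As a sketch of what that retest could look like on Glue, the job's extra jars can be repointed at a 0.12.1 bundle via boto3 (the jar's S3 path is a placeholder; the job name hudi-test comes from the logs in this thread; the jar name follows the hudi-spark3-bundle_2.12:0.12.1 Maven coordinates):

```python
# Hedged sketch: repoint the Glue job's --extra-jars at a newer Hudi bundle.
import boto3

glue = boto3.client("glue", region_name="us-west-2")
job = glue.get_job(JobName="hudi-test")["Job"]
default_args = job.get("DefaultArguments", {})
default_args["--extra-jars"] = "s3://my-jar-bucket/jars/hudi-spark3-bundle_2.12-0.12.1.jar"  # placeholder path
# update_job replaces the job definition; unspecified fields reset to defaults,
# so in practice copy over every field you care about, not just these three.
glue.update_job(
    JobName="hudi-test",
    JobUpdate={
        "Role": job["Role"],
        "Command": job["Command"],
        "DefaultArguments": default_args,
    },
)
```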

codope commented 1 year ago

@nsivabalan confirmed that it was not reproducible with 0.12.1 in our Jenkins test infra.

soumilshah1995 commented 1 year ago

> the reported version is 0.9.0

Did you guys try it? Did it work for you? I am very curious. I am attaching more details shortly :D logs and info

soumilshah1995 commented 1 year ago

@codope and @nsivabalan

here are more details


```
Preparing ...
Wed Dec  7 15:58:22 UTC 2022
/usr/bin/java -cp /tmp:/opt/amazon/conf:/opt/amazon/glue-manifest.jar com.amazonaws.services.glue.PrepareLaunch --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.minExecutors=1 --conf spark.dynamicAllocation.maxExecutors=2 --conf spark.executor.memory=10g --conf spark.executor.cores=8 --conf spark.driver.memory=10g --conf spark.default.parallelism=24 --conf spark.sql.shuffle.partitions=24 --conf spark.network.timeout=600   --enable-glue-datacatalog true --job-bookmark-option job-bookmark-enable --TempDir s3://aws-glue-assets-043916019468-us-west-2/temporary/ --extra-jars s3://glue-learn-begineers/jar/spark-avro_2.12-3.0.1.jar,s3://glue-learn-begineers/jar/hudi-spark3-bundle_2.12-0.9.0.jar  --JOB_ID j_8cd1407036ab38dae74c05008aba629c7beab90ad673660ec0184b6fb1bcef49 --enable-metrics true --enable-spark-ui true --spark-event-logs-path s3://aws-glue-assets-043916019468-us-west-2/sparkHistoryLogs/ --enable-job-insights true  --JOB_RUN_ID jr_1643f6c9a18daf0cabf2bcd3fe47356d3656835dda8716cf48e2c6c4e2af75df --additional-python-modules faker==11.3.0 --enable-continuous-cloudwatch-log true --scriptLocation s3://aws-glue-assets-043916019468-us-west-2/scripts/hudi-test.py  --job-language python --JOB_NAME hudi-test
1670428703170
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/amazon/spark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/amazon/lib/Log4j-slf4j-2.x.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/amazon/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Glue analyzer rules s3://aws-glue-jes-prod-us-west-2-assets/etl/analysis-rules/glue3/
 
INFO    2022-12-07 15:58:25,461 1   com.amazonaws.services.glue.utils.AWSClientUtils$   [main]  AWSClientUtils: create aws log client with conf: proxy host null, proxy port -1
INFO    2022-12-07 15:58:25,689 229 com.amazonaws.services.glue.utils.AWSClientUtils$   [main]  AWSClientUtils: getGlueClient. proxy host: null , port: -1
GlueTelemetry: Current region us-west-2
GlueTelemetry: Glue Endpoint https://glue.us-west-2.amazonaws.com
GlueTelemetry: Prod env...false
GlueTelemetry: is disabled
Glue ETL Marketplace - Start ETL connector activation process...
Glue ETL Marketplace - no connections are attached to job hudi-test, no need to download custom connector jar for it.
Glue ETL Marketplace - Retrieved no ETL connector jars, this may be due to no marketplace/custom connection attached to the job or failure of downloading them, please scroll back to the previous logs to find out the root cause. Container setup continues.
Glue ETL Marketplace - ETL connector activation process finished, container setup continues...
Download bucket: glue-learn-begineers key: jar/spark-avro_2.12-3.0.1.jar with usingProxy: false
Download bucket: glue-learn-begineers key: jar/hudi-spark3-bundle_2.12-0.9.0.jar with usingProxy: false
Download bucket: aws-glue-assets-043916019468-us-west-2 key: scripts/hudi-test.py with usingProxy: false
INFO    2022-12-07 15:58:27,947 2487    com.amazonaws.services.glue.PythonModuleInstaller   [main]  pip3 install  --user faker==11.3.0
INFO    2022-12-07 15:58:35,156 9696    com.amazonaws.services.glue.PythonModuleInstaller   [main]  Collecting faker==11.3.0  Downloading Faker-11.3.0-py3-none-any.whl (1.2 MB)     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 26.4 MB/s eta 0:00:00Requirement already satisfied: typing-extensions>=3.10.0.2 in /home/spark/.local/lib/python3.7/site-packages (from faker==11.3.0) (4.4.0)Collecting text-unidecode==1.3  Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.2/78.2 kB 17.4 MB/s eta 0:00:00Requirement already satisfied: python-dateutil>=2.4 in /home/spark/.local/lib/python3.7/site-packages (from faker==11.3.0) (2.8.2)Requirement already satisfied: six>=1.5 in /home/spark/.local/lib/python3.7/site-packages (from python-dateutil>=2.4->faker==11.3.0) (1.16.0)Installing collected packages: text-unidecode, fakerSuccessfully installed faker-11.3.0 text-unidecode-1.3
INFO    2022-12-07 15:58:35,156 9696    com.amazonaws.services.glue.PythonModuleInstaller   [main]    WARNING: The script faker is installed in '/home/spark/.local/bin' which is not on PATH.  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Launching ...
Wed Dec  7 15:58:35 UTC 2022
/usr/bin/java -cp /tmp:/opt/amazon/conf:/opt/amazon/glue-manifest.jar:/tmp/* -Dlog4j.configuration=log4j -server -Xmx10g -XX:+UseG1GC -XX:MaxHeapFreeRatio=70 -XX:InitiatingHeapOccupancyPercent=45 -XX:OnOutOfMemoryError=kill -9 %p -XX:+UseCompressedOops -Djavax.net.ssl.trustStore=/opt/amazon/certs/ExternalAndAWSTrustStore.jks -Djavax.net.ssl.trustStoreType=JKS -Djavax.net.ssl.trustStorePassword=amazon -DRDS_ROOT_CERT_PATH=/opt/amazon/certs/rds-combined-ca-bundle.pem -DREDSHIFT_ROOT_CERT_PATH=/opt/amazon/certs/redshift-ssl-ca-cert.pem -DRDS_TRUSTSTORE_URL=file:/opt/amazon/certs/RDSTrustStore.jks -Dspark.network.timeout=600 -Dspark.metrics.conf.*.source.jvm.class=org.apache.spark.metrics.source.JvmSource -Dspark.eventLog.enabled=true -Dspark.driver.cores=4 -Dspark.metrics.conf.*.source.system.class=org.apache.spark.metrics.source.SystemMetricsSource -Dspark.glue.endpoint=https://glue-jes.us-west-2.amazonaws.com -Dspark.default.parallelism=8 -Dspark.sql.parquet.output.committer.class=com.amazon.emr.committer.EmrOptimizedSparkSqlParquetOutputCommitter -Dspark.cloudwatch.logging.conf.jobRunId=jr_1643f6c9a18daf0cabf2bcd3fe47356d3656835dda8716cf48e2c6c4e2af75df -Dspark.glue.enable-job-insights=true -Dspark.glueExceptionAnalysisEventLog.dir=/tmp/glue-exception-analysis-logs/ -Dspark.hadoop.mapred.output.direct.EmrFileSystem=true -Dspark.hadoop.aws.glue.endpoint=https://glue.us-west-2.amazonaws.com -Dspark.glueJobInsights.enabled=true -Dspark.glue.USE_PROXY=false -Dspark.eventLog.dir=/tmp/spark-event-logs/ -Dspark.extraListeners=com.amazonaws.services.glueexceptionanalysis.GlueExceptionAnalysisListener -Dspark.hadoop.fs.s3.impl=com.amazon.ws.emr.hadoop.fs.EmrFileSystem -Dspark.authenticate.secret=<HIDDEN> -Dspark.glue.additional-python-modules=faker==11.3.0 -Dspark.metrics.conf.*.sink.GlueCloudwatch.jobRunId=jr_1643f6c9a18daf0cabf2bcd3fe47356d3656835dda8716cf48e2c6c4e2af75df -Dspark.hadoop.mapred.output.direct.NativeS3FileSystem=true -Dspark.pyspark.python=/usr/bin/python3 -Dspark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory -Dspark.cloudwatch.logging.ui.showConsoleProgress=true -Dspark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 -Dspark.hadoop.glue.michiganCredentialsProviderProxy=com.amazonaws.services.glue.remote.LakeformationCredentialsProvider -Dspark.metrics.conf.driver.source.aggregate.class=org.apache.spark.metrics.source.AggregateMetricsSource -Dspark.glue.enable-continuous-cloudwatch-log=true -Dspark.glue.JOB_RUN_ID=jr_1643f6c9a18daf0cabf2bcd3fe47356d3656835dda8716cf48e2c6c4e2af75df -Dspark.sql.catalogImplementation=hive -Dspark.dynamicAllocation.maxExecutors=2 -Dspark.ui.enabled=false -Dspark.driver.extraClassPath=/tmp:/opt/amazon/conf:/opt/amazon/glue-manifest.jar -Dspark.sql.shuffle.partitions=8 -Dspark.authenticate=true -Dspark.glue.GLUE_TASK_GROUP_ID=eb5a0158-1b23-4c96-b077-73337a6fd062 -Dspark.dynamicAllocation.enabled=false -Dspark.shuffle.service.enabled=false -Dspark.hadoop.mapred.output.committer.class=org.apache.hadoop.mapred.DirectOutputCommitter -Dspark.executor.extraClassPath=/tmp:/opt/amazon/conf:/opt/amazon/glue-manifest.jar -Dspark.glue.GLUE_VERSION=3.0 -Dspark.executor.cores=4 -Dspark.hadoop.lakeformation.credentials.url=http://localhost:9998/lakeformationcredentials -Dspark.app.name=nativespark-hudi-test-jr_1643f6c9a18daf0cabf2bcd3fe47356d3656835dda8716cf48e2c6c4e2af75df -Dspark.executor.instances=2 -Dspark.metrics.conf.*.sink.GlueCloudwatch.jobName=hudi-test -Dspark.rpc.askTimeout=600 
-Dspark.sql.parquet.fs.optimized.committer.optimization-enabled=true -Dspark.metrics.conf.*.source.s3.class=org.apache.spark.metrics.source.S3FileSystemSource -Dspark.driver.host=172.36.33.155 -Dspark.metrics.conf.*.sink.GlueCloudwatch.namespace=Glue -Dspark.pyFiles= -Dspark.executor.memory=10g -Dspark.metrics.conf.*.sink.GlueCloudwatch.class=org.apache.spark.metrics.sink.GlueCloudwatchSink -Dspark.driver.memory=10g -Dspark.hadoop.fs.s3.buffer.dir=/tmp/hadoop-spark/s3 -Dspark.glue.GLUE_COMMAND_CRITERIA=glueetl -Dspark.master=jes -Dspark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false -Dspark.unsafe.sorter.spill.read.ahead.enabled=false -Dspark.hadoop.parquet.enable.summary-metadata=false -Dspark.glueAppInsightsLog.dir=/tmp/glue-app-insights-logs/ -Dspark.glue.jobLanguage=python -Dspark.glue.extra-jars=s3://glue-learn-begineers/jar/spark-avro_2.12-3.0.1.jar,s3://glue-learn-begineers/jar/hudi-spark3-bundle_2.12-0.9.0.jar -Dspark.glue.JOB_NAME=hudi-test -Dspark.script.location=s3://aws-glue-assets-043916019468-us-west-2/scripts/hudi-test.py -Dspark.files.overwrite=true com.amazonaws.services.glue.ProcessLauncher --launch-class org.apache.spark.deploy.PythonRunner /opt/amazon/bin/runscript.py  /tmp/hudi-test.py true --job-bookmark-option job-bookmark-enable --JOB_ID j_8cd1407036ab38dae74c05008aba629c7beab90ad673660ec0184b6fb1bcef49 true --spark-event-logs-path s3://aws-glue-assets-043916019468-us-west-2/sparkHistoryLogs/ --JOB_RUN_ID jr_1643f6c9a18daf0cabf2bcd3fe47356d3656835dda8716cf48e2c6c4e2af75df --JOB_NAME hudi-test --TempDir s3://aws-glue-assets-043916019468-us-west-2/temporary/
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/amazon/spark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/amazon/lib/Log4j-slf4j-2.x.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/amazon/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Continuous Logging: Creating cloudwatch appender.
Continuous Logging: Creating cloudwatch appender.
Continuous Logging: Creating cloudwatch appender.
Continuous Logging: Creating cloudwatch appender.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
SLF4J: A number (14) of logging calls during the initialization phase have been intercepted and are
SLF4J: now being replayed. These are subject to the filtering rules of the underlying logging system.
SLF4J: See also http://www.slf4j.org/codes.html#replay
2022-12-07 15:58:38,910 INFO [main] util.PlatformInfo (PlatformInfo.java:getJobFlowId(56)): Unable to read clusterId from http://localhost:8321/configuration, trying extra instance data file: /var/lib/instance-controller/extraInstanceData.json
2022-12-07 15:58:38,961 INFO [main] util.PlatformInfo (PlatformInfo.java:getJobFlowId(63)): Unable to read clusterId from /var/lib/instance-controller/extraInstanceData.json, trying EMR job-flow data file: /var/lib/info/job-flow.json
2022-12-07 15:58:38,961 INFO [main] util.PlatformInfo (PlatformInfo.java:getJobFlowId(71)): Unable to read clusterId from /var/lib/info/job-flow.json, out of places to look
2022-12-07 15:58:40,305 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): SLF4J: Class path contains multiple SLF4J bindings.
2022-12-07 15:58:40,305 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): SLF4J: Found binding in [jar:file:/opt/amazon/spark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
2022-12-07 15:58:40,305 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): SLF4J: Found binding in [jar:file:/opt/amazon/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
2022-12-07 15:58:40,306 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): SLF4J: Found binding in [jar:file:/opt/amazon/lib/Log4j-slf4j-2.x.jar!/org/slf4j/impl/StaticLoggerBinder.class]
2022-12-07 15:58:40,306 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
2022-12-07 15:58:40,320 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2022-12-07 15:58:40,667 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
2022-12-07 15:58:40,667 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): log4j:WARN Please initialize the log4j system properly.
2022-12-07 15:58:40,667 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
2022-12-07 15:58:41,386 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): ==== Netty Server Started ====
2022-12-07 15:58:41,389 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): input rulesFilePath=/tmp/glue_app_analyzer_rules
2022-12-07 15:58:41,814 INFO [main] glue.SafeLogging (Logging.scala:logInfo(57)): Initializing logging subsystem
2022-12-07 15:58:43,864 INFO [Thread-12] spark.SparkContext (Logging.scala:logInfo(57)): Running Spark version 3.1.1-amzn-0
2022-12-07 15:58:43,911 INFO [Thread-12] resource.ResourceUtils (Logging.scala:logInfo(57)): ==============================================================
2022-12-07 15:58:43,912 INFO [Thread-12] resource.ResourceUtils (Logging.scala:logInfo(57)): No custom resources configured for spark.driver.
2022-12-07 15:58:43,913 INFO [Thread-12] resource.ResourceUtils (Logging.scala:logInfo(57)): ==============================================================
2022-12-07 15:58:43,913 INFO [Thread-12] spark.SparkContext (Logging.scala:logInfo(57)): Submitted application: nativespark-hudi-test-jr_1643f6c9a18daf0cabf2bcd3fe47356d3656835dda8716cf48e2c6c4e2af75df
2022-12-07 15:58:43,935 INFO [Thread-12] resource.ResourceProfile (Logging.scala:logInfo(57)): Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 4, script: , vendor: , memory -> name: memory, amount: 10240, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
2022-12-07 15:58:43,953 INFO [Thread-12] resource.ResourceProfile (Logging.scala:logInfo(57)): Limiting resource is cpus at 4 tasks per executor
2022-12-07 15:58:43,956 INFO [Thread-12] resource.ResourceProfileManager (Logging.scala:logInfo(57)): Added ResourceProfile id: 0
2022-12-07 15:58:44,032 INFO [Thread-12] spark.SecurityManager (Logging.scala:logInfo(57)): Changing view acls to: spark
2022-12-07 15:58:44,033 INFO [Thread-12] spark.SecurityManager (Logging.scala:logInfo(57)): Changing modify acls to: spark
2022-12-07 15:58:44,034 INFO [Thread-12] spark.SecurityManager (Logging.scala:logInfo(57)): Changing view acls groups to:
2022-12-07 15:58:44,034 INFO [Thread-12] spark.SecurityManager (Logging.scala:logInfo(57)): Changing modify acls groups to:
2022-12-07 15:58:44,035 INFO [Thread-12] spark.SecurityManager (Logging.scala:logInfo(57)): SecurityManager: authentication enabled; ui acls disabled; users  with view permissions: Set(spark); groups with view permissions: Set(); users  with modify permissions: Set(spark); groups with modify permissions: Set()
2022-12-07 15:58:44,361 INFO [Thread-12] util.Utils (Logging.scala:logInfo(57)): Successfully started service 'sparkDriver' on port 34581.
2022-12-07 15:58:44,409 INFO [Thread-12] spark.SparkEnv (Logging.scala:logInfo(57)): Registering MapOutputTracker
2022-12-07 15:58:44,446 INFO [Thread-12] spark.SparkEnv (Logging.scala:logInfo(57)): Registering BlockManagerMaster
2022-12-07 15:58:44,480 INFO [Thread-12] storage.BlockManagerMasterEndpoint (Logging.scala:logInfo(57)): Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2022-12-07 15:58:44,481 INFO [Thread-12] storage.BlockManagerMasterEndpoint (Logging.scala:logInfo(57)): BlockManagerMasterEndpoint up
2022-12-07 15:58:44,486 INFO [Thread-12] spark.SparkEnv (Logging.scala:logInfo(57)): Registering BlockManagerMasterHeartbeat
2022-12-07 15:58:44,508 INFO [Thread-12] storage.DiskBlockManager (Logging.scala:logInfo(57)): Created local directory at /tmp/blockmgr-28d77124-2a0a-4c4f-ae8f-d15612884b2a
2022-12-07 15:58:44,543 INFO [Thread-12] memory.MemoryStore (Logging.scala:logInfo(57)): MemoryStore started with capacity 5.8 GiB
2022-12-07 15:58:44,566 INFO [Thread-12] spark.SparkEnv (Logging.scala:logInfo(57)): Registering OutputCommitCoordinator
2022-12-07 15:58:44,715 INFO [Thread-12] scheduler.JESSchedulerBackend$JESAsSchedulerBackendEndpoint (Logging.scala:logInfo(57)): JESAsSchedulerBackendEndpoint
2022-12-07 15:58:44,716 INFO [Thread-12] scheduler.JESSchedulerBackend (Logging.scala:logInfo(57)): JESSchedulerBackend
2022-12-07 15:58:44,723 INFO [Thread-12] scheduler.JESSchedulerBackend (JESClusterManager.scala:<init>(210)): JESClusterManager: Initializing JES client with proxy: host: null, port: -1
2022-12-07 15:58:44,980 INFO [allocator] glue.TaskGroupInterface (Logging.scala:logInfo(57)): creating executor task for executor 1; clientToken gr_eb5a0158-1b23-4c96-b077-73337a6fd062_e_1_a_spark-application-1670428724710
2022-12-07 15:58:44,986 INFO [Thread-12] util.Utils (Logging.scala:logInfo(57)): Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 35509.
2022-12-07 15:58:44,986 INFO [Thread-12] netty.NettyBlockTransferService (NettyBlockTransferService.scala:init(81)): Server created on 172.36.33.155:35509
2022-12-07 15:58:44,988 INFO [Thread-12] storage.BlockManager (Logging.scala:logInfo(57)): Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
2022-12-07 15:58:44,999 INFO [Thread-12] storage.BlockManagerMaster (Logging.scala:logInfo(57)): Registering BlockManager BlockManagerId(driver, 172.36.33.155, 35509, None)
2022-12-07 15:58:45,004 INFO [dispatcher-BlockManagerMaster] storage.BlockManagerMasterEndpoint (Logging.scala:logInfo(57)): Registering block manager 172.36.33.155:35509 with 5.8 GiB RAM, BlockManagerId(driver, 172.36.33.155, 35509, None)
2022-12-07 15:58:45,009 INFO [Thread-12] storage.BlockManagerMaster (Logging.scala:logInfo(57)): Registered BlockManager BlockManagerId(driver, 172.36.33.155, 35509, None)
2022-12-07 15:58:45,010 INFO [Thread-12] storage.BlockManager (Logging.scala:logInfo(57)): Initialized BlockManager: BlockManagerId(driver, 172.36.33.155, 35509, None)
2022-12-07 15:58:45,076 INFO [Thread-12] sink.GlueCloudwatchSink (GlueCloudwatchSink.scala:<init>(53)): GlueCloudwatchSink: get cloudwatch client using proxy: host null, port -1
2022-12-07 15:58:45,160 INFO [Thread-12] sink.GlueCloudwatchSink (GlueCloudwatchSink.scala:logInfo(22)): CloudwatchSink: Obtained credentials from the Instance Profile
2022-12-07 15:58:45,227 INFO [Thread-12] sink.GlueCloudwatchSink (GlueCloudwatchSink.scala:logInfo(22)): CloudwatchSink: jobName: hudi-test jobRunId: jr_1643f6c9a18daf0cabf2bcd3fe47356d3656835dda8716cf48e2c6c4e2af75df
2022-12-07 15:58:45,399 INFO [Thread-12] util.log (Log.java:initialized(169)): Logging initialized @10171ms to org.sparkproject.jetty.util.log.Slf4jLog
2022-12-07 15:58:45,523 INFO [Thread-12] history.SingleEventLogFileWriter (Logging.scala:logInfo(57)): Logging events to file:/tmp/spark-event-logs/spark-application-1670428724710.inprogress
2022-12-07 15:58:45,791 INFO [Thread-12] glueexceptionanalysis.EventLogFileWriter (FileWriter.scala:start(51)): Started file writer for com.amazonaws.services.glueexceptionanalysis.EventLogFileWriter
2022-12-07 15:58:45,807 INFO [allocator] glue.TaskGroupInterface (Logging.scala:logInfo(57)): createChildTask API response code 200
2022-12-07 15:58:45,809 INFO [allocator] glue.ExecutorTaskManagement (Logging.scala:logInfo(57)): executor task g-523d741ce5530b03650113bbb25bb74c68b29f8e created for executor 1
2022-12-07 15:58:45,810 INFO [allocator] glue.TaskGroupInterface (Logging.scala:logInfo(57)): creating executor task for executor 2; clientToken gr_eb5a0158-1b23-4c96-b077-73337a6fd062_e_2_a_spark-application-1670428724710
2022-12-07 15:58:45,828 INFO [Thread-12] glueexceptionanalysis.EventLogSocketWriter (SocketWriter.scala:start(34)): Socket client for com.amazonaws.services.glueexceptionanalysis.EventLogSocketWriter started correctly
2022-12-07 15:58:45,866 INFO [Thread-12] spark.SparkContext (Logging.scala:logInfo(57)): Registered listener com.amazonaws.services.glueexceptionanalysis.GlueExceptionAnalysisListener
2022-12-07 15:58:45,922 INFO [Thread-12] scheduler.JESSchedulerBackend (Logging.scala:logInfo(57)): SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
2022-12-07 15:58:46,002 INFO [allocator] glue.TaskGroupInterface (Logging.scala:logInfo(57)): createChildTask API response code 200
2022-12-07 15:58:46,009 INFO [allocator] glue.ExecutorTaskManagement (Logging.scala:logInfo(57)): executor task g-5659a8df9288c783a41f61997de7901665d58b93 created for executor 2
2022-12-07 15:58:47,010 INFO [Thread-12] internal.SharedState (Logging.scala:logInfo(57)): Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/tmp/spark-warehouse').
2022-12-07 15:58:47,011 INFO [Thread-12] internal.SharedState (Logging.scala:logInfo(57)): Warehouse path is 'file:/tmp/spark-warehouse'.
2022-12-07 15:58:48,223 INFO [Thread-12] glue.GlueContext (GlueContext.scala:<init>(120)): GlueMetrics configured and enabled
2022-12-07 15:58:54,783 WARN [Thread-12] impl.MetricsConfig (MetricsConfig.java:loadFirst(136)): Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
2022-12-07 15:58:54,894 INFO [Thread-12] impl.MetricsSystemImpl (MetricsSystemImpl.java:startTimer(374)): Scheduled Metric snapshot period at 10 second(s).
2022-12-07 15:58:54,895 INFO [Thread-12] impl.MetricsSystemImpl (MetricsSystemImpl.java:start(191)): s3a-file-system metrics system started
2022-12-07 15:58:55,829 INFO [Thread-12] table.HoodieTableMetaClient (HoodieTableMetaClient.java:initTableAndGetMetaClient(344)): Initializing s3a://glue-learn-begineers/tmp/users as hoodie table s3a://glue-learn-begineers/tmp/users
2022-12-07 15:59:03,960 INFO [dispatcher-CoarseGrainedScheduler] scheduler.JESSchedulerBackend$JESAsSchedulerBackendEndpoint (Logging.scala:logInfo(57)): Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.34.11.224:53240) with ID 2,  ResourceProfileId 0
2022-12-07 15:59:03,963 INFO [spark-listener-group-shared] scheduler.ExecutorEventListener (Logging.scala:logInfo(57)): Got executor added event for 2 @ 1670428743962
2022-12-07 15:59:03,964 INFO [spark-listener-group-shared] glue.ExecutorTaskManagement (Logging.scala:logInfo(57)): connected executor 2
2022-12-07 15:59:04,637 INFO [dispatcher-BlockManagerMaster] storage.BlockManagerMasterEndpoint (Logging.scala:logInfo(57)): Registering block manager 172.34.11.224:42111 with 5.8 GiB RAM, BlockManagerId(2, 172.34.11.224, 42111, None)
2022-12-07 15:59:05,498 INFO [Thread-12] table.HoodieTableMetaClient (HoodieTableMetaClient.java:<init>(105)): Loading HoodieTableMetaClient from s3a://glue-learn-begineers/tmp/users
2022-12-07 15:59:05,942 INFO [Thread-12] table.HoodieTableConfig (HoodieTableConfig.java:<init>(170)): Loading table properties from s3a://glue-learn-begineers/tmp/users/.hoodie/hoodie.properties
2022-12-07 15:59:06,103 INFO [Thread-12] table.HoodieTableMetaClient (HoodieTableMetaClient.java:<init>(125)): Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://glue-learn-begineers/tmp/users
2022-12-07 15:59:06,103 INFO [Thread-12] table.HoodieTableMetaClient (HoodieTableMetaClient.java:initTableAndGetMetaClient(381)): Finished initializing Table of type COPY_ON_WRITE from s3a://glue-learn-begineers/tmp/users
2022-12-07 15:59:06,383 INFO [Thread-12] hudi.HoodieSparkSqlWriter$ (HoodieSparkSqlWriter.scala:write(221)): Registered avro schema : {     "type": "record",     "name": "users_record",     "namespace": "hoodie.users",     "fields": [         {             "name": "emp_id",             "type": [                 "null",                 "long"             ],             "default": null         },         {             "name": "employee_name",             "type": [                 "null",                 "string"             ],             "default": null         },         {             "name": "department",             "type": [                 "null",                 "string"             ],             "default": null         },         {             "name": "state",             "type": [                 "null",                 "string"             ],             "default": null         },         {             "name": "salary",             "type": [                 "null",                 "long"             ],             "default": null         },         {             "name": "age",             "type": [                 "null",                 "long"             ],             "default": null         },         {             "name": "bonus",             "type": [                 "null",                 "long"             ],             "default": null         },         {             "name": "ts",             "type": [                 "null",                 "long"             ],             "default": null         }     ] }
2022-12-07 15:59:06,985 INFO [Thread-12] codegen.CodeGenerator (Logging.scala:logInfo(57)): Code generated in 323.430584 ms
2022-12-07 15:59:07,239 INFO [Thread-12] embedded.EmbeddedTimelineService (EmbeddedTimelineServerHelper.java:startTimelineService(67)): Starting Timeline service !!
2022-12-07 15:59:07,241 INFO [Thread-12] embedded.EmbeddedTimelineService (EmbeddedTimelineService.java:setHostAddr(100)): Overriding hostIp to (172.36.33.155) found in spark-conf. It was null
2022-12-07 15:59:07,260 INFO [Thread-12] view.FileSystemViewManager (FileSystemViewManager.java:createViewManager(232)): Creating View Manager with storage type :MEMORY
2022-12-07 15:59:07,262 INFO [Thread-12] view.FileSystemViewManager (FileSystemViewManager.java:createViewManager(244)): Creating in-memory based Table View
2022-12-07 15:59:07,287 INFO [Thread-12] util.log (Log.java:initialized(193)): Logging initialized @32060ms to org.apache.hudi.org.apache.jetty.util.log.Slf4jLog
2022-12-07 15:59:07,569 INFO [Thread-12] javalin.Javalin (Javalin.java:start(134)):            __                      __ _          / /____ _ _   __ ____ _ / /(_)____     __  / // __ `/| | / // __ `// // // __ \    / /_/ // /_/ / | |/ // /_/ // // // / / /    \____/ \__,_/  |___/ \__,_//_//_//_/ /_/        https://javalin.io/documentation
2022-12-07 15:59:07,571 INFO [Thread-12] javalin.Javalin (Javalin.java:start(139)): Starting Javalin ...
2022-12-07 15:59:07,731 INFO [dispatcher-CoarseGrainedScheduler] scheduler.JESSchedulerBackend$JESAsSchedulerBackendEndpoint (Logging.scala:logInfo(57)): Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.34.179.199:52900) with ID 1,  ResourceProfileId 0
2022-12-07 15:59:07,732 INFO [spark-listener-group-shared] scheduler.ExecutorEventListener (Logging.scala:logInfo(57)): Got executor added event for 1 @ 1670428747732
2022-12-07 15:59:07,733 INFO [spark-listener-group-shared] glue.ExecutorTaskManagement (Logging.scala:logInfo(57)): connected executor 1
2022-12-07 15:59:07,843 INFO [Thread-12] javalin.Javalin (JettyServerUtil.kt:initialize(113)): Listening on http://localhost:45551/
2022-12-07 15:59:07,843 INFO [Thread-12] javalin.Javalin (Javalin.java:start(149)): Javalin started in 280ms \o/
2022-12-07 15:59:07,843 INFO [Thread-12] service.TimelineService (TimelineService.java:startService(280)): Starting Timeline server on port :45551
2022-12-07 15:59:07,843 INFO [Thread-12] embedded.EmbeddedTimelineService (EmbeddedTimelineService.java:startServer(95)): Started embedded timeline server at 172.36.33.155:45551
2022-12-07 15:59:07,859 INFO [Thread-12] hudi.HoodieSparkSqlWriter$ (HoodieSparkSqlWriter.scala:isAsyncCompactionEnabled(675)): Config.inlineCompactionEnabled ? false
2022-12-07 15:59:07,859 INFO [Thread-12] hudi.HoodieSparkSqlWriter$ (HoodieSparkSqlWriter.scala:isAsyncClusteringEnabled(686)): Config.asyncClusteringEnabled ? false
2022-12-07 15:59:07,860 INFO [Thread-12] table.HoodieTableMetaClient (HoodieTableMetaClient.java:<init>(105)): Loading HoodieTableMetaClient from s3a://glue-learn-begineers/tmp/users
2022-12-07 15:59:08,079 INFO [Thread-12] table.HoodieTableConfig (HoodieTableConfig.java:<init>(170)): Loading table properties from s3a://glue-learn-begineers/tmp/users/.hoodie/hoodie.properties
2022-12-07 15:59:08,226 INFO [Thread-12] table.HoodieTableMetaClient (HoodieTableMetaClient.java:<init>(125)): Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://glue-learn-begineers/tmp/users
2022-12-07 15:59:08,226 INFO [Thread-12] table.HoodieTableMetaClient (HoodieTableMetaClient.java:<init>(128)): Loading Active commit timeline for s3a://glue-learn-begineers/tmp/users
2022-12-07 15:59:08,542 INFO [Thread-12] timeline.HoodieActiveTimeline (HoodieActiveTimeline.java:<init>(115)): Loaded instants upto : Optional.empty
2022-12-07 15:59:08,550 INFO [Thread-12] client.AbstractHoodieWriteClient (AbstractHoodieWriteClient.java:startCommit(714)): Generate a new instant time: 20221207155854 action: commit
2022-12-07 15:59:08,551 INFO [Thread-12] heartbeat.HoodieHeartbeatClient (HoodieHeartbeatClient.java:start(166)): Received request to start heartbeat for instant time 20221207155854
2022-12-07 15:59:08,751 INFO [dispatcher-BlockManagerMaster] storage.BlockManagerMasterEndpoint (Logging.scala:logInfo(57)): Registering block manager 172.34.179.199:34373 with 5.8 GiB RAM, BlockManagerId(1, 172.34.179.199, 34373, None)
2022-12-07 15:59:09,002 INFO [Thread-12] timeline.HoodieActiveTimeline (HoodieActiveTimeline.java:createNewInstant(144)): Creating a new instant [==>20221207155854__commit__REQUESTED]
2022-12-07 15:59:09,648 INFO [Thread-12] table.HoodieTableMetaClient (HoodieTableMetaClient.java:<init>(105)): Loading HoodieTableMetaClient from s3a://glue-learn-begineers/tmp/users
2022-12-07 15:59:09,745 INFO [pool-5-thread-1] glue.LogPusher (Logging.scala:logInfo(57)): uploading /tmp/spark-event-logs/ to s3://aws-glue-assets-043916019468-us-west-2/sparkHistoryLogs/
2022-12-07 15:59:10,081 INFO [Thread-12] table.HoodieTableConfig (HoodieTableConfig.java:<init>(170)): Loading table properties from s3a://glue-learn-begineers/tmp/users/.hoodie/hoodie.properties
2022-12-07 15:59:10,239 INFO [Thread-12] table.HoodieTableMetaClient (HoodieTableMetaClient.java:<init>(125)): Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://glue-learn-begineers/tmp/users
2022-12-07 15:59:10,239 INFO [Thread-12] table.HoodieTableMetaClient (HoodieTableMetaClient.java:<init>(128)): Loading Active commit timeline for s3a://glue-learn-begineers/tmp/users
2022-12-07 15:59:10,296 INFO [pool-5-thread-1] s3n.MultipartUploadOutputStream (MultipartUploadOutputStream.java:close(418)): close closed:false s3://aws-glue-assets-043916019468-us-west-2/sparkHistoryLogs/spark-application-1670428724710.inprogress
2022-12-07 15:59:10,540 INFO [Thread-12] timeline.HoodieActiveTimeline (HoodieActiveTimeline.java:<init>(115)): Loaded instants upto : Option{val=[==>20221207155854__commit__REQUESTED]}
2022-12-07 15:59:10,560 INFO [Thread-12] view.FileSystemViewManager (FileSystemViewManager.java:createViewManager(232)): Creating View Manager with storage type :REMOTE_FIRST
2022-12-07 15:59:10,565 INFO [Thread-12] view.FileSystemViewManager (FileSystemViewManager.java:createViewManager(252)): Creating remote first table view
2022-12-07 15:59:10,575 INFO [Thread-12] transaction.TransactionManager (TransactionManager.java:beginTransaction(61)): Latest completed transaction instant Optional.empty
2022-12-07 15:59:10,575 INFO [Thread-12] transaction.TransactionManager (TransactionManager.java:beginTransaction(63)): Transaction starting with transaction owner Option{val=[==>20221207155854__commit__INFLIGHT]}
2022-12-07 15:59:10,576 INFO [Thread-12] lock.LockManager (LockManager.java:getLockProvider(96)): LockProvider org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider
2022-12-07 15:59:10,812 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(73)): Error from Python:
Traceback (most recent call last):
  File "/tmp/hudi-test.py", line 114, in <module>
    df.write.format("hudi").options(**hudi_options).mode("overwrite").save(final_base_path)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 1109, in save
    self._jwrite.save(path)
  File "/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o136.save.
: org.apache.hudi.exception.HoodieException: Unable to load class
    at org.apache.hudi.common.util.ReflectionUtils.getClass(ReflectionUtils.java:58)
    at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:90)
    at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:101)
    at org.apache.hudi.client.transaction.lock.LockManager.getLockProvider(LockManager.java:97)
    at org.apache.hudi.client.transaction.lock.LockManager.lock(LockManager.java:61)
    at org.apache.hudi.client.transaction.TransactionManager.beginTransaction(TransactionManager.java:64)
    at org.apache.hudi.client.AbstractHoodieWriteClient.preWrite(AbstractHoodieWriteClient.java:402)
    at org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:156)
    at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:214)
    at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:265)
    at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:164)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:134)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:133)
    at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
    at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)
    at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassNotFoundException: org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider
    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.hudi.common.util.ReflectionUtils.getClass(ReflectionUtils.java:54)
    ... 48 more
2022-12-07 15:59:10,836 ERROR [main] glueexceptionanalysis.GlueExceptionAnalysisListener (Logging.scala:logError(9)): [Glue Exception Analysis] {     "Event": "GlueETLJobExceptionEvent",     "Timestamp": 1670428750832,     "Failure Reason": "Traceback (most recent call last):\n  File \"/tmp/hudi-test.py\", line 114, in <module>\n    df.write.format(\"hudi\").options(**hudi_options).mode(\"overwrite\").save(final_base_path)\n  File \"/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py\", line 1109, in save\n    self._jwrite.save(path)\n  File \"/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py\", line 1305, in __call__\n    answer, self.gateway_client, self.target_id, self.name)\n  File \"/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py\", line 111, in deco\n    return f(*a, **kw)\n  File \"/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py\", line 328, in get_return_value\n    format(target_id, \".\", name), value)\npy4j.protocol.Py4JJavaError: An error occurred while calling o136.save.\n: org.apache.hudi.exception.HoodieException: Unable to load class\n\tat org.apache.hudi.common.util.ReflectionUtils.getClass(ReflectionUtils.java:58)\n\tat org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:90)\n\tat org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:101)\n\tat org.apache.hudi.client.transaction.lock.LockManager.getLockProvider(LockManager.java:97)\n\tat org.apache.hudi.client.transaction.lock.LockManager.lock(LockManager.java:61)\n\tat org.apache.hudi.client.transaction.TransactionManager.beginTransaction(TransactionManager.java:64)\n\tat org.apache.hudi.client.AbstractHoodieWriteClient.preWrite(AbstractHoodieWriteClient.java:402)\n\tat org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:156)\n\tat org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:214)\n\tat org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:265)\n\tat org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:164)\n\tat org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)\n\tat org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)\n\tat org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)\n\tat org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)\n\tat org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185)\n\tat org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)\n\tat org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)\n\tat org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220)\n\tat org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181)\n\tat org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:134)\n\tat org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:133)\n\tat org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)\n\tat org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)\n\tat org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)\n\tat org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)\n\tat 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)\n\tat org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)\n\tat org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)\n\tat org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)\n\tat org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)\n\tat org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)\n\tat org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)\n\tat org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)\n\tat org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)\n\tat org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)\n\tat org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)\n\tat org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:498)\n\tat py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\tat py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\tat py4j.Gateway.invoke(Gateway.java:282)\n\tat py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\tat py4j.commands.CallCommand.execute(CallCommand.java:79)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:238)\n\tat java.lang.Thread.run(Thread.java:750)\nCaused by: java.lang.ClassNotFoundException: org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider\n\tat java.net.URLClassLoader.findClass(URLClassLoader.java:387)\n\tat java.lang.ClassLoader.loadClass(ClassLoader.java:418)\n\tat sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)\n\tat java.lang.ClassLoader.loadClass(ClassLoader.java:351)\n\tat java.lang.Class.forName0(Native Method)\n\tat java.lang.Class.forName(Class.java:348)\n\tat org.apache.hudi.common.util.ReflectionUtils.getClass(ReflectionUtils.java:54)\n\t... 
48 more\n",     "Stack Trace": [         {             "Declaring Class": "get_return_value",             "Method Name": "format(target_id, \".\", name), value)",             "File Name": "/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py",             "Line Number": 328         },         {             "Declaring Class": "deco",             "Method Name": "return f(*a, **kw)",             "File Name": "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py",             "Line Number": 111         },         {             "Declaring Class": "__call__",             "Method Name": "answer, self.gateway_client, self.target_id, self.name)",             "File Name": "/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",             "Line Number": 1305         },         {             "Declaring Class": "save",             "Method Name": "self._jwrite.save(path)",             "File Name": "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py",             "Line Number": 1109         },         {             "Declaring Class": "<module>",             "Method Name": "df.write.format(\"hudi\").options(**hudi_options).mode(\"overwrite\").save(final_base_path)",             "File Name": "/tmp/hudi-test.py",             "Line Number": 114         }     ],     "Last Executed Line number": 114,     "script": "hudi-test.py" }
2022-12-07 15:59:10,935 ERROR [main] glueexceptionanalysis.GlueExceptionAnalysisListener (Logging.scala:logError(9)): [Glue Exception Analysis] Last Executed Line number from script hudi-test.py: 114
2022-12-07 15:59:10,938 INFO [main] glue.ResourceManagerSocketWriter (SocketWriter.scala:start(34)): Socket client for com.amazonaws.services.glue.ResourceManagerSocketWriter started correctly
2022-12-07 15:59:10,946 INFO [main] glue.ProcessLauncher (Logging.scala:logInfo(57)): postprocessing
2022-12-07 15:59:10,947 INFO [main] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): Analyzer: Error stopping analyzer process null
2022-12-07 15:59:10,949 INFO [main] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): Analyzer: Error terminating analyzer process null
2022-12-07 15:59:10,950 INFO [main] glue.LogPusher (Logging.scala:logInfo(57)): stopping
Continuous Logging: Shutting down cloudwatch appender.
 
Continuous Logging: Shutting down cloudwatch appender.
Continuous Logging: Shutting down cloudwatch appender.
Continuous Logging: Shutting down cloudwatch appender.
2022-12-07 15:59:10,967 INFO [shutdown-hook-0] spark.SparkContext (Logging.scala:logInfo(57)): Invoking stop() from shutdown hook
2022-12-07 15:59:10,973 INFO [spark-listener-group-shared] glueexceptionanalysis.EventLogFileWriter (FileWriter.scala:stop(70)): Logs, events processed and insights are written to file /tmp/glue-exception-analysis-logs/spark-application-1670428724710
2022-12-07 15:59:10,980 INFO [shutdown-hook-0] scheduler.JESSchedulerBackend (Logging.scala:logInfo(57)): Shutting down all executors
2022-12-07 15:59:10,986 INFO [dispatcher-CoarseGrainedScheduler] scheduler.JESSchedulerBackend$JESAsSchedulerBackendEndpoint (Logging.scala:logInfo(57)): Asking each executor to shut down
2022-12-07 15:59:11,010 INFO [dispatcher-event-loop-0] spark.MapOutputTrackerMasterEndpoint (Logging.scala:logInfo(57)): MapOutputTrackerMasterEndpoint stopped!
2022-12-07 15:59:11,029 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): GlueJobEventsInputHandler Interrupted Last event of a job received. Stop event handling.
2022-12-07 15:59:11,030 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): Stopping - analyzer netty server
2022-12-07 15:59:11,037 INFO [shutdown-hook-0] memory.MemoryStore (Logging.scala:logInfo(57)): MemoryStore cleared
2022-12-07 15:59:11,037 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): Stopped - analyzer netty server
2022-12-07 15:59:11,037 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): Processing remaining events in queue before termination
2022-12-07 15:59:11,037 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): Closing rules engine
2022-12-07 15:59:11,037 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): Stopping analyzer log writer
2022-12-07 15:59:11,037 INFO [shutdown-hook-0] storage.BlockManager (Logging.scala:logInfo(57)): BlockManager stopped
2022-12-07 15:59:11,049 INFO [shutdown-hook-0] storage.BlockManagerMaster (Logging.scala:logInfo(57)): BlockManagerMaster stopped
2022-12-07 15:59:11,053 INFO [dispatcher-event-loop-1] scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint (Logging.scala:logInfo(57)): OutputCommitCoordinator stopped!
2022-12-07 15:59:11,061 INFO [shutdown-hook-0] spark.SparkContext (Logging.scala:logInfo(57)): Successfully stopped SparkContext
2022-12-07 15:59:11,061 INFO [shutdown-hook-0] glue.LogPusher (Logging.scala:logInfo(57)): uploading /tmp/spark-event-logs/ to s3://aws-glue-assets-043916019468-us-west-2/sparkHistoryLogs/
2022-12-07 15:59:11,111 INFO [shutdown-hook-0] s3n.MultipartUploadOutputStream (MultipartUploadOutputStream.java:close(418)): close closed:false s3://aws-glue-assets-043916019468-us-west-2/sparkHistoryLogs/spark-application-1670428724710
2022-12-07 15:59:11,142 INFO [shutdown-hook-0] util.ShutdownHookManager (Logging.scala:logInfo(57)): Shutdown hook called
2022-12-07 15:59:11,143 INFO [shutdown-hook-0] util.ShutdownHookManager (Logging.scala:logInfo(57)): Deleting directory /tmp/spark-4c587f79-b77e-4212-bf54-95b93594deea
2022-12-07 15:59:11,147 INFO [shutdown-hook-0] util.ShutdownHookManager (Logging.scala:logInfo(57)): Deleting directory /tmp/spark-4c587f79-b77e-4212-bf54-95b93594deea/pyspark-677e56c8-2549-4513-a1d9-a96b36052360
2022-12-07 15:59:11,151 INFO [shutdown-hook-0] impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(210)): Stopping s3a-file-system metrics system...
2022-12-07 15:59:11,152 INFO [shutdown-hook-0] impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(216)): s3a-file-system metrics system stopped.
2022-12-07 15:59:11,152 INFO [shutdown-hook-0] impl.MetricsSystemImpl (MetricsSystemImpl.java:shutdown(607)): s3a-file-system metrics system shutdown complete.


Preparing ... Wed Dec 7 15:58:22 UTC 2022 /usr/bin/java -cp /tmp:/opt/amazon/conf:/opt/amazon/glue-manifest.jar com.amazonaws.services.glue.PrepareLaunch --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.minExecutors=1 --conf spark.dynamicAllocation.maxExecutors=2 --conf spark.executor.memory=10g --conf spark.executor.cores=8 --conf spark.driver.memory=10g --conf spark.default.parallelism=24 --conf spark.sql.shuffle.partitions=24 --conf spark.network.timeout=600 --enable-glue-datacatalog true --job-bookmark-option job-bookmark-enable --TempDir s3://aws-glue-assets-043916019468-us-west-2/temporary/ --extra-jars s3://glue-learn-begineers/jar/spark-avro_2.12-3.0.1.jar,s3://glue-learn-begineers/jar/hudi-spark3-bundle_2.12-0.9.0.jar --JOB_ID j_8cd1407036ab38dae74c05008aba629c7beab90ad673660ec0184b6fb1bcef49 --enable-metrics true --enable-spark-ui true --spark-event-logs-path s3://aws-glue-assets-043916019468-us-west-2/sparkHistoryLogs/ --enable-job-insights true --JOB_RUN_ID jr_1643f6c9a18daf0cabf2bcd3fe47356d3656835dda8716cf48e2c6c4e2af75df --additional-python-modules faker==11.3.0 --enable-continuous-cloudwatch-log true --scriptLocation s3://aws-glue-assets-043916019468-us-west-2/scripts/hudi-test.py --job-language python --JOB_NAME hudi-test 1670428703170 SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/opt/amazon/spark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/opt/amazon/lib/Log4j-slf4j-2.x.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/opt/amazon/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] Glue analyzer rules s3://aws-glue-jes-prod-us-west-2-assets/etl/analysis-rules/glue3/

INFO 2022-12-07 15:58:25,461 1 com.amazonaws.services.glue.utils.AWSClientUtils$ [main] AWSClientUtils: create aws log client with conf: proxy host null, proxy port -1 INFO 2022-12-07 15:58:25,689 229 com.amazonaws.services.glue.utils.AWSClientUtils$ [main] AWSClientUtils: getGlueClient. proxy host: null , port: -1 GlueTelemetry: Current region us-west-2 GlueTelemetry: Glue Endpoint https://glue.us-west-2.amazonaws.com GlueTelemetry: Prod env...false GlueTelemetry: is disabled Glue ETL Marketplace - Start ETL connector activation process... Glue ETL Marketplace - no connections are attached to job hudi-test, no need to download custom connector jar for it. Glue ETL Marketplace - Retrieved no ETL connector jars, this may be due to no marketplace/custom connection attached to the job or failure of downloading them, please scroll back to the previous logs to find out the root cause. Container setup continues. Glue ETL Marketplace - ETL connector activation process finished, container setup continues... Download bucket: glue-learn-begineers key: jar/spark-avro_2.12-3.0.1.jar with usingProxy: false Download bucket: glue-learn-begineers key: jar/hudi-spark3-bundle_2.12-0.9.0.jar with usingProxy: false Download bucket: aws-glue-assets-043916019468-us-west-2 key: scripts/hudi-test.py with usingProxy: false INFO 2022-12-07 15:58:27,947 2487 com.amazonaws.services.glue.PythonModuleInstaller [main] pip3 install --user faker==11.3.0 INFO 2022-12-07 15:58:35,156 9696 com.amazonaws.services.glue.PythonModuleInstaller [main] Collecting faker==11.3.0 Downloading Faker-11.3.0-py3-none-any.whl (1.2 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 26.4 MB/s eta 0:00:00Requirement already satisfied: typing-extensions>=3.10.0.2 in /home/spark/.local/lib/python3.7/site-packages (from faker==11.3.0) (4.4.0)Collecting text-unidecode==1.3 Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.2/78.2 kB 17.4 MB/s eta 0:00:00Requirement already satisfied: python-dateutil>=2.4 in /home/spark/.local/lib/python3.7/site-packages (from faker==11.3.0) (2.8.2)Requirement already satisfied: six>=1.5 in /home/spark/.local/lib/python3.7/site-packages (from python-dateutil>=2.4->faker==11.3.0) (1.16.0)Installing collected packages: text-unidecode, fakerSuccessfully installed faker-11.3.0 text-unidecode-1.3 INFO 2022-12-07 15:58:35,156 9696 com.amazonaws.services.glue.PythonModuleInstaller [main] WARNING: The script faker is installed in '/home/spark/.local/bin' which is not on PATH. Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location. Launching ... 
Wed Dec 7 15:58:35 UTC 2022 /usr/bin/java -cp /tmp:/opt/amazon/conf:/opt/amazon/glue-manifest.jar:/tmp/ -Dlog4j.configuration=log4j -server -Xmx10g -XX:+UseG1GC -XX:MaxHeapFreeRatio=70 -XX:InitiatingHeapOccupancyPercent=45 -XX:OnOutOfMemoryError=kill -9 %p -XX:+UseCompressedOops -Djavax.net.ssl.trustStore=/opt/amazon/certs/ExternalAndAWSTrustStore.jks -Djavax.net.ssl.trustStoreType=JKS -Djavax.net.ssl.trustStorePassword=amazon -DRDS_ROOT_CERT_PATH=/opt/amazon/certs/rds-combined-ca-bundle.pem -DREDSHIFT_ROOT_CERT_PATH=/opt/amazon/certs/redshift-ssl-ca-cert.pem -DRDS_TRUSTSTORE_URL=file:/opt/amazon/certs/RDSTrustStore.jks -Dspark.network.timeout=600 -Dspark.metrics.conf..source.jvm.class=org.apache.spark.metrics.source.JvmSource -Dspark.eventLog.enabled=true -Dspark.driver.cores=4 -Dspark.metrics.conf..source.system.class=org.apache.spark.metrics.source.SystemMetricsSource -Dspark.glue.endpoint=https://glue-jes.us-west-2.amazonaws.com -Dspark.default.parallelism=8 -Dspark.sql.parquet.output.committer.class=com.amazon.emr.committer.EmrOptimizedSparkSqlParquetOutputCommitter -Dspark.cloudwatch.logging.conf.jobRunId=jr_1643f6c9a18daf0cabf2bcd3fe47356d3656835dda8716cf48e2c6c4e2af75df -Dspark.glue.enable-job-insights=true -Dspark.glueExceptionAnalysisEventLog.dir=/tmp/glue-exception-analysis-logs/ -Dspark.hadoop.mapred.output.direct.EmrFileSystem=true -Dspark.hadoop.aws.glue.endpoint=https://glue.us-west-2.amazonaws.com -Dspark.glueJobInsights.enabled=true -Dspark.glue.USE_PROXY=false -Dspark.eventLog.dir=/tmp/spark-event-logs/ -Dspark.extraListeners=com.amazonaws.services.glueexceptionanalysis.GlueExceptionAnalysisListener -Dspark.hadoop.fs.s3.impl=com.amazon.ws.emr.hadoop.fs.EmrFileSystem -Dspark.authenticate.secret= -Dspark.glue.additional-python-modules=faker==11.3.0 -Dspark.metrics.conf..sink.GlueCloudwatch.jobRunId=jr_1643f6c9a18daf0cabf2bcd3fe47356d3656835dda8716cf48e2c6c4e2af75df -Dspark.hadoop.mapred.output.direct.NativeS3FileSystem=true -Dspark.pyspark.python=/usr/bin/python3 -Dspark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory -Dspark.cloudwatch.logging.ui.showConsoleProgress=true -Dspark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 -Dspark.hadoop.glue.michiganCredentialsProviderProxy=com.amazonaws.services.glue.remote.LakeformationCredentialsProvider -Dspark.metrics.conf.driver.source.aggregate.class=org.apache.spark.metrics.source.AggregateMetricsSource -Dspark.glue.enable-continuous-cloudwatch-log=true -Dspark.glue.JOB_RUN_ID=jr_1643f6c9a18daf0cabf2bcd3fe47356d3656835dda8716cf48e2c6c4e2af75df -Dspark.sql.catalogImplementation=hive -Dspark.dynamicAllocation.maxExecutors=2 -Dspark.ui.enabled=false -Dspark.driver.extraClassPath=/tmp:/opt/amazon/conf:/opt/amazon/glue-manifest.jar -Dspark.sql.shuffle.partitions=8 -Dspark.authenticate=true -Dspark.glue.GLUE_TASK_GROUP_ID=eb5a0158-1b23-4c96-b077-73337a6fd062 -Dspark.dynamicAllocation.enabled=false -Dspark.shuffle.service.enabled=false -Dspark.hadoop.mapred.output.committer.class=org.apache.hadoop.mapred.DirectOutputCommitter -Dspark.executor.extraClassPath=/tmp:/opt/amazon/conf:/opt/amazon/glue-manifest.jar -Dspark.glue.GLUE_VERSION=3.0 -Dspark.executor.cores=4 -Dspark.hadoop.lakeformation.credentials.url=http://localhost:9998/lakeformationcredentials -Dspark.app.name=nativespark-hudi-test-jr_1643f6c9a18daf0cabf2bcd3fe47356d3656835dda8716cf48e2c6c4e2af75df -Dspark.executor.instances=2 -Dspark.metrics.conf..sink.GlueCloudwatch.jobName=hudi-test 
-Dspark.rpc.askTimeout=600 -Dspark.sql.parquet.fs.optimized.committer.optimization-enabled=true -Dspark.metrics.conf..source.s3.class=org.apache.spark.metrics.source.S3FileSystemSource -Dspark.driver.host=172.36.33.155 -Dspark.metrics.conf..sink.GlueCloudwatch.namespace=Glue -Dspark.pyFiles= -Dspark.executor.memory=10g -Dspark.metrics.conf..sink.GlueCloudwatch.class=org.apache.spark.metrics.sink.GlueCloudwatchSink -Dspark.driver.memory=10g -Dspark.hadoop.fs.s3.buffer.dir=/tmp/hadoop-spark/s3 -Dspark.glue.GLUE_COMMAND_CRITERIA=glueetl -Dspark.master=jes -Dspark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false -Dspark.unsafe.sorter.spill.read.ahead.enabled=false -Dspark.hadoop.parquet.enable.summary-metadata=false -Dspark.glueAppInsightsLog.dir=/tmp/glue-app-insights-logs/ -Dspark.glue.jobLanguage=python -Dspark.glue.extra-jars=s3://glue-learn-begineers/jar/spark-avro_2.12-3.0.1.jar,s3://glue-learn-begineers/jar/hudi-spark3-bundle_2.12-0.9.0.jar -Dspark.glue.JOB_NAME=hudi-test -Dspark.script.location=s3://aws-glue-assets-043916019468-us-west-2/scripts/hudi-test.py -Dspark.files.overwrite=true com.amazonaws.services.glue.ProcessLauncher --launch-class org.apache.spark.deploy.PythonRunner /opt/amazon/bin/runscript.py /tmp/hudi-test.py true --job-bookmark-option job-bookmark-enable --JOB_ID j_8cd1407036ab38dae74c05008aba629c7beab90ad673660ec0184b6fb1bcef49 true --spark-event-logs-path s3://aws-glue-assets-043916019468-us-west-2/sparkHistoryLogs/ --JOB_RUN_ID jr_1643f6c9a18daf0cabf2bcd3fe47356d3656835dda8716cf48e2c6c4e2af75df --JOB_NAME hudi-test --TempDir s3://aws-glue-assets-043916019468-us-west-2/temporary/ SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/opt/amazon/spark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/opt/amazon/lib/Log4j-slf4j-2.x.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/opt/amazon/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. Continuous Logging: Creating cloudwatch appender. Continuous Logging: Creating cloudwatch appender. Continuous Logging: Creating cloudwatch appender. Continuous Logging: Creating cloudwatch appender. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] SLF4J: A number (14) of logging calls during the initialization phase have been intercepted and are SLF4J: now being replayed. These are subject to the filtering rules of the underlying logging system.

SLF4J: See also http://www.slf4j.org/codes.html#replay 2022-12-07 15:58:38,910 INFO [main] util.PlatformInfo (PlatformInfo.java:getJobFlowId(56)): Unable to read clusterId from http://localhost:8321/configuration, trying extra instance data file: /var/lib/instance-controller/extraInstanceData.json 2022-12-07 15:58:38,961 INFO [main] util.PlatformInfo (PlatformInfo.java:getJobFlowId(63)): Unable to read clusterId from /var/lib/instance-controller/extraInstanceData.json, trying EMR job-flow data file: /var/lib/info/job-flow.json 2022-12-07 15:58:38,961 INFO [main] util.PlatformInfo (PlatformInfo.java:getJobFlowId(71)): Unable to read clusterId from /var/lib/info/job-flow.json, out of places to look 2022-12-07 15:58:40,305 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): SLF4J: Class path contains multiple SLF4J bindings. 2022-12-07 15:58:40,305 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): SLF4J: Found binding in [jar:file:/opt/amazon/spark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class] 2022-12-07 15:58:40,305 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): SLF4J: Found binding in [jar:file:/opt/amazon/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class] 2022-12-07 15:58:40,306 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): SLF4J: Found binding in [jar:file:/opt/amazon/lib/Log4j-slf4j-2.x.jar!/org/slf4j/impl/StaticLoggerBinder.class] 2022-12-07 15:58:40,306 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. 2022-12-07 15:58:40,320 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] 2022-12-07 15:58:40,667 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell). 2022-12-07 15:58:40,667 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): log4j:WARN Please initialize the log4j system properly. 2022-12-07 15:58:40,667 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info. 2022-12-07 15:58:41,386 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): ==== Netty Server Started ==== 2022-12-07 15:58:41,389 INFO [Thread-9] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): input rulesFilePath=/tmp/glue_app_analyzer_rules 2022-12-07 15:58:41,814 INFO [main] glue.SafeLogging (Logging.scala:logInfo(57)): Initializing logging subsystem 2022-12-07 15:58:43,864 INFO [Thread-12] spark.SparkContext (Logging.scala:logInfo(57)): Running Spark version 3.1.1-amzn-0 2022-12-07 15:58:43,911 INFO [Thread-12] resource.ResourceUtils (Logging.scala:logInfo(57)): ============================================================== 2022-12-07 15:58:43,912 INFO [Thread-12] resource.ResourceUtils (Logging.scala:logInfo(57)): No custom resources configured for spark.driver. 
2022-12-07 15:58:43,913 INFO [Thread-12] resource.ResourceUtils (Logging.scala:logInfo(57)): ============================================================== 2022-12-07 15:58:43,913 INFO [Thread-12] spark.SparkContext (Logging.scala:logInfo(57)): Submitted application: nativespark-hudi-test-jr_1643f6c9a18daf0cabf2bcd3fe47356d3656835dda8716cf48e2c6c4e2af75df 2022-12-07 15:58:43,935 INFO [Thread-12] resource.ResourceProfile (Logging.scala:logInfo(57)): Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 4, script: , vendor: , memory -> name: memory, amount: 10240, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0) 2022-12-07 15:58:43,953 INFO [Thread-12] resource.ResourceProfile (Logging.scala:logInfo(57)): Limiting resource is cpus at 4 tasks per executor 2022-12-07 15:58:43,956 INFO [Thread-12] resource.ResourceProfileManager (Logging.scala:logInfo(57)): Added ResourceProfile id: 0 2022-12-07 15:58:44,032 INFO [Thread-12] spark.SecurityManager (Logging.scala:logInfo(57)): Changing view acls to: spark 2022-12-07 15:58:44,033 INFO [Thread-12] spark.SecurityManager (Logging.scala:logInfo(57)): Changing modify acls to: spark 2022-12-07 15:58:44,034 INFO [Thread-12] spark.SecurityManager (Logging.scala:logInfo(57)): Changing view acls groups to: 2022-12-07 15:58:44,034 INFO [Thread-12] spark.SecurityManager (Logging.scala:logInfo(57)): Changing modify acls groups to: 2022-12-07 15:58:44,035 INFO [Thread-12] spark.SecurityManager (Logging.scala:logInfo(57)): SecurityManager: authentication enabled; ui acls disabled; users with view permissions: Set(spark); groups with view permissions: Set(); users with modify permissions: Set(spark); groups with modify permissions: Set() 2022-12-07 15:58:44,361 INFO [Thread-12] util.Utils (Logging.scala:logInfo(57)): Successfully started service 'sparkDriver' on port 34581. 
2022-12-07 15:58:44,409 INFO [Thread-12] spark.SparkEnv (Logging.scala:logInfo(57)): Registering MapOutputTracker 2022-12-07 15:58:44,446 INFO [Thread-12] spark.SparkEnv (Logging.scala:logInfo(57)): Registering BlockManagerMaster 2022-12-07 15:58:44,480 INFO [Thread-12] storage.BlockManagerMasterEndpoint (Logging.scala:logInfo(57)): Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information 2022-12-07 15:58:44,481 INFO [Thread-12] storage.BlockManagerMasterEndpoint (Logging.scala:logInfo(57)): BlockManagerMasterEndpoint up 2022-12-07 15:58:44,486 INFO [Thread-12] spark.SparkEnv (Logging.scala:logInfo(57)): Registering BlockManagerMasterHeartbeat 2022-12-07 15:58:44,508 INFO [Thread-12] storage.DiskBlockManager (Logging.scala:logInfo(57)): Created local directory at /tmp/blockmgr-28d77124-2a0a-4c4f-ae8f-d15612884b2a 2022-12-07 15:58:44,543 INFO [Thread-12] memory.MemoryStore (Logging.scala:logInfo(57)): MemoryStore started with capacity 5.8 GiB 2022-12-07 15:58:44,566 INFO [Thread-12] spark.SparkEnv (Logging.scala:logInfo(57)): Registering OutputCommitCoordinator 2022-12-07 15:58:44,715 INFO [Thread-12] scheduler.JESSchedulerBackend$JESAsSchedulerBackendEndpoint (Logging.scala:logInfo(57)): JESAsSchedulerBackendEndpoint 2022-12-07 15:58:44,716 INFO [Thread-12] scheduler.JESSchedulerBackend (Logging.scala:logInfo(57)): JESSchedulerBackend 2022-12-07 15:58:44,723 INFO [Thread-12] scheduler.JESSchedulerBackend (JESClusterManager.scala:(210)): JESClusterManager: Initializing JES client with proxy: host: null, port: -1 2022-12-07 15:58:44,980 INFO [allocator] glue.TaskGroupInterface (Logging.scala:logInfo(57)): creating executor task for executor 1; clientToken gr_eb5a0158-1b23-4c96-b077-73337a6fd062_e_1_a_spark-application-1670428724710 2022-12-07 15:58:44,986 INFO [Thread-12] util.Utils (Logging.scala:logInfo(57)): Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 35509. 
2022-12-07 15:58:44,986 INFO [Thread-12] netty.NettyBlockTransferService (NettyBlockTransferService.scala:init(81)): Server created on 172.36.33.155:35509 2022-12-07 15:58:44,988 INFO [Thread-12] storage.BlockManager (Logging.scala:logInfo(57)): Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy 2022-12-07 15:58:44,999 INFO [Thread-12] storage.BlockManagerMaster (Logging.scala:logInfo(57)): Registering BlockManager BlockManagerId(driver, 172.36.33.155, 35509, None) 2022-12-07 15:58:45,004 INFO [dispatcher-BlockManagerMaster] storage.BlockManagerMasterEndpoint (Logging.scala:logInfo(57)): Registering block manager 172.36.33.155:35509 with 5.8 GiB RAM, BlockManagerId(driver, 172.36.33.155, 35509, None) 2022-12-07 15:58:45,009 INFO [Thread-12] storage.BlockManagerMaster (Logging.scala:logInfo(57)): Registered BlockManager BlockManagerId(driver, 172.36.33.155, 35509, None) 2022-12-07 15:58:45,010 INFO [Thread-12] storage.BlockManager (Logging.scala:logInfo(57)): Initialized BlockManager: BlockManagerId(driver, 172.36.33.155, 35509, None) 2022-12-07 15:58:45,076 INFO [Thread-12] sink.GlueCloudwatchSink (GlueCloudwatchSink.scala:(53)): GlueCloudwatchSink: get cloudwatch client using proxy: host null, port -1 2022-12-07 15:58:45,160 INFO [Thread-12] sink.GlueCloudwatchSink (GlueCloudwatchSink.scala:logInfo(22)): CloudwatchSink: Obtained credentials from the Instance Profile 2022-12-07 15:58:45,227 INFO [Thread-12] sink.GlueCloudwatchSink (GlueCloudwatchSink.scala:logInfo(22)): CloudwatchSink: jobName: hudi-test jobRunId: jr_1643f6c9a18daf0cabf2bcd3fe47356d3656835dda8716cf48e2c6c4e2af75df 2022-12-07 15:58:45,399 INFO [Thread-12] util.log (Log.java:initialized(169)): Logging initialized @10171ms to org.sparkproject.jetty.util.log.Slf4jLog 2022-12-07 15:58:45,523 INFO [Thread-12] history.SingleEventLogFileWriter (Logging.scala:logInfo(57)): Logging events to file:/tmp/spark-event-logs/spark-application-1670428724710.inprogress 2022-12-07 15:58:45,791 INFO [Thread-12] glueexceptionanalysis.EventLogFileWriter (FileWriter.scala:start(51)): Started file writer for com.amazonaws.services.glueexceptionanalysis.EventLogFileWriter 2022-12-07 15:58:45,807 INFO [allocator] glue.TaskGroupInterface (Logging.scala:logInfo(57)): createChildTask API response code 200 2022-12-07 15:58:45,809 INFO [allocator] glue.ExecutorTaskManagement (Logging.scala:logInfo(57)): executor task g-523d741ce5530b03650113bbb25bb74c68b29f8e created for executor 1 2022-12-07 15:58:45,810 INFO [allocator] glue.TaskGroupInterface (Logging.scala:logInfo(57)): creating executor task for executor 2; clientToken gr_eb5a0158-1b23-4c96-b077-73337a6fd062_e_2_a_spark-application-1670428724710 2022-12-07 15:58:45,828 INFO [Thread-12] glueexceptionanalysis.EventLogSocketWriter (SocketWriter.scala:start(34)): Socket client for com.amazonaws.services.glueexceptionanalysis.EventLogSocketWriter started correctly 2022-12-07 15:58:45,866 INFO [Thread-12] spark.SparkContext (Logging.scala:logInfo(57)): Registered listener com.amazonaws.services.glueexceptionanalysis.GlueExceptionAnalysisListener 2022-12-07 15:58:45,922 INFO [Thread-12] scheduler.JESSchedulerBackend (Logging.scala:logInfo(57)): SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0 2022-12-07 15:58:46,002 INFO [allocator] glue.TaskGroupInterface (Logging.scala:logInfo(57)): createChildTask API response code 200 2022-12-07 15:58:46,009 INFO [allocator] glue.ExecutorTaskManagement (Logging.scala:logInfo(57)): 
executor task g-5659a8df9288c783a41f61997de7901665d58b93 created for executor 2 2022-12-07 15:58:47,010 INFO [Thread-12] internal.SharedState (Logging.scala:logInfo(57)): Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/tmp/spark-warehouse'). 2022-12-07 15:58:47,011 INFO [Thread-12] internal.SharedState (Logging.scala:logInfo(57)): Warehouse path is 'file:/tmp/spark-warehouse'. 2022-12-07 15:58:48,223 INFO [Thread-12] glue.GlueContext (GlueContext.scala:(120)): GlueMetrics configured and enabled 2022-12-07 15:58:54,783 WARN [Thread-12] impl.MetricsConfig (MetricsConfig.java:loadFirst(136)): Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties 2022-12-07 15:58:54,894 INFO [Thread-12] impl.MetricsSystemImpl (MetricsSystemImpl.java:startTimer(374)): Scheduled Metric snapshot period at 10 second(s). 2022-12-07 15:58:54,895 INFO [Thread-12] impl.MetricsSystemImpl (MetricsSystemImpl.java:start(191)): s3a-file-system metrics system started 2022-12-07 15:58:55,829 INFO [Thread-12] table.HoodieTableMetaClient (HoodieTableMetaClient.java:initTableAndGetMetaClient(344)): Initializing s3a://glue-learn-begineers/tmp/users as hoodie table s3a://glue-learn-begineers/tmp/users 2022-12-07 15:59:03,960 INFO [dispatcher-CoarseGrainedScheduler] scheduler.JESSchedulerBackend$JESAsSchedulerBackendEndpoint (Logging.scala:logInfo(57)): Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.34.11.224:53240) with ID 2, ResourceProfileId 0 2022-12-07 15:59:03,963 INFO [spark-listener-group-shared] scheduler.ExecutorEventListener (Logging.scala:logInfo(57)): Got executor added event for 2 @ 1670428743962 2022-12-07 15:59:03,964 INFO [spark-listener-group-shared] glue.ExecutorTaskManagement (Logging.scala:logInfo(57)): connected executor 2 2022-12-07 15:59:04,637 INFO [dispatcher-BlockManagerMaster] storage.BlockManagerMasterEndpoint (Logging.scala:logInfo(57)): Registering block manager 172.34.11.224:42111 with 5.8 GiB RAM, BlockManagerId(2, 172.34.11.224, 42111, None) 2022-12-07 15:59:05,498 INFO [Thread-12] table.HoodieTableMetaClient (HoodieTableMetaClient.java:(105)): Loading HoodieTableMetaClient from s3a://glue-learn-begineers/tmp/users 2022-12-07 15:59:05,942 INFO [Thread-12] table.HoodieTableConfig (HoodieTableConfig.java:(170)): Loading table properties from s3a://glue-learn-begineers/tmp/users/.hoodie/hoodie.properties 2022-12-07 15:59:06,103 INFO [Thread-12] table.HoodieTableMetaClient (HoodieTableMetaClient.java:(125)): Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://glue-learn-begineers/tmp/users 2022-12-07 15:59:06,103 INFO [Thread-12] table.HoodieTableMetaClient (HoodieTableMetaClient.java:initTableAndGetMetaClient(381)): Finished initializing Table of type COPY_ON_WRITE from s3a://glue-learn-begineers/tmp/users 2022-12-07 15:59:06,383 INFO [Thread-12] hudi.HoodieSparkSqlWriter$ (HoodieSparkSqlWriter.scala:write(221)): Registered avro schema : { "type": "record", "name": "users_record", "namespace": "hoodie.users", "fields": [ { "name": "emp_id", "type": [ "null", "long" ], "default": null }, { "name": "employee_name", "type": [ "null", "string" ], "default": null }, { "name": "department", "type": [ "null", "string" ], "default": null }, { "name": "state", "type": [ "null", "string" ], "default": null }, { "name": "salary", "type": [ "null", "long" ], "default": null }, { "name": "age", "type": [ "null", "long" ], "default": null }, { "name": 
"bonus", "type": [ "null", "long" ], "default": null }, { "name": "ts", "type": [ "null", "long" ], "default": null } ] }

2022-12-07 15:59:06,985 INFO [Thread-12] codegen.CodeGenerator (Logging.scala:logInfo(57)): Code generated in 323.430584 ms 2022-12-07 15:59:07,239 INFO [Thread-12] embedded.EmbeddedTimelineService (EmbeddedTimelineServerHelper.java:startTimelineService(67)): Starting Timeline service !! 2022-12-07 15:59:07,241 INFO [Thread-12] embedded.EmbeddedTimelineService (EmbeddedTimelineService.java:setHostAddr(100)): Overriding hostIp to (172.36.33.155) found in spark-conf. It was null 2022-12-07 15:59:07,260 INFO [Thread-12] view.FileSystemViewManager (FileSystemViewManager.java:createViewManager(232)): Creating View Manager with storage type :MEMORY 2022-12-07 15:59:07,262 INFO [Thread-12] view.FileSystemViewManager (FileSystemViewManager.java:createViewManager(244)): Creating in-memory based Table View 2022-12-07 15:59:07,287 INFO [Thread-12] util.log (Log.java:initialized(193)): Logging initialized @32060ms to org.apache.hudi.org.apache.jetty.util.log.Slf4jLog 2022-12-07 15:59:07,569 INFO [Thread-12] javalin.Javalin (Javalin.java:start(134)):


      / /____ _ _   __ ____ _ / /(_)____
 __  / // __ `/| | / // __ `// // // __ \
/ /_/ // /_/ / | |/ // /_/ // // // / / /
\____/ \__,_/  |___/ \__,_//_//_//_/ /_/

    https://javalin.io/documentation

2022-12-07 15:59:07,571 INFO [Thread-12] javalin.Javalin (Javalin.java:start(139)): Starting Javalin ... 2022-12-07 15:59:07,731 INFO [dispatcher-CoarseGrainedScheduler] scheduler.JESSchedulerBackend$JESAsSchedulerBackendEndpoint (Logging.scala:logInfo(57)): Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.34.179.199:52900) with ID 1, ResourceProfileId 0 2022-12-07 15:59:07,732 INFO [spark-listener-group-shared] scheduler.ExecutorEventListener (Logging.scala:logInfo(57)): Got executor added event for 1 @ 1670428747732 2022-12-07 15:59:07,733 INFO [spark-listener-group-shared] glue.ExecutorTaskManagement (Logging.scala:logInfo(57)): connected executor 1 2022-12-07 15:59:07,843 INFO [Thread-12] javalin.Javalin (JettyServerUtil.kt:initialize(113)): Listening on http://localhost:45551/ 2022-12-07 15:59:07,843 INFO [Thread-12] javalin.Javalin (Javalin.java:start(149)): Javalin started in 280ms \o/ 2022-12-07 15:59:07,843 INFO [Thread-12] service.TimelineService (TimelineService.java:startService(280)): Starting Timeline server on port :45551 2022-12-07 15:59:07,843 INFO [Thread-12] embedded.EmbeddedTimelineService (EmbeddedTimelineService.java:startServer(95)): Started embedded timeline server at 172.36.33.155:45551 2022-12-07 15:59:07,859 INFO [Thread-12] hudi.HoodieSparkSqlWriter$ (HoodieSparkSqlWriter.scala:isAsyncCompactionEnabled(675)): Config.inlineCompactionEnabled ? false 2022-12-07 15:59:07,859 INFO [Thread-12] hudi.HoodieSparkSqlWriter$ (HoodieSparkSqlWriter.scala:isAsyncClusteringEnabled(686)): Config.asyncClusteringEnabled ? false 2022-12-07 15:59:07,860 INFO [Thread-12] table.HoodieTableMetaClient (HoodieTableMetaClient.java:(105)): Loading HoodieTableMetaClient from s3a://glue-learn-begineers/tmp/users 2022-12-07 15:59:08,079 INFO [Thread-12] table.HoodieTableConfig (HoodieTableConfig.java:(170)): Loading table properties from s3a://glue-learn-begineers/tmp/users/.hoodie/hoodie.properties 2022-12-07 15:59:08,226 INFO [Thread-12] table.HoodieTableMetaClient (HoodieTableMetaClient.java:(125)): Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://glue-learn-begineers/tmp/users 2022-12-07 15:59:08,226 INFO [Thread-12] table.HoodieTableMetaClient (HoodieTableMetaClient.java:(128)): Loading Active commit timeline for s3a://glue-learn-begineers/tmp/users 2022-12-07 15:59:08,542 INFO [Thread-12] timeline.HoodieActiveTimeline (HoodieActiveTimeline.java:(115)): Loaded instants upto : Optional.empty 2022-12-07 15:59:08,550 INFO [Thread-12] client.AbstractHoodieWriteClient (AbstractHoodieWriteClient.java:startCommit(714)): Generate a new instant time: 20221207155854 action: commit 2022-12-07 15:59:08,551 INFO [Thread-12] heartbeat.HoodieHeartbeatClient (HoodieHeartbeatClient.java:start(166)): Received request to start heartbeat for instant time 20221207155854 2022-12-07 15:59:08,751 INFO [dispatcher-BlockManagerMaster] storage.BlockManagerMasterEndpoint (Logging.scala:logInfo(57)): Registering block manager 172.34.179.199:34373 with 5.8 GiB RAM, BlockManagerId(1, 172.34.179.199, 34373, None) 2022-12-07 15:59:09,002 INFO [Thread-12] timeline.HoodieActiveTimeline (HoodieActiveTimeline.java:createNewInstant(144)): Creating a new instant [==>20221207155854commitREQUESTED] 2022-12-07 15:59:09,648 INFO [Thread-12] table.HoodieTableMetaClient (HoodieTableMetaClient.java:(105)): Loading HoodieTableMetaClient from s3a://glue-learn-begineers/tmp/users 2022-12-07 15:59:09,745 INFO [pool-5-thread-1] glue.LogPusher (Logging.scala:logInfo(57)): 
uploading /tmp/spark-event-logs/ to s3://aws-glue-assets-043916019468-us-west-2/sparkHistoryLogs/ 2022-12-07 15:59:10,081 INFO [Thread-12] table.HoodieTableConfig (HoodieTableConfig.java:(170)): Loading table properties from s3a://glue-learn-begineers/tmp/users/.hoodie/hoodie.properties 2022-12-07 15:59:10,239 INFO [Thread-12] table.HoodieTableMetaClient (HoodieTableMetaClient.java:(125)): Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://glue-learn-begineers/tmp/users 2022-12-07 15:59:10,239 INFO [Thread-12] table.HoodieTableMetaClient (HoodieTableMetaClient.java:(128)): Loading Active commit timeline for s3a://glue-learn-begineers/tmp/users 2022-12-07 15:59:10,296 INFO [pool-5-thread-1] s3n.MultipartUploadOutputStream (MultipartUploadOutputStream.java:close(418)): close closed:false s3://aws-glue-assets-043916019468-us-west-2/sparkHistoryLogs/spark-application-1670428724710.inprogress 2022-12-07 15:59:10,540 INFO [Thread-12] timeline.HoodieActiveTimeline (HoodieActiveTimeline.java:(115)): Loaded instants upto : Option{val=[==>20221207155854commitREQUESTED]} 2022-12-07 15:59:10,560 INFO [Thread-12] view.FileSystemViewManager (FileSystemViewManager.java:createViewManager(232)): Creating View Manager with storage type :REMOTE_FIRST 2022-12-07 15:59:10,565 INFO [Thread-12] view.FileSystemViewManager (FileSystemViewManager.java:createViewManager(252)): Creating remote first table view 2022-12-07 15:59:10,575 INFO [Thread-12] transaction.TransactionManager (TransactionManager.java:beginTransaction(61)): Latest completed transaction instant Optional.empty 2022-12-07 15:59:10,575 INFO [Thread-12] transaction.TransactionManager (TransactionManager.java:beginTransaction(63)): Transaction starting with transaction owner Option{val=[==>20221207155854commitINFLIGHT]} 2022-12-07 15:59:10,576 INFO [Thread-12] lock.LockManager (LockManager.java:getLockProvider(96)): LockProvider org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider 2022-12-07 15:59:10,812 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(73)): Error from Python:Traceback (most recent call last): File "/tmp/hudi-test.py", line 114, in df.write.format("hudi").options(*hudi_options).mode("overwrite").save(final_base_path) File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 1109, in save self._jwrite.save(path) File "/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in call answer, self.gateway_client, self.target_id, self.name) File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco return f(a, **kw) File "/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value format(target_id, ".", name), value) py4j.protocol.Py4JJavaError: An error occurred while calling o136.save. 
: org.apache.hudi.exception.HoodieException: Unable to load class at org.apache.hudi.common.util.ReflectionUtils.getClass(ReflectionUtils.java:58) at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:90) at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:101) at org.apache.hudi.client.transaction.lock.LockManager.getLockProvider(LockManager.java:97) at org.apache.hudi.client.transaction.lock.LockManager.lock(LockManager.java:61) at org.apache.hudi.client.transaction.TransactionManager.beginTransaction(TransactionManager.java:64) at org.apache.hudi.client.AbstractHoodieWriteClient.preWrite(AbstractHoodieWriteClient.java:402) at org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:156) at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:214) at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:265) at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:164) at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90) at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:134) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:133) at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989) at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107) at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232) at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135) at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107) at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68) at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989) at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438) at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:750) Caused by: java.lang.ClassNotFoundException: org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider at java.net.URLClassLoader.findClass(URLClassLoader.java:387) at java.lang.ClassLoader.loadClass(ClassLoader.java:418) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) at java.lang.ClassLoader.loadClass(ClassLoader.java:351) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) at org.apache.hudi.common.util.ReflectionUtils.getClass(ReflectionUtils.java:54) ... 48 more

2022-12-07 15:59:10,836 ERROR [main] glueexceptionanalysis.GlueExceptionAnalysisListener (Logging.scala:logError(9)): [Glue Exception Analysis] { "Event": "GlueETLJobExceptionEvent", "Timestamp": 1670428750832, "Failure Reason": "Traceback (most recent call last):\n File \"/tmp/hudi-test.py\", line 114, in \n df.write.format(\"hudi\").options(hudi_options).mode(\"overwrite\").save(final_base_path)\n File \"/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py\", line 1109, in save\n self._jwrite.save(path)\n File \"/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py\", line 1305, in call\n answer, self.gateway_client, self.target_id, self.name)\n File \"/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py\", line 111, in deco\n return f(*a, *kw)\n File \"/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py\", line 328, in get_return_value\n format(target_id, \".\", name), value)\npy4j.protocol.Py4JJavaError: An error occurred while calling o136.save.\n: org.apache.hudi.exception.HoodieException: Unable to load class\n\tat org.apache.hudi.common.util.ReflectionUtils.getClass(ReflectionUtils.java:58)\n\tat org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:90)\n\tat org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:101)\n\tat org.apache.hudi.client.transaction.lock.LockManager.getLockProvider(LockManager.java:97)\n\tat org.apache.hudi.client.transaction.lock.LockManager.lock(LockManager.java:61)\n\tat org.apache.hudi.client.transaction.TransactionManager.beginTransaction(TransactionManager.java:64)\n\tat org.apache.hudi.client.AbstractHoodieWriteClient.preWrite(AbstractHoodieWriteClient.java:402)\n\tat org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:156)\n\tat org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:214)\n\tat org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:265)\n\tat org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:164)\n\tat org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)\n\tat org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)\n\tat org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)\n\tat org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)\n\tat org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185)\n\tat org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)\n\tat org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)\n\tat org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220)\n\tat org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181)\n\tat org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:134)\n\tat org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:133)\n\tat org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)\n\tat org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)\n\tat org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)\n\tat org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)\n\tat org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)\n\tat 
org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)\n\tat org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)\n\tat org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)\n\tat org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)\n\tat org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)\n\tat org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)\n\tat org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)\n\tat org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)\n\tat org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)\n\tat org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)\n\tat org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:498)\n\tat py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\tat py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\tat py4j.Gateway.invoke(Gateway.java:282)\n\tat py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\tat py4j.commands.CallCommand.execute(CallCommand.java:79)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:238)\n\tat java.lang.Thread.run(Thread.java:750)\nCaused by: java.lang.ClassNotFoundException: org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider\n\tat java.net.URLClassLoader.findClass(URLClassLoader.java:387)\n\tat java.lang.ClassLoader.loadClass(ClassLoader.java:418)\n\tat sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)\n\tat java.lang.ClassLoader.loadClass(ClassLoader.java:351)\n\tat java.lang.Class.forName0(Native Method)\n\tat java.lang.Class.forName(Class.java:348)\n\tat org.apache.hudi.common.util.ReflectionUtils.getClass(ReflectionUtils.java:54)\n\t... 48 more\n", "Stack Trace": [ { "Declaring Class": "get_return_value", "Method Name": "format(target_id, \".\", name), value)", "File Name": "/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", "Line Number": 328 }, { "Declaring Class": "deco", "Method Name": "return f(a, kw)", "File Name": "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", "Line Number": 111 }, { "Declaring Class": "call", "Method Name": "answer, self.gateway_client, self.target_id, self.name)", "File Name": "/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", "Line Number": 1305 }, { "Declaring Class": "save", "Method Name": "self._jwrite.save(path)", "File Name": "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", "Line Number": 1109 }, { "Declaring Class": "", "Method Name": "df.write.format(\"hudi\").options(**hudi_options).mode(\"overwrite\").save(final_base_path)", "File Name": "/tmp/hudi-test.py", "Line Number": 114 } ], "Last Executed Line number": 114, "script": "hudi-test.py" }

2022-12-07 15:59:10,935 ERROR [main] glueexceptionanalysis.GlueExceptionAnalysisListener (Logging.scala:logError(9)): [Glue Exception Analysis] Last Executed Line number from script hudi-test.py: 114 2022-12-07 15:59:10,938 INFO [main] glue.ResourceManagerSocketWriter (SocketWriter.scala:start(34)): Socket client for com.amazonaws.services.glue.ResourceManagerSocketWriter started correctly 2022-12-07 15:59:10,946 INFO [main] glue.ProcessLauncher (Logging.scala:logInfo(57)): postprocessing 2022-12-07 15:59:10,947 INFO [main] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): Analyzer: Error stopping analyzer process null 2022-12-07 15:59:10,949 INFO [main] glue.AnalyzerLogHelper$ (Logging.scala:logInfo(24)): Analyzer: Error terminating analyzer process null 2022-12-07 15:59:10,950 INFO [main] glue.LogPusher (Logging.scala:logInfo(57)): stopping Continuous Logging: Shutting down cloudwatch appender.


soumilshah1995 commented 1 year ago

@nsivabalan confirmed that it was not reproducible with 0.12.1 in our Jenkins test infra.

I have added more details for you guys 🗡️

soumilshah1995 commented 1 year ago

When using Glue 4.0, which now natively supports Hudi:

image

image

image

Link: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-hudi.html

Glue 4.0 now supports Hudi out of the box, but writing with the DynamoDB lock provider still fails. I tried Glue 3.0 as well, using a custom jar file; neither worked.
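For reference, the AWS doc linked above enables native Hudi support through job parameters rather than custom jars. A minimal sketch of those arguments, expressed here as the job's default-arguments dict (the parameter names follow the linked doc; the dict itself is only illustrative):

# Sketch: Glue 4.0 job parameters that switch on the natively shipped Hudi jars,
# per https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-hudi.html
glue4_default_arguments = {
    "--datalake-formats": "hudi",
    "--conf": "spark.serializer=org.apache.spark.serializer.KryoSerializer",
}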

soumilshah1995 commented 1 year ago

Any updates?

codope commented 1 year ago

@soumilshah1995 What you're hitting is a ClassNotFoundException for org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider. If you use the right Hudi bundle, this class should be present. In OSS, to make use of DynamoDBBasedLockProvider you also need hudi-aws-bundle on the classpath. This looks like a default Glue setup issue, and I would encourage you to raise an AWS support ticket. We have been running continuous tests with different lock providers for multi-writer scenarios successfully.
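For OSS Spark outside Glue, a minimal sketch of what "hudi-aws-bundle on the classpath" can look like (artifact coordinates are illustrative; the version should match the hudi-spark bundle you already use):

# Sketch: pull hudi-aws-bundle in next to the Spark bundle so that
# org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider resolves at runtime.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer')
    .config('spark.jars.packages',
            'org.apache.hudi:hudi-spark3-bundle_2.12:0.12.1,'
            'org.apache.hudi:hudi-aws-bundle:0.12.1')  # ships the DynamoDB lock provider
    .getOrCreate()
)

On Glue, the equivalent would be adding the hudi-aws-bundle jar's S3 path to the job's --extra-jars argument alongside the Hudi Spark bundle.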

soumilshah1995 commented 1 year ago

Hi, not sure if you are aware: Glue 4.0 does not require you to specify a classpath or jar files, since Hudi is now natively supported in Glue 4.0. Feel free to give it a try if needed; I have provided detailed steps with instructions to replicate the error. I would appreciate the help if someone can try these and let me know exactly what needs to be done @nsivabalan
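A quick way to check whether the lock provider class is actually visible to the driver before attempting a write (a hedged diagnostic for the same Glue/PySpark session, using only the py4j gateway that PySpark already exposes):

# Sketch: probe the JVM classpath; this raises a Py4JJavaError wrapping
# ClassNotFoundException if the class is absent from the driver classpath.
spark.sparkContext._jvm.java.lang.Class.forName(
    'org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider'
)

If this fails on Glue 4.0 even with native Hudi support enabled, that would suggest the natively shipped bundle does not include the hudi-aws classes.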

soumilshah1995 commented 1 year ago

image

closing this ticket :D

soumilshah1995 commented 1 year ago

For anyone who is curious, here is the code that works:

try:
    import os
    import sys
    import uuid

    import boto3

    import pyspark
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, asc, desc
    from awsglue.utils import getResolvedOptions
    from awsglue.dynamicframe import DynamicFrame
    from awsglue.context import GlueContext

    from faker import Faker

    print("All modules are loaded .....")

except Exception as e:
    print("Some modules are missing {} ".format(e))

# ----------------------------------------------------------------------------------------
#                 Settings
# -----------------------------------------------------------------------------------------

database_name1 = "hudidb"
table_name = "hudi_table"
base_s3_path = "s3a://glue-learn-begineers"
final_base_path = "{base_s3_path}/{table_name}".format(
    base_s3_path=base_s3_path, table_name=table_name
)
curr_session = boto3.session.Session()
curr_region = curr_session.region_name

# ----------------------------------------------------------------------------------------------------
global faker
faker = Faker()

class DataGenerator(object):

    @staticmethod
    def get_data():
        return [
            (
                x,
                faker.name(),
                faker.random_element(elements=('IT', 'HR', 'Sales', 'Marketing')),
                faker.random_element(elements=('CA', 'NY', 'TX', 'FL', 'IL', 'RJ')),
                faker.random_int(min=10000, max=150000),
                faker.random_int(min=18, max=60),
                faker.random_int(min=0, max=100000),
                faker.unix_time()
            ) for x in range(5)
        ]

def create_spark_session():
    spark = SparkSession \
        .builder \
        .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
        .config('spark.sql.hive.convertMetastoreParquet', 'false') \
        .config('spark.sql.legacy.pathOptionBehavior.enabled', 'true') \
        .getOrCreate()
    return spark

spark = create_spark_session()
sc = spark.sparkContext
glueContext = GlueContext(sc)

"""
CHOOSE ONE 
"hoodie.datasource.write.storage.type": "MERGE_ON_READ",
"hoodie.datasource.write.storage.type": "COPY_ON_WRITE",
"""

hudi_options = {
    'hoodie.table.name': table_name,
    "hoodie.datasource.write.storage.type": "COPY_ON_WRITE",
    'hoodie.datasource.write.recordkey.field': 'emp_id',
    'hoodie.datasource.write.table.name': table_name,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'state',

    'hoodie.datasource.hive_sync.enable': 'true',
    "hoodie.datasource.hive_sync.mode":"hms",
    'hoodie.datasource.hive_sync.sync_as_datasource': 'false',
    'hoodie.datasource.hive_sync.database': database_name1,
    'hoodie.datasource.hive_sync.table': table_name,
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.write.hive_style_partitioning': 'true',

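    # Optimistic concurrency control via the DynamoDB-based lock provider; this is the
    # class that failed earlier with ClassNotFoundException when hudi-aws was missing.
    # endpoint_url and billing_mode below are the options added relative to the failing
    # script; billing_mode applies if Hudi auto-creates the lock table.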
    'hoodie.write.concurrency.mode' : 'optimistic_concurrency_control'
    ,'hoodie.cleaner.policy.failed.writes' : 'LAZY'
    ,'hoodie.write.lock.provider' : 'org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider'
    ,'hoodie.write.lock.dynamodb.table' : 'hudi-blog-lock-table'
    ,'hoodie.write.lock.dynamodb.partition_key' : 'tablename'
    ,'hoodie.write.lock.dynamodb.region' : '{0}'.format(curr_region)
    ,'hoodie.write.lock.dynamodb.endpoint_url' : 'dynamodb.{0}.amazonaws.com'.format(curr_region)
    ,'hoodie.write.lock.dynamodb.billing_mode' : 'PAY_PER_REQUEST'
    ,'hoodie.bulkinsert.shuffle.parallelism': 2000

}

# ====================================================
"""Create Spark Data Frame """
# ====================================================
data = DataGenerator.get_data()

columns = ["emp_id", "employee_name", "department", "state", "salary", "age", "bonus", "ts"]
df = spark.createDataFrame(data=data, schema=columns)
df.write.format("hudi").options(**hudi_options).mode("overwrite").save(final_base_path)

# ====================================================
"""APPEND """
# ====================================================

simpleDataUpd = [
    (6, "This is APPEND", "Sales", "RJ", 81000, 30, 23000, 827307999),
    (7, "This is APPEND", "Engineering", "RJ", 79000, 53, 15000, 1627694678),
]

columns = ["emp_id", "employee_name", "department", "state", "salary", "age", "bonus", "ts"]
usr_up_df = spark.createDataFrame(data=simpleDataUpd, schema=columns)
usr_up_df.write.format("hudi").options(**hudi_options).mode("append").save(final_base_path)

# ====================================================
"""UPDATE """
# ====================================================
simpleDataUpd = [
    (3, "this is update on data lake", "Sales", "RJ", 81000, 30, 23000, 827307999),
]
columns = ["emp_id", "employee_name", "department", "state", "salary", "age", "bonus", "ts"]
usr_up_df = spark.createDataFrame(data=simpleDataUpd, schema=columns)
usr_up_df.write.format("hudi").options(**hudi_options).mode("append").save(final_base_path)

# ====================================================
"""SOFT DELETE """
# ====================================================
# from pyspark.sql.functions import lit
# from functools import reduce
#
#
# print("\n")
# soft_delete_ds  = spark.sql("SELECT * FROM hudidb.hudi_table_rt where emp_id='4' ")
# print(soft_delete_ds.show())
# print("\n")
#
# # prepare the soft deletes by ensuring the appropriate fields are nullified
# meta_columns = ["_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key","_hoodie_partition_path", "_hoodie_file_name"]
# excluded_columns = meta_columns + ["ts", "emp_id", "partitionpath"]
# nullify_columns = list(filter(lambda field: field[0] not in excluded_columns, list(map(lambda field: (field.name, field.dataType), soft_delete_ds.schema.fields))))
#
# soft_delete_df = reduce(lambda df, col: df.withColumn(col[0], lit(None).cast(col[1])),
#                         nullify_columns, reduce(lambda df,col: df.drop(col[0]), meta_columns, soft_delete_ds))
#
#
# soft_delete_df.write.format("hudi").options(**hudi_options).mode("append").save(final_base_path)
#
#
#

# # ====================================================
# """HARD DELETE """
# # ====================================================
#
# ds = spark.sql("SELECT * FROM hudidb.hudi_table_rt where emp_id='2' ")
#
# hudi_hard_delete_options = {
#     'hoodie.table.name': table_name,
#     'hoodie.datasource.write.recordkey.field': 'emp_id',
#     'hoodie.datasource.write.table.name': table_name,
#     'hoodie.datasource.write.operation': 'delete',
#     'hoodie.datasource.write.precombine.field': 'ts',
#     'hoodie.upsert.shuffle.parallelism': 2,
#     'hoodie.insert.shuffle.parallelism': 2
# }
#
# ds.write.format("hudi").options(**hudi_hard_delete_options).mode("append").save(final_base_path)
nsivabalan commented 4 days ago

Sorry, did we find the root cause of why it was failing earlier and how it later got resolved? Did a newer version of Hudi work, or were any config changes needed?