apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Glue 3.0 with HUDI marketplace Connector #7459

Closed: soumilshah1995 closed this issue 1 year ago

soumilshah1995 commented 1 year ago

Glue Version 3.0

import sys
import boto3
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.session import SparkSession

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
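# Hudi recommends Kryo serialization; 'spark.sql.hive.convertMetastoreParquet'
# is disabled so that Hive-synced Hudi tables are read through Hudi's input
# format rather than Spark's built-in Parquet reader.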
spark = SparkSession.builder.config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer').config(
    'spark.sql.hive.convertMetastoreParquet', 'false').getOrCreate()

sc = spark.sparkContext
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
curr_session = boto3.session.Session()
curr_region = curr_session.region_name
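# curr_region is reused below to point the Hudi DynamoDB lock provider at
# the DynamoDB endpoint in the job's own region.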

# Script generated for node AWS Glue Data Catalog
AWSGlueDataCatalog_node1671035205024 = glueContext.create_dynamic_frame.from_catalog(
    database="dev.dynamodbdb",
    table_name="dev_users",
    transformation_ctx="AWSGlueDataCatalog_node1671035205024",
)

# Script generated for node Rename Field
RenameField_node1671035214862 = RenameField.apply(
    frame=AWSGlueDataCatalog_node1671035205024,
    old_name="id",
    new_name="pk",
    transformation_ctx="RenameField_node1671035214862",
)

# Script generated for node Change Schema (Apply Mapping)
ChangeSchemaApplyMapping_node1671035230105 = ApplyMapping.apply(
    frame=RenameField_node1671035214862,
    mappings=[
        ("address", "string", "address", "string"),
        ("city", "string", "city", "string"),
        ("last_name", "string", "last_name", "string"),
        ("text", "string", "text", "string"),
        ("pk", "string", "pk", "string"),
        ("state", "string", "state", "string"),
        ("first_name", "string", "first_name", "string"),
    ],
    transformation_ctx="ChangeSchemaApplyMapping_node1671035230105",
)

# Script generated for node Apache Hudi Connector 0.10.1 for AWS Glue 3.0
ApacheHudiConnector0101forAWSGlue30_node1671035245333 = (
    glueContext.write_dynamic_frame.from_options(
        frame=ChangeSchemaApplyMapping_node1671035230105,
        connection_type="marketplace.spark",
        connection_options={
            "path": "s3://glue-learn-begineers/hudi/",
            "connectionName": "hudi-connection",

            # Hudi table and write settings
            'hoodie.table.name': "hudi_table",
            'hoodie.datasource.write.recordkey.field': 'pk',
            'hoodie.datasource.write.table.name': "hudi_table",
            'hoodie.datasource.write.operation': 'upsert',
            'hoodie.datasource.write.precombine.field': 'first_name',

            # Hive sync settings (register the table in the Glue Data Catalog)
            'hoodie.datasource.hive_sync.enable': 'true',
            'hoodie.datasource.hive_sync.mode': 'hms',
            'hoodie.datasource.hive_sync.sync_as_datasource': 'false',
            'hoodie.datasource.hive_sync.database': "testdb",
            'hoodie.datasource.hive_sync.table': "hudi_table",
            'hoodie.datasource.hive_sync.use_jdbc': 'false',
            'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
            'hoodie.datasource.write.hive_style_partitioning': 'true',

            # Optimistic concurrency control backed by a DynamoDB lock table
            'hoodie.write.concurrency.mode': 'optimistic_concurrency_control',
            'hoodie.cleaner.policy.failed.writes': 'LAZY',
            'hoodie.write.lock.provider': 'org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider',
            'hoodie.write.lock.dynamodb.table': 'hudi-blog-lock-table',
            'hoodie.write.lock.dynamodb.partition_key': 'tablename',
            'hoodie.write.lock.dynamodb.region': curr_region,
            'hoodie.write.lock.dynamodb.endpoint_url': 'dynamodb.{0}.amazonaws.com'.format(curr_region),
            'hoodie.write.lock.dynamodb.billing_mode': 'PAY_PER_REQUEST',
            'hoodie.bulkinsert.shuffle.parallelism': 2000,
        },
        transformation_ctx="ApacheHudiConnector0101forAWSGlue30_node1671035245333",
    )
)

job.commit()
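For reference, the DynamoDBBasedLockProvider configured above stores its locks in the 'hudi-blog-lock-table' DynamoDB table. Hudi can create that table on first use, but a minimal one-time setup sketch, assuming (per Hudi's lock provider docs) that the lock table uses a string hash key named "key", would be:

import boto3

# One-time setup sketch for the lock table referenced by
# 'hoodie.write.lock.dynamodb.table' above. Pre-creating it avoids having
# to grant dynamodb:CreateTable to the Glue job role.
# Assumption: the lock provider expects a string hash key named "key".
dynamodb = boto3.client("dynamodb")
dynamodb.create_table(
    TableName="hudi-blog-lock-table",
    KeySchema=[{"AttributeName": "key", "KeyType": "HASH"}],
    AttributeDefinitions=[{"AttributeName": "key", "AttributeType": "S"}],
    BillingMode="PAY_PER_REQUEST",
)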

Error Message: An error occurred while calling o128.pyWriteDynamicFrame. Failed to upsert for commit time 20221214165008336

Detailed output logs:

2022-12-14 16:50:33,933 main WARN JNDI lookup class is not available because this JRE does not support JNDI. JNDI string lookups will not be available, continuing configuration. java.lang.ClassNotFoundException: org.apache.logging.log4j.core.lookup.JndiLookup
    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:264)
    at org.apache.logging.log4j.util.LoaderUtil.loadClass(LoaderUtil.java:173)
    at org.apache.logging.log4j.util.LoaderUtil.newInstanceOf(LoaderUtil.java:211)
    at org.apache.logging.log4j.util.LoaderUtil.newCheckedInstanceOf(LoaderUtil.java:232)
    at org.apache.logging.log4j.core.util.Loader.newCheckedInstanceOf(Loader.java:301)
    at org.apache.logging.log4j.core.lookup.Interpolator.<init>(Interpolator.java:95)
    at org.apache.logging.log4j.core.config.AbstractConfiguration.<init>(AbstractConfiguration.java:114)
    at org.apache.logging.log4j.core.config.DefaultConfiguration.<init>(DefaultConfiguration.java:55)
    at org.apache.logging.log4j.core.layout.PatternLayout$Builder.build(PatternLayout.java:430)
    at org.apache.logging.log4j.core.layout.PatternLayout.createDefaultLayout(PatternLayout.java:324)
    at org.apache.logging.log4j.core.appender.ConsoleAppender$Builder.<init>(ConsoleAppender.java:121)
    at org.apache.logging.log4j.core.appender.ConsoleAppender.newBuilder(ConsoleAppender.java:111)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.logging.log4j.core.config.plugins.util.PluginBuilder.createBuilder(PluginBuilder.java:158)
    at org.apache.logging.log4j.core.config.plugins.util.PluginBuilder.build(PluginBuilder.java:119)
    at org.apache.logging.log4j.core.config.AbstractConfiguration.createPluginObject(AbstractConfiguration.java:813)
    at org.apache.logging.log4j.core.config.AbstractConfiguration.createConfiguration(AbstractConfiguration.java:753)
    at org.apache.logging.log4j.core.config.AbstractConfiguration.createConfiguration(AbstractConfiguration.java:745)
    at org.apache.logging.log4j.core.config.AbstractConfiguration.doConfigure(AbstractConfiguration.java:389)
    at org.apache.logging.log4j.core.config.AbstractConfiguration.initialize(AbstractConfiguration.java:169)
    at org.apache.logging.log4j.core.config.AbstractConfiguration.start(AbstractConfiguration.java:181)
    at org.apache.logging.log4j.core.LoggerContext.setConfiguration(LoggerContext.java:446)
    at org.apache.logging.log4j.core.LoggerContext.reconfigure(LoggerContext.java:520)
    at org.apache.logging.log4j.core.LoggerContext.reconfigure(LoggerContext.java:536)
    at org.apache.logging.log4j.core.LoggerContext.start(LoggerContext.java:214)
    at org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:146)
    at org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:41)
    at org.apache.logging.log4j.LogManager.getContext(LogManager.java:194)
    at org.apache.logging.log4j.LogManager.getLogger(LogManager.java:597)
    at org.apache.spark.metrics.sink.MetricsConfigUtils.<clinit>(MetricsConfigUtils.java:12)
    at org.apache.spark.metrics.sink.MetricsProxyInfo.fromConfig(MetricsProxyInfo.java:17)
    at com.amazonaws.services.glue.cloudwatch.CloudWatchLogsAppenderCommon.<init>(CloudWatchLogsAppenderCommon.java:62)
    at com.amazonaws.services.glue.cloudwatch.CloudWatchLogsAppenderCommon$CloudWatchLogsAppenderCommonBuilder.build(CloudWatchLogsAppenderCommon.java:79)
    at com.amazonaws.services.glue.cloudwatch.CloudWatchAppender.activateOptions(CloudWatchAppender.java:73)
    at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:307)
    at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:172)
    at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:104)
    at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:842)
    at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:768)
    at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:648)
    at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:514)
    at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:580)
    at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
    at org.apache.log4j.LogManager.<clinit>(LogManager.java:127)
    at org.slf4j.impl.Log4jLoggerFactory.<init>(Log4jLoggerFactory.java:66)
    at org.slf4j.impl.StaticLoggerBinder.<init>(StaticLoggerBinder.java:72)
    at org.slf4j.impl.StaticLoggerBinder.<clinit>(StaticLoggerBinder.java:45)
    at org.slf4j.LoggerFactory.bind(LoggerFactory.java:150)
    at org.slf4j.LoggerFactory.performInitialization(LoggerFactory.java:124)
    at org.slf4j.LoggerFactory.getILoggerFactory(LoggerFactory.java:412)
    at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:357)
    at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:383)
    at org.apache.spark.network.util.JavaUtils.<clinit>(JavaUtils.java:41)
    at org.apache.spark.internal.config.ConfigHelpers$.byteFromString(ConfigBuilder.scala:67)
    at org.apache.spark.internal.config.ConfigBuilder.$anonfun$bytesConf$1(ConfigBuilder.scala:259)
    at org.apache.spark.internal.config.ConfigBuilder.$anonfun$bytesConf$1$adapted(ConfigBuilder.scala:259)
    at org.apache.spark.internal.config.TypedConfigBuilder.$anonfun$transform$1(ConfigBuilder.scala:101)
    at org.apache.spark.internal.config.TypedConfigBuilder.createWithDefault(ConfigBuilder.scala:144)
    at org.apache.spark.internal.config.package$.<init>(package.scala:345)
    at org.apache.spark.internal.config.package$.<clinit>(package.scala)
    at org.apache.spark.SparkConf$.<init>(SparkConf.scala:654)
    at org.apache.spark.SparkConf$.<clinit>(SparkConf.scala)
    at org.apache.spark.SparkConf.set(SparkConf.scala:94)
    at org.apache.spark.SparkConf.$anonfun$loadFromSystemProperties$3(SparkConf.scala:76)
    at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:788)
    at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:230)
    at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:461)
    at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:461)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:787)
    at org.apache.spark.SparkConf.loadFromSystemProperties(SparkConf.scala:75)
    at org.apache.spark.SparkConf.<init>(SparkConf.scala:70)
    at org.apache.spark.SparkConf.<init>(SparkConf.scala:59)
    at com.amazonaws.services.glue.SparkProcessLauncherPlugin.getSparkConf(ProcessLauncher.scala:41)
    at com.amazonaws.services.glue.SparkProcessLauncherPlugin.getSparkConf$(ProcessLauncher.scala:40)
    at com.amazonaws.services.glue.ProcessLauncher$$anon$1.getSparkConf(ProcessLauncher.scala:78)
    at com.amazonaws.services.glue.ProcessLauncher.<init>(ProcessLauncher.scala:84)
    at com.amazonaws.services.glue.ProcessLauncher.<init>(ProcessLauncher.scala:78)
    at com.amazonaws.services.glue.ProcessLauncher$.main(ProcessLauncher.scala:29)
    at com.amazonaws.services.glue.ProcessLauncher.main(ProcessLauncher.scala)

Error logs:

2022-12-14 16:50:39,105 INFO [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] broadcast.TorrentBroadcast (Logging.scala:logInfo(57)): Reading broadcast variable 3 took 15 ms
2022-12-14 16:50:39,120 INFO [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] memory.MemoryStore (Logging.scala:logInfo(57)): Block broadcast_3 stored as values in memory (estimated size 40.0 B, free 5.8 GiB)
2022-12-14 16:50:39,133 ERROR [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] executor.Executor (Logging.scala:logError(94)): Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NullPointerException
    at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:819)
    at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:458)
    at com.amazonaws.services.glue.connections.DynamoConnection.getJobConf(DynamoConnection.scala:61)
    at com.amazonaws.services.glue.connections.DynamoConnection.getReader(DynamoConnection.scala:136)
    at com.amazonaws.services.glue.DynamicRecordRDD.compute(DataSource.scala:643)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
[The identical java.lang.NullPointerException stack trace then repeats verbatim for task attempts 0.1, 0.2, and 0.3 (TIDs 1-3) as Spark retries the task.]

I am also happy to walk through the steps on a call if needed. Once this works correctly, I am going to make tutorials for the community. If I am facing this issue and having a hard time, a lot of people must have hit it too, so if you or anyone can help, that lets me teach others as well.

This is for a hands-on tutorial I am making on how to move data from DynamoDB into Apache Hudi. My email is shahsoumil519@gmail.com.

soumilshah1995 commented 1 year ago

I also wanted to add:

If I try to do a count:

print("******count******")
print(AWSGlueDataCatalog_node1671035205024.count())
print("************")

[screenshot of the error]

Then I get a different error:

An error occurred while calling o112.count. java.lang.NullPointerException

I also tried this in Glue notebooks; here is the notebook for your reference: https://github.com/soumilshah1995/code-snippets/blob/main/hudi-notebook%20(2).ipynb

It's weirdly difficult to get this working correctly.
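A quick way to isolate whether the failure is in the catalog read itself, rather than anything Hudi-related, is to materialize the DynamicFrame before any Hudi write. A minimal sketch, assuming the same source table as above:

# Sanity-check sketch: force the catalog read to execute on its own.
# If this also throws a NullPointerException, the problem is in the
# Glue/DynamoDB read path, not in the Hudi connector.
df = glueContext.create_dynamic_frame.from_catalog(
    database="dev.dynamodbdb",
    table_name="dev_users",
).toDF()
df.printSchema()
print(df.count())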

umehrot2 commented 1 year ago

@soumilshah1995 the error mentioned in the notebook is different from the DynamoDB stack trace you have provided here, so I am confused about which issue you are currently blocked on. If it is reading from DynamoDB, as the stack trace provided here suggests, then this is not the right forum for that issue (it is not related to Hudi); you should reach out to AWS Glue support for it.

soumilshah1995 commented 1 year ago

I did perform a lot of tests, and here is my honest, unbiased feedback: reading data from the catalog works fine for most sources, but with DynamoDB it throws an error. In other words, I suspect the problem is on the AWS side and not Hudi. I have already contacted AWS support about this issue and shall keep you posted.

Steps: created a DynamoDB table > inserted fake data > created a Glue database and populated the schema via a crawler. When I then used Glue to read the data from the catalog, it threw the weird error above. As said, my guess is the problem is on the AWS side.
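For anyone hitting the same NullPointerException on the catalog read, one possible workaround, sketched here under the assumption that the underlying table is named dev_users, is to bypass the Data Catalog and read DynamoDB directly with Glue's native DynamoDB connector:

# Workaround sketch: read the DynamoDB table directly instead of going
# through the catalog entry created by the crawler.
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.input.tableName": "dev_users",
        # Read at most half of the table's provisioned read capacity.
        "dynamodb.throughput.read.percent": "0.5",
        "dynamodb.splits": "2",
    },
)
print(dyf.count())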

soumilshah1995 commented 1 year ago

Closing this ticket as it's resolved after speaking to support :D Cheers!