apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Spark Structured Streaming writes to Hudi: Hive sync creates only the read-optimized table, not the real-time table #2409

Closed: wosow closed this issue 3 years ago

wosow commented 3 years ago

Spark Structured Streaming writes to Hudi and syncs to Hive, but only the read-optimized table is created and no real-time table is created; no errors are reported.

Environment Description

code as follows:

batchDF.write.format("org.apache.hudi")
  .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, "MERGE_ON_READ")
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, "upsert")
  .option(HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP, "10")
  .option("hoodie.datasource.compaction.async.enable", "true")
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "rec_id")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "modified")
  .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, "ads")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, hiveTableName)
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "dt")
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "dt")
  .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true")
  .option(HoodieWriteConfig.TABLE_NAME, hiveTableName)
  .option(HoodieIndexConfig.BLOOM_INDEX_UPDATE_PARTITION_PATH, "true")
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_URL_OPT_KEY, "jdbc:hive2://0.0.0.0:10000")
  .option(DataSourceWriteOptions.HIVE_USER_OPT_KEY, "")
  .option(DataSourceWriteOptions.HIVE_PASS_OPT_KEY, "")
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, classOf[MultiPartKeysValueExtractor].getName)
  .option(HoodieIndexConfig.INDEX_TYPE_PROP, HoodieIndex.IndexType.GLOBAL_BLOOM.name())
  .option("hoodie.insert.shuffle.parallelism", "10")
  .option("hoodie.upsert.shuffle.parallelism", "10")
  .mode("append")
  .save("/data/mor/user")

Only the user_ro table is created; there is no user_rt table.
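(For reference, a quick way to check what hive sync actually registered; this is a minimal sketch, assuming the `ads` database and a `user` table prefix from the options above, run in the same Spark session:)

```scala
// Sketch: list the tables hive sync registered for this dataset. For a
// MERGE_ON_READ table, both a read-optimized (_ro) and a real-time (_rt)
// table are expected. Database/table names here are assumptions from the
// options above.
spark.sql("SHOW TABLES IN ads LIKE 'user*'").show(false)
spark.sql("SHOW CREATE TABLE ads.user_ro").show(false)
```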

bvaradar commented 3 years ago

Can you copy the contents of hoodie.properties for the dataset here?

wosow commented 3 years ago

Can you copy the contents of hoodie.properties for the dataset here?

hoodie.properties is attached: hoodie.zip

bvaradar commented 3 years ago

It is indeed a MOR table. Can you check your driver logs? You might find some exceptions around registering the _rt table. Look for entries around the log message

"Trying to sync hoodie table "

wosow commented 3 years ago

It is indeed a MOR table. Can you check your driver logs? You might find some exceptions around registering the _rt table. Look for entries around the log message

"Trying to sync hoodie table "

The error is below: there is no SQL for creating the _rt table, only the _ro table.

----------------------------------------------------------------------------------------------------------------------------------------------
21/01/07 13:23:05 INFO ContextCleaner: Cleaned accumulator 371
21/01/07 13:23:05 INFO ContextCleaner: Cleaned accumulator 337
21/01/07 13:23:05 INFO ContextCleaner: Cleaned accumulator 404
21/01/07 13:23:05 INFO BlockManagerInfo: Removed broadcast_16_piece0 on bigdatadev03:18850 in memory (size: 73.0 KB, free: 2.5 GB)
21/01/07 13:23:05 INFO BlockManagerInfo: Removed broadcast_16_piece0 on bigdatadev02:6815 in memory (size: 73.0 KB, free: 3.5 GB)
21/01/07 13:23:05 INFO ContextCleaner: Cleaned accumulator 333
21/01/07 13:23:05 INFO ContextCleaner: Cleaned accumulator 418
21/01/07 13:23:05 INFO ContextCleaner: Cleaned accumulator 385
21/01/07 13:23:05 INFO ContextCleaner: Cleaned accumulator 410
21/01/07 13:23:08 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
21/01/07 13:23:08 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
21/01/07 13:23:10 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
21/01/07 13:23:10 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
21/01/07 13:23:10 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
21/01/07 13:23:10 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is MYSQL
21/01/07 13:23:10 INFO ObjectStore: Initialized ObjectStore
21/01/07 13:23:11 INFO HiveMetaStore: Added admin role in metastore
21/01/07 13:23:11 INFO HiveMetaStore: Added public role in metastore
21/01/07 13:23:11 INFO HiveMetaStore: No user is added in admin role, since config is empty
21/01/07 13:23:11 INFO HiveMetaStore: 0: get_all_databases
21/01/07 13:23:11 INFO audit: ugi=root  ip=unknown-ip-addr  cmd=get_all_databases   
21/01/07 13:23:11 INFO HiveMetaStore: 0: get_functions: db=default pat=*
21/01/07 13:23:11 INFO audit: ugi=root  ip=unknown-ip-addr  cmd=get_functions: db=default pat=* 
21/01/07 13:23:11 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
21/01/07 13:23:11 INFO HiveMetaStore: 0: get_functions: db=dw pat=*
21/01/07 13:23:11 INFO audit: ugi=root  ip=unknown-ip-addr  cmd=get_functions: db=dw pat=*  
21/01/07 13:23:11 INFO HiveSyncTool: Trying to sync hoodie table api_trade_ro with base path /data/stream/mor/api_trade of type MERGE_ON_READ
21/01/07 13:23:11 INFO HiveMetaStore: 0: get_table : db=ads tbl=api_trade_ro
21/01/07 13:23:11 INFO audit: ugi=root  ip=unknown-ip-addr  cmd=get_table : db=ads tbl=api_trade_ro 
21/01/07 13:23:11 INFO HoodieHiveClient: Found the last compaction commit as Option{val=null}
21/01/07 13:23:11 INFO HoodieHiveClient: Found the last delta commit Option{val=[20210107132154__deltacommit__COMPLETED]}
21/01/07 13:23:12 INFO HoodieHiveClient: Reading schema from /data/stream/mor/api_trade/dt=2021-01/350a9a01-538c-4a7e-8c17-09d2cdc85073-0_0-20-85_20210107132154.parquet
21/01/07 13:23:12 INFO HiveSyncTool: Hive table api_trade_ro is not found. Creating it
21/01/07 13:23:12 INFO HoodieHiveClient: Creating table with CREATE EXTERNAL TABLE  IF NOT EXISTS `ads`.`api_trade_ro`( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `topic` string, `kafka_partition` string, `kafka_timestamp` string, `kafka_offset` string, `current_time` string, `kafka_key` string, `kafka_value` string,  `modified` string, `created` string, `batch_time` string) PARTITIONED BY (`dt` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION '/data/stream/mor/api_trade'
21/01/07 13:23:12 INFO HoodieHiveClient: Executing SQL CREATE EXTERNAL TABLE  IF NOT EXISTS `ads`.`api_trade_ro`( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `topic` string, `kafka_partition` string, `kafka_timestamp` string, `kafka_offset` string, `current_time` string, `kafka_key` string, `kafka_value` string,  `modified` string, `created` string, `batch_time` string) PARTITIONED BY (`dt` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION '/data/stream/mor/api_trade'
21/01/07 13:24:09 INFO HiveSyncTool: Schema sync complete. Syncing partitions for api_trade_ro
21/01/07 13:24:09 INFO HiveSyncTool: Last commit time synced was found to be null
21/01/07 13:24:09 INFO HoodieHiveClient: Last commit time synced is not known, listing all partitions in /data/stream/mor/api_trade,FS :DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-398057260_1, ugi=root (auth:SIMPLE)]]
21/01/07 13:24:09 INFO HiveSyncTool: Storage partitions scan complete. Found 1
21/01/07 13:24:09 INFO HiveMetaStore: 0: get_partitions : db=ads tbl=api_trade_ro
21/01/07 13:24:09 INFO audit: ugi=root  ip=unknown-ip-addr  cmd=get_partitions : db=ads tbl=api_trade_ro    
21/01/07 13:24:10 INFO HiveSyncTool: New Partitions [dt=2021-01]
21/01/07 13:24:10 INFO HoodieHiveClient: Adding partitions 1 to table api_trade_ro
21/01/07 13:24:10 INFO HoodieHiveClient: Executing SQL ALTER TABLE `ads`.`api_trade_ro` ADD IF NOT EXISTS   PARTITION (`dt`='2021-01') LOCATION '/data/stream/mor/api_trade/dt=2021-01' 
21/01/07 13:24:33 INFO HiveSyncTool: Changed Partitions []
21/01/07 13:24:33 INFO HoodieHiveClient: No partitions to change for api_trade_ro
21/01/07 13:24:33 INFO HiveMetaStore: 0: get_table : db=ads tbl=api_trade_ro
21/01/07 13:24:33 INFO audit: ugi=root  ip=unknown-ip-addr  cmd=get_table : db=ads tbl=api_trade_ro 
21/01/07 13:24:33 ERROR HiveSyncTool: Got runtime exception when hive syncing
org.apache.hudi.hive.HoodieHiveSyncException: Failed to get update last commit time synced to 20210107132154
    at org.apache.hudi.hive.HoodieHiveClient.updateLastCommitTimeSynced(HoodieHiveClient.java:658)
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:128)
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:91)
    at org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:229)
    at org.apache.hudi.HoodieSparkSqlWriter$.checkWriteStatus(HoodieSparkSqlWriter.scala:279)
    at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:184)
    at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
    at com.chin.dmp.stream.mor.ApiTradeStream$$anonfun$1.apply(ApiTradeStream.scala:196)
    at com.chin.dmp.stream.mor.ApiTradeStream$$anonfun$1.apply(ApiTradeStream.scala:163)
    at org.apache.spark.sql.execution.streaming.sources.ForeachBatchSink.addBatch(ForeachBatchSink.scala:35)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5$$anonfun$apply$17.apply(MicroBatchExecution.scala:537)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5.apply(MicroBatchExecution.scala:535)
    at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
    at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:534)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
    at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
    at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166)
    at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:160)
    at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:281)
    at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:193)
Caused by: NoSuchObjectException(message:ads.api_trade_ro table not found)
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table_core(HiveMetaStore.java:1808)
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table(HiveMetaStore.java:1778)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
    at com.sun.proxy.$Proxy41.get_table(Unknown Source)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:1208)
    at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.getTable(SessionHiveMetaStoreClient.java:131)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
    at com.sun.proxy.$Proxy42.getTable(Unknown Source)
    at org.apache.hudi.hive.HoodieHiveClient.updateLastCommitTimeSynced(HoodieHiveClient.java:654)
    ... 48 more
21/01/07 13:24:33 INFO HiveMetaStore: 0: Shutting down the object store...
21/01/07 13:24:33 INFO audit: ugi=root  ip=unknown-ip-addr  cmd=Shutting down the object store...   
21/01/07 13:24:33 INFO HiveMetaStore: 0: Metastore shutdown complete.
21/01/07 13:24:33 INFO audit: ugi=root  ip=unknown-ip-addr  cmd=Metastore shutdown complete.    
21/01/07 13:24:33 INFO DefaultSource: Constructing hoodie (as parquet) data source with options :Map(hoodie.datasource.write.insert.drop.duplicates -> false, hoodie.datasource.hive_sync.database -> ads, hoodie.insert.shuffle.parallelism -> 10, path -> /data/stream/mor/api_trade, hoodie.datasource.write.precombine.field -> modified, hoodie.datasource.hive_sync.partition_fields -> dt, hoodie.datasource.write.payload.class -> org.apache.hudi.common.model.OverwriteWithLatestAvroPayload, hoodie.datasource.hive_sync.partition_extractor_class -> org.apache.hudi.hive.MultiPartKeysValueExtractor, hoodie.datasource.write.streaming.retry.interval.ms -> 2000, hoodie.datasource.hive_sync.table -> api_trade, hoodie.index.type -> GLOBAL_BLOOM, hoodie.datasource.write.streaming.ignore.failed.batch -> true, hoodie.datasource.write.operation -> upsert, hoodie.datasource.hive_sync.enable -> true, hoodie.datasource.write.recordkey.field -> id, hoodie.table.name -> api_trade, hoodie.datasource.hive_sync.jdbcurl -> jdbc:hive2://0.0.0.0:10000, hoodie.datasource.write.table.type -> MERGE_ON_READ, hoodie.datasource.write.hive_style_partitioning -> true, hoodie.datasource.query.type -> snapshot, hoodie.bloom.index.update.partition.path -
----------------------------------------------------------------------------------------------------------------------------------------------
bvaradar commented 3 years ago

@wosow : The _rt table sync happens after the _ro table, and I see a HiveMetaStore exception when updating the commit time on the _ro table saying that the table does not exist. This is odd, because a few log messages above show that the _ro table was registered. Somehow the _ro table is not visible to the HiveMetaStoreClient. I think it is likely that HiveServer and the Hive Metastore are not set up correctly and that more than one HiveMetastore instance is involved (one probably local).
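(One hedged way to check this; a minimal sketch, assuming it is run inside the same Spark session that performs the write, with the `ads` database from the options above:)

```scala
// Sketch: print the metastore URI the job resolves. If "hive.metastore.uris"
// is empty, the job may be using a local/embedded metastore rather than the
// shared one behind jdbc:hive2://0.0.0.0:10000, which would explain why the
// _ro table is not visible to the HiveMetaStoreClient.
println(spark.sparkContext.hadoopConfiguration.get("hive.metastore.uris"))

// List what this session can actually see in the target database.
spark.sql("SHOW TABLES IN ads").show(false)
```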

n3nash commented 3 years ago

@wosow Were you able to resolve your issue?

nsivabalan commented 3 years ago

@wosow : also, a few quick questions as we triage the issue.

wosow commented 3 years ago

@wosow : also, a few quick questions as we triage the issue.

  • Were you running an older version of Hudi and encountered this after an upgrade? In other words, did an older Hudi version run successfully while 0.7.0 has a bug?
  • Is this affecting your production? We are trying to gauge the severity.
  • Or are you trying out a POC, and this is your first time trying out Hudi?

There is no impact on the production environment; the problem only occurred while testing 0.6.0, and I have not tested 0.7.0. In addition, I have another question. I use Sqoop to import data from MySQL to HDFS, and then use Spark to read and write the Hudi table. The table type is MOR. If I want to use asynchronous compaction, what parameters need to be configured? Is asynchronous compaction automatic, or does it need manual intervention after it is enabled? If compaction must be triggered manually on a regular basis, what parameters need to be configured for manual compaction, and what are the commands to run it? Looking forward to your answer!

wosow commented 3 years ago

@wosow Were you able to resolve your issue?

no

n3nash commented 3 years ago

@wosow looks like this is not a real issue in production. For your questions on async compaction, have you taken a look at this blog -> https://hudi.apache.org/blog/async-compaction-deployment-model/ ? If your questions are still unanswered after reading this blog, please ping here and we will answer them
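(For what it's worth, the async-compaction switch for streaming writes already appears in the snippet at the top of this issue. Below is a minimal sketch of the relevant options; the config names are taken from that snippet, while `batchDF`, the table name, and the values are illustrative only, and the exact scheduling semantics per Hudi version and the non-streaming deployment options are covered in the blog linked above.)

```scala
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.{HoodieCompactionConfig, HoodieWriteConfig}

// Sketch: enable asynchronous compaction for a MERGE_ON_READ streaming write.
// With the async flag set, the streaming writer can schedule and run compaction
// in the background; for plain batch writes, the blog describes scheduling and
// running compaction out-of-band instead (e.g. via the Hudi CLI or a separate
// compaction job).
batchDF.write.format("org.apache.hudi")
  .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, "MERGE_ON_READ")
  .option("hoodie.datasource.compaction.async.enable", "true")                 // async compaction on
  .option(HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP, "10")  // compact after N delta commits
  .option(HoodieWriteConfig.TABLE_NAME, "user")                                // illustrative table name
  // ... record key / precombine / hive-sync options as in the original snippet ...
  .mode("append")
  .save("/data/mor/user")
```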

n3nash commented 3 years ago

@wosow Did you get a chance to read the blog? Please let us know if this issue is still valid.

n3nash commented 3 years ago

Closing this ticket due to inactivity. @wosow Please feel free to re-open if you need more information.

dude0001 commented 3 years ago

I'm seeing the same error, and the same symptom of only the _ro table is created. I am using this with AWS Glue (managed) ETL jobs and trying to sync the Hudi metadata to Glue Data Catalog. This happens the first time I run my job and it is trying to create the tables in the Glue Data Catalog. I suspect there is a permissions issue or there is schema evolution being detected that isn't supported with my setup. I was initially trying to use an MoR dataset when hitting this error. I've tried using CoW instead and hit the same error.

I'm getting an additional error: "java.lang.UnsupportedOperationException: Table rename is not supported".

I'm just trying a PoC but we are pretty hot on using this as it solves a lot of our problems nicely.

Environment Description

2021-07-30 10:06:08,704 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(70)): Error from Python:Traceback (most recent call last):
  File "/tmp/raw-to-staging.py", line 146, in <module>
    main()
  File "/tmp/raw-to-staging.py", line 137, in main
    .save("s3://myBucket/Staging/mySourceDB/mySchema/myTable/")
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 734, in save
    self._jwrite.save(path)
  File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o416.save.
: org.apache.hudi.exception.HoodieException: Got runtime exception when hive syncing ChargeActive
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:122)
    at org.apache.hudi.HoodieSparkSqlWriter$.org$apache$hudi$HoodieSparkSqlWriter$$syncHive(HoodieSparkSqlWriter.scala:391)
    at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$metaSync$2.apply(HoodieSparkSqlWriter.scala:440)
    at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$metaSync$2.apply(HoodieSparkSqlWriter.scala:436)
    at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
    at org.apache.hudi.HoodieSparkSqlWriter$.metaSync(HoodieSparkSqlWriter.scala:436)
    at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:497)
    at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:222)
    at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:145)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to get update last commit time synced to 20210730100502
    at org.apache.hudi.hive.HoodieHiveClient.updateLastCommitTimeSynced(HoodieHiveClient.java:496)
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:168)
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:112)
    ... 40 more
Caused by: java.lang.UnsupportedOperationException: Table rename is not supported
    at com.amazonaws.glue.catalog.metastore.GlueMetastoreClientDelegate.alterTable(GlueMetastoreClientDelegate.java:515)
    at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.alter_table(AWSCatalogMetastoreClient.java:400)
    at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.alter_table(AWSCatalogMetastoreClient.java:385)
    at org.apache.hudi.hive.HoodieHiveClient.updateLastCommitTimeSynced(HoodieHiveClient.java:494)
    ... 42 more
nsivabalan commented 3 years ago

@dude0001 : just to confirm, are you also facing the issue only with Spark streaming?

nsivabalan commented 3 years ago

@rmahindra123 : Do you mind taking a look at this? Let's sync up sometime today.

dude0001 commented 3 years ago

@nsivabalan that is one difference in my reproduction steps: I am currently not using a Spark Streaming job. I'm reading from our raw zone in S3, which contains Parquet files with change data capture events from transactional databases. I'm trying to upsert into our cleansed zone, also in S3, so that it contains the latest version of each row. If I turn off sync, it works fine otherwise.

nsivabalan commented 3 years ago

@dude0001 : Hey, if you don't mind, can you create a new GitHub issue? We do not want to pollute this issue, since yours is not Spark streaming. We can add a link to this issue calling it out as related. Also, since this is related to the Glue catalog, I can CC some AWS folks and ask them to help us out.

nsivabalan commented 3 years ago

Oh, by the way, COW does not have two tables; only MOR has two tables.

nsivabalan commented 3 years ago

@dude0001 : Did you open up a new GitHub issue? Usually the table rename exception happens if your table name in Hudi mismatches the one in Hive. Is there any case sensitivity that could be an issue? If your table name has capital letters, can you try all lower-case letters?
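(A hedged illustration of that suggestion; this is a sketch only, where `df`, the database name, and the lower-cased table name are placeholders. The idea is to keep `hoodie.table.name` and the hive-sync table name consistently lower-case so the Glue Data Catalog sync never sees a case-only mismatch and attempts a rename:)

```scala
// Sketch: use an all lower-case name for both the Hudi table and the
// hive/Glue sync target (e.g. "charge_active" instead of "ChargeActive"),
// so the sync does not hit the unsupported "table rename" path in Glue.
val tableName = "charge_active" // placeholder

df.write.format("org.apache.hudi")
  .option("hoodie.table.name", tableName)
  .option("hoodie.datasource.hive_sync.table", tableName)
  .option("hoodie.datasource.hive_sync.database", "my_database") // placeholder
  .option("hoodie.datasource.hive_sync.enable", "true")
  // ... other write options unchanged ...
  .mode("append")
  .save("s3://myBucket/Staging/mySourceDB/mySchema/myTable/")
```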

dude0001 commented 3 years ago

I apologize, I did not open a new issue as I've been in meetings and working on production stories. I got back to my POC this morning and proved your theory correct: renaming my table to all lower case resolved my issue! To your point, I'm not sure my rename exception is the same as the original issue, which may or may not be related to streaming. Would you still like me to open an issue for the table rename exception? One change might be to add a warning in Hudi for this scenario, or to add documentation somewhere to make syncing Hudi metadata to Hive in the AWS Glue Data Catalog a little less painful. I'm happy to do it, or just move on. Please let me know, and thank you again for the help!

nsivabalan commented 3 years ago

Thanks, we will add an FAQ shortly.