apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Issue with Hudi Hive Sync Tool with Hive MetaStore #10231

Closed: soumilshah1995 closed this issue 11 months ago

soumilshah1995 commented 1 year ago

Hello everyone, I'm running into what appears to be a configuration issue, and I'd appreciate any guidance in pinpointing the problem. This is for my upcoming videos, where I'm covering the Hudi Hive Sync tool in detail. I started the Spark Thrift Server with the following command:


spark-submit \
  --master 'local[*]' \
  --conf spark.executor.extraJavaOptions=-Duser.timezone=Etc/UTC \
  --conf spark.eventLog.enabled=false \
  --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 \
  --name "Thrift JDBC/ODBC Server" \
  --executor-memory 512m \
  --packages org.apache.spark:spark-hive_2.12:3.4.0

Additionally, I have Beeline installed and connected to the default database:

beeline -u jdbc:hive2://localhost:10000/default
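As a quick sanity check that the Thrift Server and its metastore respond, Beeline can run a single statement with -e (plain HiveQL, nothing Hudi-specific; the statement here is just an illustration):

beeline -u jdbc:hive2://localhost:10000/default -e 'SHOW TABLES;'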

The Delta Streamer run itself works fine, but I'm hitting issues when using it with the Hive MetaStore. Here's my spark-submit command for the Hudi Delta Streamer:

spark-submit \
    --class org.apache.hudi.utilities.streamer.HoodieStreamer \
    --packages 'org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.0,org.apache.hadoop:hadoop-aws:3.3.2' \
    --repositories 'https://repo.maven.apache.org/maven2' \
    --properties-file /Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E5/spark-config.properties \
    --master 'local[*]' \
    --executor-memory 1g \
    /Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E5/jar/hudi-utilities-slim-bundle_2.12-0.14.0.jar \
    --table-type COPY_ON_WRITE \
    --op UPSERT \
    --enable-hive-sync \
    --source-ordering-field ts \
    --source-class org.apache.hudi.utilities.sources.CsvDFSSource \
    --target-base-path file:///Users/soumilshah/Downloads/hudidb/ \
    --target-table orders \
    --props hudi_tbl.props

Hudi Configuration:

hoodie.datasource.write.recordkey.field=order_id
hoodie.datasource.write.partitionpath.field=order_date
hoodie.streamer.source.dfs.root=file:////Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E5/sampledata/orders
hoodie.datasource.write.precombine.field=ts

hoodie.deltastreamer.csv.header=true
hoodie.deltastreamer.csv.sep=\t

hoodie.datasource.hive_sync.enable=true
hoodie.datasource.hive_sync.mode=jdbc
hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://localhost:10000
hoodie.datasource.hive_sync.database=default
hoodie.datasource.hive_sync.table=orders
hoodie.datasource.hive_sync.partition_fields=order_date

Spark Conf:

spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog
spark.sql.hive.convertMetastoreParquet=false

The error I'm encountering is:

Required table missing : "VERSION" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
org.datanucleus.store.rdbms.exceptions.MissingTableException: Required table missing : "VERSION" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
    at org.datanucleus.store.rdbms.table.AbstractTable.exists(AbstractTable.java:606)
    at org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.performTablesValidation(RDBMSStoreManager.java:3385)

Any assistance in identifying what might be missing or misconfigured would be highly appreciated. Thank you!
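One workaround the error text itself hints at is letting DataNucleus create the missing schema tables on first contact. A minimal sketch, assuming the Thrift Server's default embedded metastore; the spark.hadoop. prefix is one way to pass the setting through to the Hadoop/Hive configuration the metastore reads:

# datanucleus.schema.autoCreateAll is Hive's umbrella setting covering the
# autoCreateTables hint in the error above (convenient for local experiments;
# the metastore schema scripts are the supported way to set up a real metastore)
spark-submit \
  --master 'local[*]' \
  --conf spark.hadoop.datanucleus.schema.autoCreateAll=true \
  --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 \
  --name "Thrift JDBC/ODBC Server" \
  --executor-memory 512m \
  --packages org.apache.spark:spark-hive_2.12:3.4.0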

ad1happy2go commented 1 year ago

@soumilshah1995 Have you configured the external metastore? We need to set up the Hive metastore tables.

ad1happy2go commented 1 year ago

You can find the Hive metastore schema scripts here: https://github.com/apache/hive/tree/master/metastore/scripts/upgrade
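For a local setup, Hive's schematool wraps those scripts, so the schema can be initialized without running the SQL by hand. A minimal sketch, assuming a Hive distribution is unpacked and HIVE_HOME points at it (pick the -dbType that matches the database backing the metastore):

# initialize the metastore schema (embedded Derby shown; mysql, postgres, etc. also work)
$HIVE_HOME/bin/schematool -dbType derby -initSchema
# verify the schema version afterwards
$HIVE_HOME/bin/schematool -dbType derby -info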

soumilshah1995 commented 1 year ago

@ad1happy2go "Could you assist me in setting up Derby Megastore? I believe I've made some progress, but I'm a bit confused about the steps involved. I would really appreciate it if we could connect on Slack and go through it together. This presents a great opportunity for me to learn and, in turn, share this knowledge with others."

ad1happy2go commented 1 year ago

Sure @soumilshah1995, let's connect tomorrow morning on this. Thanks.

soumilshah1995 commented 12 months ago

Connected with Aditya. I'll run some tests myself, and if I still can't solve the issue I'll reach out to him again. Also, I'm on PTO this week, so I might not get time to try things out.

ad1happy2go commented 12 months ago

@soumilshah1995 Thanks for the update. Let me know in case you need any help.

soumilshah1995 commented 12 months ago

Subject: Need Assistance with Apache Hudi Hive Sync Configuration Issue

I'm currently encountering an issue with Hive sync while using Apache Hudi with Apache Derby as the backend. I have Apache Derby running locally, and I'm seeking assistance in identifying any missing configurations in my setup.

Here are the relevant configuration files.

Spark Configuration (spark-config.properties):

spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog
spark.sql.hive.convertMetastoreParquet=false
#hive.metastore.uris=jdbc:derby://localhost:1527/MyDatabase;create=true
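A note on the commented-out line above: hive.metastore.uris expects the thrift:// endpoint of a running metastore service, not a JDBC URL; the database JDBC URL belongs under javax.jdo.option.ConnectionURL instead. A minimal sketch of the two alternatives (both endpoints are assumptions for a local setup):

# option 1: point Spark at a running Hive Metastore service
spark.hadoop.hive.metastore.uris=thrift://localhost:9083
# option 2: let Spark's embedded metastore talk to the Derby database directly
# spark.hadoop.javax.jdo.option.ConnectionURL=jdbc:derby://localhost:1527/MyDatabase;create=true
# spark.hadoop.javax.jdo.option.ConnectionDriverName=org.apache.derby.jdbc.ClientDriver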

Hudi Configuration:

hoodie.datasource.write.recordkey.field=order_id
hoodie.datasource.write.partitionpath.field=order_date
hoodie.streamer.source.dfs.root=file:////Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E5/sampledata/orders
hoodie.datasource.write.precombine.field=ts

hoodie.deltastreamer.csv.header=true
hoodie.deltastreamer.csv.sep=\t

hoodie.datasource.hive_sync.enable=true
hoodie.datasource.hive_sync.use_jdbc=false
hoodie.datasource.hive_sync.mode=hms
hoodie.datasource.hive_sync.jdbcurl=jdbc:derby://localhost:1527/MyDatabase;create=true
hoodie.datasource.hive_sync.database=MyDatabase
hoodie.datasource.hive_sync.hive.metastore.auto.create.all=true
hoodie.datasource.hive_sync.schema=MyDatabase
hoodie.datasource.hive_sync.table=orders
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
hoodie.datasource.hive_sync.partition_fields=order_date
hoodie.datasource.write.hive_style_partitioning=true
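For reference, in hms mode the sync tool talks to a running Hive Metastore service over thrift rather than over JDBC, so the endpoint it needs is a metastore URI, not a jdbcurl. A minimal sketch of an hms-mode block (the thrift URI is an assumption for a local metastore):

hoodie.datasource.hive_sync.enable=true
hoodie.datasource.hive_sync.mode=hms
hoodie.datasource.hive_sync.metastore.uris=thrift://localhost:9083
hoodie.datasource.hive_sync.database=default
hoodie.datasource.hive_sync.table=orders
hoodie.datasource.hive_sync.partition_fields=order_date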

Spark Submit Command:

spark-submit \
    --class org.apache.hudi.utilities.streamer.HoodieStreamer \
    --packages 'org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.0' \
    --properties-file /Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E5/spark-config.properties \
    --master 'local[*]' \
    --executor-memory 1g \
    /Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E5/jar/hudi-utilities-slim-bundle_2.12-0.14.0.jar \
    --table-type COPY_ON_WRITE \
    --op UPSERT \
    --enable-hive-sync \
    --source-ordering-field ts \
    --source-class org.apache.hudi.utilities.sources.CsvDFSSource \
    --target-base-path file:///Users/soumilshah/Downloads/hudidb/ \
    --target-table orders \
    --props hudi_tbl.props

Error Message

/12/07 13:40:37 WARN Query: Query for candidates of org.apache.hadoop.hive.metastore.model.MTableColumnStatistics and subclasses resulted in no possible candidates
Required table missing : "CDS" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
org.datanucleus.store.rdbms.exceptions.MissingTableException: Required table missing : "CDS" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
    at org.datanucleus.store.rdbms.table.AbstractTable.exists(AbstractTable.java:606)

Apache Derby itself works fine:

(Two screenshots from 2023-12-07 showing the Derby server running locally.)

Even after trying both Hive sync modes (HMS and JDBC), the issue persists. The process works fine without Hive sync. Could someone please assist me in identifying the problem or any missing configurations?

Thanks in advance for your help!

soumilshah1995 commented 12 months ago

Update: I read an old Slack post (https://apache-hudi.slack.com/archives/C4D716NPQ/p1699659668014099) where it was suggested to add the following to the properties file:

spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog
spark.sql.hive.convertMetastoreParquet=false

spark.hadoop.javax.jdo.option.ConnectionDriverName=org.apache.derby.jdbc.ClientDriver
# or, for embedded Derby
# spark.hadoop.javax.jdo.option.ConnectionDriverName=org.apache.derby.jdbc.EmbeddedDriver
spark.hadoop.javax.jdo.option.ConnectionURL=jdbc:derby://localhost:1527/MyDatabase?createDatabaseIfNotExist=true&allowPublicKeyRetrieval=true
datanucleus.schema.autoCreateTables=true
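One caveat with a network URL like jdbc:derby://host:port/db: the Derby client JDBC driver must be on the driver's classpath, or DriverManager fails with "No suitable driver found", which is exactly what the trace below shows. A sketch of one way to ship it with the job (standard Derby artifact coordinates; the version is an assumption):

spark-submit \
    --class org.apache.hudi.utilities.streamer.HoodieStreamer \
    --packages 'org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.0,org.apache.derby:derbyclient:10.14.2.0' \
    # ... remaining arguments as in the spark-submit command above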

That throws a different error:


Nested Throwables StackTrace:
java.sql.SQLException: Unable to open a test connection to the given database. JDBC url = jdbc:derby://0.0.0.0:3306/MyDatabase?createDatabaseIfNotExist=true&allowPublicKeyRetrieval=true, username = APP. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
java.sql.SQLException: No suitable driver found for jdbc:derby://0.0.0.0:3306/MyDatabase?createDatabaseIfNotExist=true&allowPublicKeyRetrieval=true
    at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:702)
    at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:189)
    at com.jolbox.bonecp.BoneCP.obtainRawInternalConnection(BoneCP.java:361)
    at com.jolbox.bonecp.BoneCP.<init>(BoneCP.java:416)
    at com.jolbox.bonecp.BoneCPDataSource.getConnection(BoneCPDataSource.java:120)
    at org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:483)
    at org.datanucleus.store.rdbms.RDBMSStoreManager.<init>(RDBMSStoreManager.java:297)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
    at org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:606)
    at org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301)
    at org.datanucleus.NucleusContextHelper.createStoreManagerForProperties(NucleusContextHelper.java:133)
    at org.datanucleus.PersistenceNucleusContextImpl.initialise(PersistenceNucleusContextImpl.java:422)
    at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:817)
    at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:334)
    at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:213)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at javax.jdo.JDOHelper$16.run(JDOHelper.java:1975)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at javax.jdo.JDOHelper.invoke(JDOHelper.java:1970)
    at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1177)
    at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:814)
    at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:702)
    at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:521)
    at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:550)
    at org.apache.hadoop.hive.metastore.ObjectStore.initializeHelper(ObjectStore.java:405)
    at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:342)
    at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:303)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:79)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:139)
    at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:58)
    at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:67)
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStoreForConf(HiveMetaStore.java:628)
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMSForConf(HiveMetaStore.java:594)
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:588)
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:659)
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:431)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:148)
    at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
    at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:79)
    at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:92)
    at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:6902)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:162)
    at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:70)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1740)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:83)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:133)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
    at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3607)
    at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3659)
    at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3639)
    at org.apache.hudi.hive.util.IMetaStoreClientUtil.getMSC(IMetaStoreClientUtil.java:40)
    at org.apache.hudi.hive.HoodieHiveSyncClient.<init>(HoodieHiveSyncClient.java:89)
    at org.apache.hudi.hive.HiveSyncTool.initSyncClient(HiveSyncTool.java:123)
    at org.apache.hudi.hive.HiveSyncTool.<init>(HiveSyncTool.java:117)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
    at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:79)
    at org.apache.hudi.sync.common.util.SyncUtilHelpers.instantiateMetaSyncTool(SyncUtilHelpers.java:108)
    at org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:78)
    at org.apache.hudi.utilities.streamer.StreamSync.runMetaSync(StreamSync.java:938)
    at org.apache.hudi.utilities.streamer.StreamSync.writeToSink(StreamSync.java:851)
    at org.apache.hudi.utilities.streamer.StreamSync.syncOnce(StreamSync.java:446)
    at org.apache.hudi.utilities.streamer.HoodieStreamer$StreamSyncService.ingestOnce(HoodieStreamer.java:840)
    at org.apache.hudi.utilities.ingestion.HoodieIngestionService.startIngestion(HoodieIngestionService.java:72)
    at org.apache.hudi.common.util.Option.ifPresent(Option.java:97)
    at org.apache.hudi.utilities.streamer.HoodieStreamer.sync(HoodieStreamer.java:205)
    at org.apache.hudi.utilities.streamer.HoodieStreamer.main(HoodieStreamer.java:584)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1020)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1111)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1120)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
------

    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
    at com.jolbox.bonecp.PoolUtil.generateSQLException(PoolUtil.java:192)
    at com.jolbox.bonecp.BoneCP.<init>(BoneCP.java:422)
    at com.jolbox.bonecp.BoneCPDataSource.getConnection(BoneCPDataSource.java:120)
    at org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:483)
    at org.datanucleus.store.rdbms.RDBMSStoreManager.<init>(RDBMSStoreManager.java:297)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
    at org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:606)
    at org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301)
    at org.datanucleus.NucleusContextHelper.createStoreManagerForProperties(NucleusContextHelper.java:133)