Closed. rakeshramakrishnan closed this issue 2 years ago.
@satishkotha: Can you help with this?
@rakeshramakrishnan From the logs, I do see that the table default.hive_hudi_sync is created correctly and available in the catalog:
25064 [Thread-5] INFO org.apache.hudi.hive.HoodieHiveClient - Time taken to execute [CREATE EXTERNAL TABLE IF NOT EXISTS default.hive_hudi_sync ( _hoodie_commit_time string, _hoodie_commit_seqno string, _hoodie_record_key string, _hoodie_partition_path string, _hoodie_file_name string, begin_lat double, begin_lon double, driver string, end_lat double, end_lon double, fare double, rider string, ts double, uuid string) PARTITIONED BY (partitionpath string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 'file:/tmp/hive_hudi_sync']: 317 ms
spark.catalog.listTables() ------> Table(name='hive_hudi_sync', ...
Why do you think the table is not available? I don't see any other errors in the logs you shared.
@satishkotha: I can see the table created in the local spark catalog's hive database, but not in the remote hive metastore. There are no error logs.
@bvaradar @satishkotha: Will the PR #2449 address this issue? However, the PR seems to be for the standalone hive sync tool. Or does the hive sync within the hudi write use the same module?
Regarding PR #2449: I'm making adjustments; wait until it is fully ready, then you can try it.
@rakeshramakrishnan Could you try the above patch from @Trevor-zhang and see if that fixes your issue?
@n3nash The PR #2449 is closed now. Is there any other PR that tracks this issue?
@rakeshramakrishnan: it would be nice if you could respond with any recent updates.
@rakeshramakrishnan: if I'm not wrong, hive sync with a metastore has been working with hudi (anecdotally, from the community). So it may be some jar mismatch issue. Even without the aforementioned patch (#2449), it was working before; 2449 just adds explicit configs. Prior to this, hive sync uses properties from the Hadoop conf; that's the only difference. As Satish mentioned, we don't see any errors in the log attached.
Can you get us a full stack trace if possible?
@nsivabalan: There are no errors; however, through hudi, the connection is made to the local hive metastore (from spark). It doesn't connect to the external hive metastore.
But without hudi, the spark catalog fetches hive tables from the external metastore:
from pyspark.sql import SparkSession

# metastore_uri is the thrift URI of the external hive metastore
spark = SparkSession.builder \
    .appName("test-hudi-hive-sync") \
    .enableHiveSupport() \
    .config("hive.metastore.uris", metastore_uri) \
    .getOrCreate()
print("Before {}".format(spark.catalog.listTables())) ------> returns tables from `metastore_uri`
@rakeshramakrishnan For hive sync to work inline through Hudi, the hive-site.xml with the metastore configs needs to be available on the classpath of the Spark job.
I tried to reproduce with a remote MySQL database as the metastore. My jdbc-specific configs in hive-site.xml look as follows:
"javax.jdo.option.ConnectionURL": "jdbc:mysql://hostname:3306/hive?createDatabaseIfNotExist=true",
"javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
"javax.jdo.option.ConnectionUserName": "username",
"javax.jdo.option.ConnectionPassword": "password"
Then the following pyspark script works:
pyspark \
> --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
> --conf "spark.sql.hive.convertMetastoreParquet=false" \
> --jars /home/hadoop/hudi-spark3-bundle_2.12-0.10.0-SNAPSHOT.jar,/usr/lib/spark/external/lib/spark-avro.jar
...
...
Using Python version 3.7.9 (default, Aug 27 2020 21:59:41)
SparkSession available as 'spark'.
>>> from pyspark.sql import functions as F
>>>
>>> inputDF = spark.createDataFrame([
... ("100", "2015/01/01", "2015-01-01T13:51:39.340396Z"),
... ("101", "2015/01/01", "2015-01-01T12:14:58.597216Z"),
... ("102", "2015/01/01", "2015-01-01T13:51:40.417052Z"),
... ("103", "2015/01/01", "2015-01-01T13:51:40.519832Z"),
... ("104", "2015/01/02", "2015-01-01T12:15:00.512679Z"),
... ("105", "2015/01/02", "2015-01-01T13:51:42.248818Z")],
... ["id", "creation_date", "last_update_time"])
>>>
>>> hudiOptions = {
... "hoodie.table.name" : "hudi_hive_table",
... "hoodie.datasource.write.table.type" : "COPY_ON_WRITE",
... "hoodie.datasource.write.operation" : "insert",
... "hoodie.datasource.write.recordkey.field" : "id",
... "hoodie.datasource.write.partitionpath.field" : "creation_date",
... "hoodie.datasource.write.precombine.field" : "last_update_time",
... "hoodie.datasource.hive_sync.enable" : "true",
... "hoodie.datasource.hive_sync.table" : "hudi_hive_table",
... "hoodie.datasource.hive_sync.partition_fields" : "creation_date"
... }
>>>
>>> inputDF.write.format("org.apache.hudi").options(**hudiOptions).mode("overwrite").save("s3://huditestbkt/hive_sync/")
21/09/29 10:22:08 WARN HoodieSparkSqlWriter$: hoodie table at s3://huditestbkt/hive_sync already exists. Deleting existing data & overwriting with new data.
21/09/29 10:22:34 WARN HiveConf: HiveConf of name hive.server2.thrift.url does not exist
>>>
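To double-check that the sync reached the remote metastore rather than a local one, the table can be listed back through the Hive catalog from the same (or a fresh) session pointed at that metastore; a small sketch assuming the table name used above:

# Sketch: confirm the synced table is visible through the Hive catalog.
spark.sql("show tables in default").show(truncate=False)
print([t.name for t in spark.catalog.listTables("default")])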
Had the same issue, using Scala, Spark and DataSourceWriteOptions.HIVE_SYNC_MODE.key() -> "hms". Adding a hive-site.xml with the URL to a src/main/resources folder fixed it for me.
If this is intended, maybe it should be added to the documentation? It feels a bit weird that you specify a URL with DataSourceWriteOptions.HIVE_URL but it has no effect?
@matthiasdg Could you help me?
I don't understand your solution; I don't recognize the src/main/resources path.
@rubenssoto You have to make sure the hive-site.xml can be found on the classpath. For java, scala projects you typically use resources folders for that. Not sure what/how your project is...
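For a PySpark job the same requirement applies: hive-site.xml has to be resolvable by the driver JVM (for example under $SPARK_HOME/conf). One way to check this from Python, going through the py4j gateway (an internal but commonly used handle), might be:

# Sketch: ask the driver JVM whether hive-site.xml is visible on its classpath.
loader = spark.sparkContext._jvm.java.lang.Thread.currentThread().getContextClassLoader()
print("hive-site.xml on classpath:", loader.getResource("hive-site.xml"))  # None means it is not picked up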
@matthiasdg
It is a python project and my hive-site.xml is on the spark classpath.
But I keep receiving this error from hudi:
Required table missing : "DBS" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
org.datanucleus.store.rdbms.exceptions.MissingTableException: Required table missing : "DBS" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
at org.datanucleus.store.rdbms.table.AbstractTable.exists(AbstractTable.java:606)
But this table exists in my metastore database.
@rubenssoto: can you confirm that all connection configs are intact in your setup?
These are the ones that worked for Sagar:
"javax.jdo.option.ConnectionURL": "jdbc:mysql://hostname:3306/hive?createDatabaseIfNotExist=true", "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver", "javax.jdo.option.ConnectionUserName": "username", "javax.jdo.option.ConnectionPassword": "password"
Alternatively, you can also try "hms" mode instead of jdbc. I will let @codope follow up from here.
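For reference, the "hms" suggestion translates to write options along these lines in pyspark; this is a sketch only, since the metastore.uris key is only available in newer Hudi releases and the thrift URI below is a placeholder:

# Sketch: sync through the Hive metastore client ("hms") instead of JDBC.
hudi_hms_options = {
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.mode": "hms",
    # Assumption: this key exists in newer Hudi versions; otherwise the URI must come from hive-site.xml.
    "hoodie.datasource.hive_sync.metastore.uris": "thrift://metastore-host:9083",
    "hoodie.datasource.hive_sync.database": "default",
    "hoodie.datasource.hive_sync.table": "hudi_hive_table",
    "hoodie.datasource.hive_sync.partition_fields": "creation_date",
}

These can be merged into the hudiOptions dict from the example above before the write.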
Will go ahead and close this one out as we have a solution proposed. Feel free to re-open if you are still encountering issues.
Describe the problem you faced
Unable to sync to an external hive metastore via the thrift protocol. Instead, the sync seems to happen with the local hive store.

To Reproduce
Run a pyspark file as below, which connects to the external hive metastore via hive.metastore.uris using the thrift protocol and prints the existing tables with spark.catalog.listTables(), to show that the existing setup is able to connect to the metastore without any issues (HiveMetastoreConnection version 1.2.1 using Spark classes). I have tried connecting to the hive metastore using spark 3.0.1 and hive 2.3.7 jars and was able to list the tables in the external metastore. However, I was unable to use it with hudi 0.6.0, and hence used spark 2.4.7 for the below example.

Expected behavior
hive_hudi_sync to show up in the external hive metastore after hive sync.

Environment Description

Additional context
Have attached the run logs. Logs from org.apache.spark have been removed because they were adding to the noise. If I need to attach them, do let me know.