Closed Arun-kc closed 2 years ago
@Arun-kc This looks like a connection problem; please check hoodie.datasource.hive_sync.jdbcurl, it appears to still be set to its default value.
@Carl-Zhou-CN The following are the hudi options I'm using as of now.
hudiOptions = {
"hoodie.table.name": "my_hudi_table",
"hoodie.datasource.write.recordkey.field": "id",
"hoodie.datasource.write.partitionpath.field": "creation_date",
"hoodie.datasource.write.precombine.field": "last_update_time",
"hoodie.datasource.hive_sync.enable": "true",
"hoodie.datasource.hive_sync.table": "my_hudi_table",
"hoodie.datasource.hive_sync.partition_fields": "creation_date",
"hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
"hoodie.index.type": "GLOBAL_BLOOM", # This is required if we want to ensure we upsert a record, even if the partition changes
"hoodie.bloom.index.update.partition.path": "true", # This is required to write the data into the new partition (defaults to false in 0.8.0, true in 0.9.0)
}
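For reference, here is a small sketch of how the options above might be assembled and handed to a Spark writer. The helper function, the `df` DataFrame, and the S3 path are assumptions for illustration; the option keys and values come straight from the snippet above, and the actual write is commented out because it needs a live Glue/Spark session.

```python
# Sketch: assemble the Hudi write options shown above. Plain Python, no
# Spark required until the (commented) write at the bottom.

def build_hudi_options(table, record_key, partition_field, precombine_field,
                       hive_sync=True):
    """Build the option map from the snippet above (hypothetical helper)."""
    opts = {
        "hoodie.table.name": table,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
        # Required to upsert a record even if its partition changes
        "hoodie.index.type": "GLOBAL_BLOOM",
        # Required to write the record into its new partition
        # (defaults to false in 0.8.0, true in 0.9.0)
        "hoodie.bloom.index.update.partition.path": "true",
    }
    if hive_sync:
        opts.update({
            "hoodie.datasource.hive_sync.enable": "true",
            "hoodie.datasource.hive_sync.table": table,
            "hoodie.datasource.hive_sync.partition_fields": partition_field,
            "hoodie.datasource.hive_sync.partition_extractor_class":
                "org.apache.hudi.hive.MultiPartKeysValueExtractor",
        })
    return opts

hudiOptions = build_hudi_options(
    "my_hudi_table", "id", "creation_date", "last_update_time")

# With a live Glue/Spark session (df and path are placeholders):
# (df.write.format("hudi")
#    .options(**hudiOptions)
#    .mode("append")
#    .save("s3://<BUCKET>/tmp/myhudidataset_001"))
```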
As for hoodie.datasource.hive_sync.jdbcurl, I'm not using any Hive at the moment, so what URL should I specify?
I'm doing this in AWS Glue and using a hudi connector.
"hoodie.datasource.hive_sync.enable": "true",
"hoodie.datasource.hive_sync.table": "my_hudi_table",
"hoodie.datasource.hive_sync.partition_fields": "creation_date",
"hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
@Arun-kc If you do not register your Hudi dataset as a table in the Hive metastore, these options are not required.
Because of your hudi version, you may need to update the partitions manually after writing: ALTER TABLE table_name RECOVER PARTITIONS;
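If you want to run that repair from code rather than a console, one hedged sketch (database, table, and output-bucket names are placeholders, and the boto3 submission is an assumption about your setup; note that EMR Hive/Spark accept RECOVER PARTITIONS while Athena uses MSCK REPAIR TABLE):

```python
# Sketch: build the partition-repair statement for the engine at hand
# and optionally submit it to Athena with boto3.

def repair_statement(table, engine="hive"):
    """Hive/Spark on EMR accept RECOVER PARTITIONS; Athena uses MSCK REPAIR."""
    if engine == "athena":
        return f"MSCK REPAIR TABLE {table}"
    return f"ALTER TABLE {table} RECOVER PARTITIONS"

def run_in_athena(table, database, output_s3):
    import boto3  # imported lazily so the helper above needs no AWS deps
    athena = boto3.client("athena")
    return athena.start_query_execution(
        QueryString=repair_statement(table, engine="athena"),
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )["QueryExecutionId"]
```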
I do not know how to fix this for Glue because it hides all the cluster nodes from the user, but I do know how to fix this error for EMR. The source article is https://aws.amazon.com/ru/blogs/big-data/apply-record-level-changes-from-relational-databases-to-amazon-s3-data-lake-using-apache-hudi-on-amazon-emr-and-aws-database-migration-service/
See the error:
py4j.protocol.Py4JJavaError: An error occurred while calling o116.save.
: org.apache.hudi.hive.HoodieHiveSyncException: Cannot create hive connection jdbc:hive2://localhost:10000/
at org.apache.hudi.hive.HoodieHiveClient.createHiveConnection(HoodieHiveClient.java:553)
It means that every node in the cluster tries to connect to localhost, i.e. itself, and fails.
The solution for EMR
Call ListInstances with the EMR ClusterId and InstanceGroupTypes MASTER, then grab PrivateIpAddress (the JSON path is $.Instances[0].PrivateIpAddress). Pass this as a hudi config parameter:
--hoodie-conf hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://111.111.111.111:10000
With this, all cluster nodes will connect to the master node and sync the table.
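The lookup described above can be sketched with boto3 like this (the cluster id is a placeholder; the JDBC URL shape follows the example config line):

```python
# Sketch: resolve the EMR master node's private IP and turn it into the
# hive_sync JDBC URL.

def jdbc_url(ip, port=10000):
    """Format the hive_sync JDBC URL from an IP address."""
    return f"jdbc:hive2://{ip}:{port}"

def master_jdbc_url(cluster_id, port=10000):
    import boto3  # lazy import; the URL helper above needs no AWS deps
    emr = boto3.client("emr")
    resp = emr.list_instances(ClusterId=cluster_id,
                              InstanceGroupTypes=["MASTER"])
    # JSON path $.Instances[0].PrivateIpAddress, as described above
    return jdbc_url(resp["Instances"][0]["PrivateIpAddress"], port)

# The result goes into:
#   --hoodie-conf hoodie.datasource.hive_sync.jdbcurl=<url>
```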
A couple of notes:
1) I used hudi version 0.7 from Amazon, and hive/glue catalog sync worked without any problems. But when I moved to 0.9.0 I saw no new partitions; I only changed the version, nothing else. Another application on 0.9.0 also needed the IP-address workaround.
2) I cannot say how my fixes apply to a Glue job, sorry. Try contacting AWS support and telling them you need the master node's IP address before submitting a job. One idea is to run some code that fetches the IP address and adds the hudi config programmatically, but is it even possible to access a Glue job's master node IP? I do not know. :-(
Another option: set hoodie.datasource.hive_sync.use_jdbc -> false, so the sync does not connect to the metastore through JDBC. This might help.
@Carl-Zhou-CN
I tried ALTER TABLE table_name RECOVER PARTITIONS;, but it's not working.
I also tried setting hoodie.datasource.hive_sync.use_jdbc -> false, but to no avail.
@nikita-sheremet-clearscale Yes, I'm using Glue in this scenario. I'm using a hudi connector that I subscribed to when the version was 0.5.1. The marketplace now shows version 0.9.0; I'm not sure whether the subscribed version gets updated automatically.
I will check on the IP part and will let you know.
Just to let you know, I'm creating the hudi table manually in Athena using the following DDL
CREATE EXTERNAL TABLE `my_hudi_table`(
`_hoodie_commit_time` string,
`_hoodie_commit_seqno` string,
`_hoodie_record_key` string,
`_hoodie_partition_path` string,
`_hoodie_file_name` string,
`id` string,
`last_update_time` string)
PARTITIONED BY (
`creation_date` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://<BUCKET>/tmp/myhudidataset_001'
@Arun-kc Sorry, it seems I misunderstood. What needs to be done should be ALTER TABLE ADD PARTITION
@Carl-Zhou-CN It's ok.
I have tried ALTER TABLE ADD PARTITION before, and it does work. But we have to specify the partitions manually, and when there are many partitions that is not a viable solution unless we can automate it. I could write a script to do this with boto3; that's doable.
What I was trying to do is letting the Hudi system do this on its own so that in Athena we can query the partitions directly without running any other queries. Is it possible?
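The boto3 automation mentioned above could be sketched roughly as follows. Everything here is an assumption for illustration: the table/field names echo the DDL in this thread, and the `<prefix>/<value>` partition directory layout depends on how hudi wrote the partition paths, so adjust to what is actually on S3.

```python
# Sketch: generate one batched ADD PARTITION statement for a set of
# partition values, then (optionally) submit it to Athena with boto3.

def add_partition_ddl(table, partition_field, values, location_prefix):
    """Build ALTER TABLE ... ADD IF NOT EXISTS with one PARTITION per value."""
    parts = " ".join(
        f"PARTITION ({partition_field}='{v}') "
        f"LOCATION '{location_prefix}/{v}'"
        for v in values)
    return f"ALTER TABLE {table} ADD IF NOT EXISTS {parts}"

ddl = add_partition_ddl(
    "my_hudi_table", "creation_date",
    ["2021-11-01", "2021-11-02"],
    "s3://<BUCKET>/tmp/myhudidataset_001")

# def submit(ddl, database, output_s3):
#     import boto3
#     athena = boto3.client("athena")
#     athena.start_query_execution(
#         QueryString=ddl,
#         QueryExecutionContext={"Database": database},
#         ResultConfiguration={"OutputLocation": output_s3})
```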
I think it is possible, but I am not familiar with Athena. I think that as long as Hudi can interact with Glue Catalog, your problem should be solved. You may need to ask others to help. @nsivabalan Do you have time to help?
Hi @Carl-Zhou-CN and @nikita-sheremet-clearscale
I tried the same with hudi connector version 0.9.0, and it's working fine now. The partition is getting reflected in Athena.
It seems the problem was with hudi connector version 0.5.1.
Thanks for the help both of you 🙌 I'm closing this issue
thanks for the update.
@Arun-kc can you update what config settings ended up working with Glue and hudi version 0.9.0 please?
CC @bhasudha @rajkalluri: for doc updates, if any, w.r.t. version compatibilities.
Describe the problem you faced
Partitioned data is not getting reflected in AWS Glue catalog (Athena table)
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Partition values should be reflected in Glue catalog in Athena
Environment Description
Hudi version : 0.5.1
Spark version : 2.4
Hive version : NA
Hadoop version : NA
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no
Additional context
Trying to update partition values as mentioned in this article by @dacort
Athena table DDL is as follows
Stacktrace