Kuntal-G opened this issue 6 years ago
Thank you for identifying the issue. I reproduced the problem. It looks like there is a bug here: https://github.com/awslabs/aws-glue-samples/blob/fd8cab884e6f636be37f677cbfd7db7f6e9fc6ab/utilities/Hive_metastore_migration/src/hive_metastore_migration.py#L809
It parses partition keys and values from Glue into Hive partition names. The generated partName is like "year(string),month(string)=2015,02" whereas Hive expects something like "year=2015/month=02".
I found that "year(string),month(string)=2015,02" actually works with a "DESCRIBE test" query on Hive 1.0.0, so the author might have written the code against that version of Hive. But "year=2015/month=02" is the standard format, so I'll push a bug fix for it.
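For context, the standard conversion from partition keys and values to a Hive partition name, including an approximation of the character escaping Hive performs, can be sketched as below. The exact escape set is an assumption here; see Hive's FileUtils for the authoritative list.

```python
# Rough sketch of how Hive builds partition names (approximating
# FileUtils.makePartName): each key=value pair is percent-escaped and the
# pairs are joined with '/'. The 'special' character set is an assumption.
def escape(s, special='"#%\'*/:=?\\'):
    return ''.join('%%%02X' % ord(c) if c in special or ord(c) < 32 else c
                   for c in s)

def make_part_name(keys, vals):
    return '/'.join('%s=%s' % (escape(k), escape(v)) for k, v in zip(keys, vals))

print(make_part_name(['year', 'month'], ['2015', '02']))  # year=2015/month=02
```

A value containing a reserved character such as '/' would be escaped, e.g. 'a/b' becomes 'a%2Fb', so the partition name stays unambiguous.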
Hi @Kuntal-G Thanks for replicating and posting the issue on our behalf.
Hi @dichenli ,
Thanks for picking this up. We are somewhat blocked right now. Can you provide an ETA for this?
Really appreciate your help on this.
Thanks, Niloy
I apologize that I can't provide an ETA yet. A thorough fix may take some time to be pushed to GitHub, but to unblock yourself immediately, you can replace the problematic function with the quick-fix snippet below. It should work most of the time.
@staticmethod
def udf_partition_name_from_keys_vals(table_name, keys, vals):
    if not keys:
        logging.error('Glue table is missing partition keys')  # TODO add table name in error msg
        return ''  # TODO should this raise an error, or filter out the problematic partitions?
    if not vals:
        logging.error('Glue table is missing partition values')
        return ''
    if len(keys) != len(vals):
        logging.error('Glue table has a different number of partition keys in the table and values in the partition')
        return ''
    s_keys = []
    for k, v in zip(keys, vals):
        s_keys.append('%s=%s' % (k['name'], v))
    return '/'.join(s_keys)  # TODO escape chars in keys and values, see https://github.com/apache/hive/blob/master/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/utils/FileUtils.java#L256
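To sanity-check the fix, a standalone version of the function can be run as below. This is a sketch mirroring the snippet above; the key dicts with a 'name' field are an assumption modeled on Glue's partition-key schema.

```python
import logging

# Standalone version of the quick-fix helper (no class/staticmethod wrapper)
def udf_partition_name_from_keys_vals(table_name, keys, vals):
    if not keys:
        logging.error('Glue table is missing partition keys')
        return ''
    if not vals:
        logging.error('Glue table is missing partition values')
        return ''
    if len(keys) != len(vals):
        logging.error('Glue table has a different number of partition keys and values')
        return ''
    # Build the standard Hive partition name: key1=val1/key2=val2/...
    return '/'.join('%s=%s' % (k['name'], v) for k, v in zip(keys, vals))

keys = [{'name': 'year', 'type': 'string'}, {'name': 'month', 'type': 'string'}]
vals = ['2015', '02']
print(udf_partition_name_from_keys_vals('test', keys, vals))  # year=2015/month=02
```

This produces "year=2015/month=02" rather than the broken "year(string),month(string)=2015,02" form, and returns an empty string on mismatched or missing keys/values.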
Got the same problem. Any update, @dichenli?
Thanks
I'm running the script to migrate Glue catalog data (crawled from Hive-style key=value partitioned data on S3) to a Hive metastore (MySQL), and the partitions created in the Hive metastore are incorrect. [+] https://github.com/awslabs/aws-glue-samples/tree/master/utilities/Hive_metastore_migration
Note: The Glue catalog data crawled from the partitioned S3 data looks fine: launching a new EMR cluster with the Glue catalog works and the partition information is correct, and Athena using Glue is able to find the table's partitions properly. But the script that migrates table information from the Glue catalog to the metastore gets it wrong, creating totally incorrect partition information in the Hive metastore.
Please find the steps carried out:
1) S3 path from which the crawler was executed (you can see the data is in the proper YYYY=value/mm=value layout)
2) Once the crawler was done, I launched a new EMR cluster pointing to the Glue Catalog. After that I executed the following commands on the cluster (Hive & Spark), and both show the partitions in the proper format.
3) Kindly note that the catalog-to-metastore migration script (export_from_datacatalog.py) will not work, failing with the following key-constraint error:
"duplicate entry for key 'UNIQUE_DATABASE' ..... java.sql.BatchUpdateException: Field 'IS_REWRITE_ENABLED' doesn't have a default value"
I found that the column 'IS_REWRITE_ENABLED' is in the table hive.TBLS. Strangely, this column is nullable in the table definition, yet the Spark job complains about a missing default value. So I manually logged in to my Hive metastore and set a default value: ALTER TABLE hive.TBLS ALTER IS_REWRITE_ENABLED SET DEFAULT 1;
After this small change, the Glue ETL job completed successfully. But the partitions generated by the script are totally incorrect.
Partition messed up
Although the table description is the same.
So it's clearly an issue with the migration script, and I'm stuck in our migration process.
Kindly look into this issue urgently and fix the script, or provide me a workaround or solution.