aws-samples / aws-glue-samples

AWS Glue code samples
MIT No Attribution

Glue catalog to Hive Metastore Migration script not working with partition table #15

Open Kuntal-G opened 6 years ago

Kuntal-G commented 6 years ago

I'm running the script to migrate Glue catalog data, crawled from Hive-style (key=value) partitioned data in S3, to a Hive metastore (MySQL). The partitions that get created in the Hive metastore are incorrect. https://github.com/awslabs/aws-glue-samples/tree/master/utilities/Hive_metastore_migration

Note: The Glue catalog data crawled from the partitioned S3 path looks fine: launching a new EMR cluster against the Glue catalog works and the partition information is correct, and Athena is also able to find the table's partitions properly. But the script that migrates table information from the Glue catalog to the metastore garbles it, creating totally wrong partition information in the Hive metastore.

Please find the steps carried out:

1) S3 path on which the crawler was executed (you can see the data is laid out in the proper year=value/month=value structure):

a0999b1381a5:~ kuntalg$ aws s3 ls --recursive s3://kg-practice/elb_logging/test
2018-02-06 15:54:06          0 elb_logging/test/
2018-02-06 15:54:35          0 elb_logging/test/year=2015/
2018-02-06 15:55:00          0 elb_logging/test/year=2015/month=01/
2018-02-06 16:31:43         22 elb_logging/test/year=2015/month=01/test2.csv
2018-02-06 15:55:08          0 elb_logging/test/year=2015/month=02/
2018-02-06 16:31:57         22 elb_logging/test/year=2015/month=02/test3.csv

2) Once the crawler finished, I launched a new EMR cluster pointed at the Glue Catalog. I then executed the following commands on the cluster (Hive & Spark), and they show the partitions in the proper format:

scala> spark.sql("SHOW PARTITIONS test").show(30,false)
18/02/06 16:27:59 WARN CredentialsLegacyConfigLocationProvider: Found the legacy config profiles file at [/home/hadoop/.aws/config]. Please move it to the latest default location [~/.aws/credentials].
+------------------+
|partition         |
+------------------+
|year=2015/month=01|
|year=2015/month=02|
+------------------+
hive> show partitions test;
OK
year=2015/month=02
year=2015/month=01
Time taken: 0.54 seconds, Fetched: 2 row(s)
hive> describe test;
OK
col0                    bigint                                      
col1                    string                                      
year                    string                                      
month                   string                                      

# Partition Information      
# col_name              data_type               comment             

year                    string                                      
month                   string                                      
Time taken: 0.651 seconds, Fetched: 10 row(s)
hive> select * from test;
OK
1   Monty   2015    02
2   Trish   2015    02
5   Lisa    2015    02
1   kuntal  2015    01
2   Rock    2015    01
3   Cena    2015    01
Time taken: 3.047 seconds, Fetched: 6 row(s)

hive> select * from test where year='2015' and month='02';
OK
1   Monty   2015    02
2   Trish   2015    02
5   Lisa    2015    02
Time taken: 0.971 seconds, Fetched: 3 row(s)

3) Kindly note that the catalog-to-metastore migration script (export_from_datacatalog.py) initially fails with the following key constraint error:

"duplicate entry for key 'UNIQUE_DATABASE' ..... java.sql.BatchUpdateException: Field 'IS_REWRITE_ENABLED' doesn't have a default value"

I found that the column 'IS_REWRITE_ENABLED' is in table hive.TBLS. Strangely, this column is nullable in the table definition, yet the Spark job still complains about a missing default value. So I manually logged in to my Hive metastore and set a default value: ALTER TABLE hive.TBLS ALTER IS_REWRITE_ENABLED SET DEFAULT 1;

After this small change, the Glue ETL job completed successfully. But the partition generated by the script is totally incorrect.

The partitions are messed up:

hive> show partitions test;
OK
year(string),month(string)=2015,01
year(string),month(string)=2015,02
Time taken: 0.176 seconds, Fetched: 2 row(s)

Although the table description is the same:

hive> describe test;
OK
col0                    bigint                                      
col1                    string                                      
year                    string                                      
month                   string                                      

# Partition Information      
# col_name              data_type               comment             

year                    string                                      
month                   string                                      
Time taken: 0.492 seconds, Fetched: 10 row(s)

So it's clearly an issue with the migration script, and our migration process is blocked on it.

Kindly look into the issue urgently and fix the script, or provide me a workaround or solution.

dichenli commented 6 years ago

Thank you for identifying the issue. I repeated the problem. It looks like there is a bug here: https://github.com/awslabs/aws-glue-samples/blob/fd8cab884e6f636be37f677cbfd7db7f6e9fc6ab/utilities/Hive_metastore_migration/src/hive_metastore_migration.py#L809

It parses partition keys and values from Glue into Hive partition names. The generated partName is like "year(string),month(string)=2015,02" whereas Hive expects something like "year=2015/month=02".

I found that "year(string),month(string)=2015,02" actually works with a "DESCRIBE test" query on Hive 1.0.0, so the author might have written the code against that version of Hive. But "year=2015/month=02" is the standard format, so I'll push a bug fix for it.
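To make the difference concrete, here is a minimal Python sketch (variable names are illustrative, not the script's actual internals) contrasting the malformed name the script produces with the name Hive expects:

```python
# Partition keys roughly as Glue represents them (list of dicts),
# plus the values for one concrete partition.
keys = [{'name': 'year', 'type': 'string'}, {'name': 'month', 'type': 'string'}]
vals = ['2015', '02']

# Roughly the shape the buggy code builds: all keys joined, then all values joined.
buggy = '%s=%s' % (','.join('%s(%s)' % (k['name'], k['type']) for k in keys),
                   ','.join(vals))
print(buggy)     # year(string),month(string)=2015,02

# The standard Hive partition name: one key=value pair per level, joined by '/'.
expected = '/'.join('%s=%s' % (k['name'], v) for k, v in zip(keys, vals))
print(expected)  # year=2015/month=02
```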

ultrasonex commented 6 years ago

Hi @Kuntal-G, thanks for replicating and posting the issue on our behalf.

Hi @dichenli ,

Thanks for picking this up. We are kind of blocked right now. Can you provide an ETA for this?
Really appreciate your help on this.

Thanks, Niloy

dichenli commented 6 years ago

I apologize that I can't provide an ETA yet. A thorough fix may take some time to reach GitHub, but to unblock yourself immediately, you can paste the quick-fix code snippet below in place of the problematic function. It should work most of the time.

@staticmethod
def udf_partition_name_from_keys_vals(table_name, keys, vals):
    if not keys:
        logging.error('Glue table is missing partition keys')  # TODO add table name to error msg
        return ''  # TODO should this raise an error, or filter out the problematic partitions?
    if not vals:
        logging.error('Glue table is missing partition values')
        return ''
    if len(keys) != len(vals):
        logging.error('Glue table has a different number of partition keys than partition values')
        return ''
    s_keys = []
    for k, v in zip(keys, vals):
        s_keys.append('%s=%s' % (k['name'], v))

    return '/'.join(s_keys)  # TODO escape special chars in keys and values, see https://github.com/apache/hive/blob/master/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/utils/FileUtils.java#L256

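On the escaping TODO: if your partition keys or values can contain characters like '/', '=', or '%', they need to be percent-encoded the way Hive's FileUtils.escapePathName does. A rough sketch follows; the character set below is a partial assumption, so check Hive's FileUtils source for the authoritative list:

```python
# Partial set of characters Hive escapes in partition path components
# (assumption: see FileUtils.java in the Hive source for the full list).
SPECIAL_CHARS = set('"#%\'*/:=?\\\x7f{[]^')

def escape_path_name(component):
    """Percent-encode special characters, Hive-style: '=' becomes '%3D'."""
    out = []
    for ch in component:
        if ch in SPECIAL_CHARS or ord(ch) < 32:
            out.append('%%%02X' % ord(ch))
        else:
            out.append(ch)
    return ''.join(out)

print(escape_path_name('month=02'))  # month%3D02
print(escape_path_name('plain'))     # plain
```

Applying this to each key name and value before joining would make the generated partition names safe for values that contain delimiter characters.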
jiajie999 commented 6 years ago

Got the same problem. Any update, @dichenli?

Thanks