aws-samples / aws-glue-samples

AWS Glue code samples
MIT No Attribution

Make the Hive Metastore Migration ETL job work with Python3 in Glue 2 and Glue 3 #124

Closed leejianwei closed 2 years ago

leejianwei commented 2 years ago

This pull request is to make the Hive Metastore Migration ETL job compatible with Python 3 on Glue 2.0 and Glue 3.0.

Description of the issue: The original ETL job script failed with the following error under Python 3 on Glue 2 and Glue 3:

2022-05-07 09:08:18,037 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(70)): Error from Python:Traceback (most recent call last):
  File "/tmp/import_into_datacatalog.py", line 130, in <module>
    main()
  File "/tmp/import_into_datacatalog.py", line 126, in main
    region=options.get('region') or 'us-east-1'
  File "/tmp/import_into_datacatalog.py", line 51, in metastore_full_migration
    sc, sql_context, db_prefix, table_prefix).transform(hive_metastore)
  File "/tmp/localPyFiles-2f7485a4-1b85-49e8-881e-ab0046592867/hive_metastore_migration.py", line 753, in transform
    ms_database_params=hive_metastore.ms_database_params)
  File "/tmp/localPyFiles-2f7485a4-1b85-49e8-881e-ab0046592867/hive_metastore_migration.py", line 734, in transform_databases
    dbs_with_params = self.join_with_params(df=ms_dbs, df_params=ms_database_params, id_col='DB_ID')
  File "/tmp/localPyFiles-2f7485a4-1b85-49e8-881e-ab0046592867/hive_metastore_migration.py", line 336, in join_with_params
    df_params_map = self.transform_params(params_df=df_params, id_col=id_col)
  File "/tmp/localPyFiles-2f7485a4-1b85-49e8-881e-ab0046592867/hive_metastore_migration.py", line 314, in transform_params
    return self.kv_pair_to_map(params_df, id_col, key, value, 'parameters')
  File "/tmp/localPyFiles-2f7485a4-1b85-49e8-881e-ab0046592867/hive_metastore_migration.py", line 326, in kv_pair_to_map
    id_type = df.get_schema_type(id_col)
  File "/tmp/localPyFiles-2f7485a4-1b85-49e8-881e-ab0046592867/hive_metastore_migration.py", line 199, in get_schema_type
    return df.select(column_name).schema.fields[0].dataType
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 1320, in select
    jdf = self._jdf.select(self._jcols(*cols))
AttributeError: 'str' object has no attribute '_jdf'

Description of changes: The root cause of the above error is that the ETL script attached functions to objects dynamically via MethodType (for example, line 287 in the new file), while some of those same functions were also invoked as direct function calls. In Python 3, a method bound dynamically to an instance behaves differently from a direct function call: the bound instance is passed implicitly as the first argument, so using one call style with the other binding shifts every argument by one position. This change creates a separate method for each scenario so the script runs correctly on Python 3.
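A minimal sketch of the hazard described above, using hypothetical stand-in classes rather than real Spark objects: the same helper works when its call style matches its binding, but mixing the two styles puts a string where a DataFrame is expected, which is how an AttributeError of the form "'str' object has no attribute ..." can surface in Python 3.

```python
from types import MethodType

# Hypothetical stand-ins for Spark objects, for illustration only.
class Field:
    def __init__(self, data_type):
        self.dataType = data_type

class FakeDataFrame:
    def __init__(self, types):
        self._types = types

    def field_for(self, column_name):
        return Field(self._types[column_name])

# A helper written to be called as a plain function: helper(df, col).
def get_schema_type(df, column_name):
    return df.field_for(column_name).dataType

df = FakeDataFrame({"DB_ID": "bigint"})

# Call style 1: direct call -- the DataFrame is passed explicitly.
assert get_schema_type(df, "DB_ID") == "bigint"

# Call style 2: dynamic binding via MethodType -- the DataFrame is now
# passed implicitly, so the call site must NOT pass it again.
df.get_schema_type = MethodType(get_schema_type, df)
assert df.get_schema_type("DB_ID") == "bigint"

# Mixing the styles shifts the arguments: the column-name string lands in
# the `df` parameter, and the next attribute access on it fails.
try:
    get_schema_type("DB_ID", "ignored")
except AttributeError as e:
    print(e)  # a str has no field_for / _jdf-style attribute
```

Keeping one dedicated function per call style, as this pull request does, removes the ambiguity entirely.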

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

moomindani commented 2 years ago

Merged. Thank you for your contribution!