aws-samples / aws-glue-samples

AWS Glue code samples
MIT No Attribution

Issue migrating directly from Hive Metastore to Glue Data Catalog #112

Open vinceRicchiuti opened 2 years ago

vinceRicchiuti commented 2 years ago

I am trying to migrate my Hive Metastore (on RDS) to my Glue Data Catalog.

I configured the job to run as a Spark job, with all kinds of matching.

I followed the README to migrate directly from the Hive Metastore to the AWS Glue Data Catalog, but I get the error "'str' object has no attribute '_jdf'" when I run the Glue ETL job. See the full error message below:

2022-01-27 16:53:53,940 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(73)): Error from Python:Traceback (most recent call last):
  File "/tmp/import_into_datacatalog.py", line 130, in <module>
    main()
  File "/tmp/import_into_datacatalog.py", line 126, in main
    region=options.get('region') or 'us-east-1'
  File "/tmp/import_into_datacatalog.py", line 51, in metastore_full_migration
    sc, sql_context, db_prefix, table_prefix).transform(hive_metastore)
  File "/tmp/localPyFiles-0b1af0c4-b70f-4147-a11b-965a99faeb92/hive_metastore_migration.py", line 753, in transform
    ms_database_params=hive_metastore.ms_database_params)
  File "/tmp/localPyFiles-0b1af0c4-b70f-4147-a11b-965a99faeb92/hive_metastore_migration.py", line 734, in transform_databases
    dbs_with_params = self.join_with_params(df=ms_dbs, df_params=ms_database_params, id_col='DB_ID')
  File "/tmp/localPyFiles-0b1af0c4-b70f-4147-a11b-965a99faeb92/hive_metastore_migration.py", line 336, in join_with_params
    df_params_map = self.transform_params(params_df=df_params, id_col=id_col)
  File "/tmp/localPyFiles-0b1af0c4-b70f-4147-a11b-965a99faeb92/hive_metastore_migration.py", line 314, in transform_params
    return self.kv_pair_to_map(params_df, id_col, key, value, 'parameters')
  File "/tmp/localPyFiles-0b1af0c4-b70f-4147-a11b-965a99faeb92/hive_metastore_migration.py", line 326, in kv_pair_to_map
    id_type = df.get_schema_type(id_col)
  File "/tmp/localPyFiles-0b1af0c4-b70f-4147-a11b-965a99faeb92/hive_metastore_migration.py", line 199, in get_schema_type
    return df.select(column_name).schema.fields[0].dataType
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 1671, in select
    jdf = self._jdf.select(self._jcols(*cols))
AttributeError: 'str' object has no attribute '_jdf'

I don't know how to handle this error. Could you give me some help or suggestions?
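
For what it's worth, the last frame of the traceback shows `select` running with a string where `self` (the DataFrame) should be. Below is a minimal sketch of one way that message can arise. It uses a hypothetical stand-in class (`FakeDataFrame`) rather than the real `pyspark.sql.DataFrame`, and it assumes, without having confirmed it in `hive_metastore_migration.py`, that the helper ends up bound to the class itself instead of to an instance:

```python
from types import MethodType


class FakeDataFrame:
    """Hypothetical stand-in for pyspark.sql.DataFrame; not the real class."""

    def __init__(self, jdf):
        self._jdf = jdf

    def select(self, *cols):
        # Like the real method in the traceback, this reads self._jdf first.
        return (self._jdf, cols)


def get_schema_type(df, column_name):
    # Same call shape as the helper at line 199 in the traceback.
    return df.select(column_name)


# Assumption (unverified): the helper is bound to the class, not an instance.
# The class then arrives as `df`, and the column name shifts into `self`
# when `select` is looked up on the class and called as a plain function.
FakeDataFrame.get_schema_type = MethodType(get_schema_type, FakeDataFrame)


def reproduce():
    """Return the error message raised by the mis-bound helper, if any."""
    frame = FakeDataFrame(jdf="placeholder")
    try:
        frame.get_schema_type("DB_ID")
    except AttributeError as exc:
        return str(exc)
    return None


print(reproduce())
```

If this is what is happening, the string in the error would be the column name (`'DB_ID'` in my traceback) rather than anything coming from the metastore data, but I have not verified that against the actual script.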