aws-samples / aws-glue-samples

AWS Glue code samples
MIT No Attribution

hive_metastore_migration.py fails with AttributeError: 'str' object has no attribute '_jdf' #120

Open jobennin opened 2 years ago

jobennin commented 2 years ago

Running the HMS migration script with the spark-submit command fails with: AttributeError: 'str' object has no attribute '_jdf'

which is triggered by the call: id_type = df.get_schema_type(id_col)

If I change the call to: id_type = get_schema_type(df, id_col) I get past this error, but other DataFrame-related errors then surface in other functions.

This is tested on:

"emr-5.31.0" "Hadoop":"2.10.0" "Hive":"2.3.7" "Spark":"2.4.6"

Full stack trace:

    Traceback (most recent call last):
      File "/home/hadoop/hive_metastore_migration.py", line 1525, in <module>
        main()
      File "/home/hadoop/hive_metastore_migration.py", line 1519, in main
        etl_from_metastore(sc, sql_context, db_prefix, table_prefix, hive_metastore, options)
      File "/home/hadoop/hive_metastore_migration.py", line 1414, in etl_from_metastore
        .transform(hive_metastore)
      File "/home/hadoop/hive_metastore_migration.py", line 753, in transform
        ms_database_params=hive_metastore.ms_database_params)
      File "/home/hadoop/hive_metastore_migration.py", line 734, in transform_databases
        dbs_with_params = self.join_with_params(df=ms_dbs, df_params=ms_database_params, id_col='DB_ID')
      File "/home/hadoop/hive_metastore_migration.py", line 336, in join_with_params
        df_params_map = self.transform_params(params_df=df_params, id_col=id_col)
      File "/home/hadoop/hive_metastore_migration.py", line 314, in transform_params
        return self.kv_pair_to_map(params_df, id_col, key, value, 'parameters')
      File "/home/hadoop/hive_metastore_migration.py", line 326, in kv_pair_to_map
        id_type = df.get_schema_type(id_col)
      File "/home/hadoop/hive_metastore_migration.py", line 199, in get_schema_type
        return df.select(column_name).schema.fields[0].dataType
      File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 1327, in select
    AttributeError: 'str' object has no attribute '_jdf'

I have also tried EMR v6.5 with Spark v3.1.2 and get the same error, so I suspected a Spark version issue. Which Spark and EMR versions has this script been run successfully with? I launch spark-submit as described in the README, with the --jdbc* options changed as needed.

Dearkano commented 2 years ago

same issue here