apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Can not create a Path from an empty string on unpartitioned table #2797

Closed: vansimonsen closed this issue 3 years ago

vansimonsen commented 3 years ago

Describe the problem you faced

When writing an unpartitioned Hudi table with Hive sync enabled against the AWS Glue catalog, the write itself completes, but the Hive sync step fails with IllegalArgumentException: Can not create a Path from an empty string (full stacktrace below).
To Reproduce

Steps to reproduce the behavior:

  1. Run Hudi with Hive integration enabled
  2. Try to create an unpartitioned table with the configuration previously specified (a minimal sketch of such a write is shown below)
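
For reference, a minimal sketch of the kind of write that triggers this (the table name, bucket, and record key field are placeholders, not my actual config; the option keys are the standard Hudi 0.x datasource options):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hudi-unpartitioned-repro").getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

    hudi_options = {
        "hoodie.table.name": "my_table",                    # placeholder
        "hoodie.datasource.write.recordkey.field": "id",
        "hoodie.datasource.write.partitionpath.field": "",  # unpartitioned
        "hoodie.datasource.write.precombine.field": "id",
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.database": "default",
        "hoodie.datasource.hive_sync.table": "my_table",
    }

    (df.write.format("hudi")
       .options(**hudi_options)
       .mode("overwrite")
       .save("s3://my-bucket/my_table"))  # placeholder base path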

Expected behavior

The table would be created without throwing the exception, with no partitions and no default partition path.

Environment Description

Stacktrace

 org.apache.hudi.hive.HoodieHiveSyncException: Failed to get update last commit time synced to 20210407181606
    at org.apache.hudi.hive.HoodieHiveClient.updateLastCommitTimeSynced(HoodieHiveClient.java:496)
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:150)
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:94)
    at org.apache.hudi.HoodieSparkSqlWriter$.org$apache$hudi$HoodieSparkSqlWriter$$syncHive(HoodieSparkSqlWriter.scala:355)
    at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$metaSync$2.apply(HoodieSparkSqlWriter.scala:403)
    at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$metaSync$2.apply(HoodieSparkSqlWriter.scala:399)
    at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
    at org.apache.hudi.HoodieSparkSqlWriter$.metaSync(HoodieSparkSqlWriter.scala:399)
    at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:460)
    at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:217)
    at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:134)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:173)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:169)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:197)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:194)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:169)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:114)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:112)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
    at org.apache.spark.sql.execution.SQLExecution$.org$apache$spark$sql$execution$SQLExecution$$executeQuery$1(SQLExecution.scala:83)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1$$anonfun$apply$1.apply(SQLExecution.scala:94)
    at org.apache.spark.sql.execution.QueryExecutionMetrics$.withMetrics(QueryExecutionMetrics.scala:141)
    at org.apache.spark.sql.execution.SQLExecution$.org$apache$spark$sql$execution$SQLExecution$$withMetrics(SQLExecution.scala:178)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:93)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:200)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:92)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:696)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:305)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:291)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:249)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Can not create a Path from an empty string
    at org.apache.hadoop.fs.Path.checkPathArg(Path.java:168)
    at org.apache.hadoop.fs.Path.<init>(Path.java:180)
    at org.apache.hadoop.hive.metastore.Warehouse.getDatabasePath(Warehouse.java:172)
    at org.apache.hadoop.hive.metastore.Warehouse.getTablePath(Warehouse.java:184)
    at org.apache.hadoop.hive.metastore.Warehouse.getFileStatusesForUnpartitionedTable(Warehouse.java:520)
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.updateUnpartitionedTableStatsFast(MetaStoreUtils.java:180)
    at com.amazonaws.glue.shims.AwsGlueSparkHiveShims.updateTableStatsFast(AwsGlueSparkHiveShims.java:62)
    at com.amazonaws.glue.catalog.metastore.GlueMetastoreClientDelegate.alterTable(GlueMetastoreClientDelegate.java:552)
    at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.alter_table(AWSCatalogMetastoreClient.java:400)
    at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.alter_table(AWSCatalogMetastoreClient.java:385)
    at org.apache.hudi.hive.HoodieHiveClient.updateLastCommitTimeSynced(HoodieHiveClient.java:494)
    ... 46 more
aditiwari01 commented 3 years ago

Issue (https://github.com/apache/hudi/issues/2801) might be a duplicate.

However, while creating an unpartitioned table, my dataframe.write succeeds, but I am not able to query the data via Hive. Spark reads are working fine for me. (Testing via spark-shell; I am using JDBC to connect to Hive.)
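
The two read paths I'm comparing look roughly like this (paths and connection details are placeholders; the Hive side is shown with PyHive purely for illustration):

    # Spark read works fine:
    df = spark.read.format("hudi").load("s3://my-bucket/my_table")  # placeholder path
    df.show()

    # Hive read over JDBC is what fails for me; illustrated with PyHive
    # (host, port, and table name are placeholders):
    from pyhive import hive

    conn = hive.Connection(host="hiveserver2.example.com", port=10000)
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM my_table LIMIT 10")
    print(cursor.fetchall())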

n3nash commented 3 years ago

@vansimonsen Can you look at the issue that @aditiwari01 is pointing to and verify that you are using the correct KeyGenerator as well as PartitionValueExtractor (see https://hudi.apache.org/docs/configurations.html#HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY)?
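
For an unpartitioned table, the usual pairing looks something like this (a sketch using the 0.x option keys; merge these into your write options):

    # Key generator / partition extractor pairing for unpartitioned tables:
    nonpartitioned_options = {
        "hoodie.datasource.write.keygenerator.class":
            "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
        "hoodie.datasource.hive_sync.partition_extractor_class":
            "org.apache.hudi.hive.NonPartitionedExtractor",
    }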

Additionally, it looks like the basePath might not have been correctly registered in Glue. Let me know after you check these configs; if they don't work, this may be a legitimate bug.

ismailsimsek commented 3 years ago

It might be related to a missing Glue database S3 path; the field is named "Amazon S3 path" (Lake Formation) or "Location" (Glue) in the AWS console.

As far as I can see, at one point the code tries to construct a path like getDatabasePath + tableName. In my case it was creating s3://MyBucketMytable because of the missing / at the end of the database Location.
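
Roughly what happens, sketched in Python (illustrative only, not the actual metastore code):

    # Illustrative only -- not the actual Hive metastore code.
    db_location = "s3://MyBucket"   # Glue "Location" without a trailing slash
    table_name = "Mytable"

    broken = db_location + table_name                   # "s3://MyBucketMytable"
    fixed = db_location.rstrip("/") + "/" + table_name  # "s3://MyBucket/Mytable"
    print(broken)
    print(fixed)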

n3nash commented 3 years ago

@ismailsimsek Are you saying it was fixed after you changed the database path / location in your Glue metastore to include the trailing /? Is the / always expected at the end of the path? If so, we can probably put that fix into the Hudi Hive sync.

@vansimonsen Can you check if this is the root cause for you ?

n3nash commented 3 years ago

@ismailsimsek @vansimonsen Closing this due to inactivity; please re-open it or open a new one if you need further assistance.

pranotishanbhag commented 3 years ago

I am facing the same issue. Could you please share the fix? I am using Hudi version 0.8.