apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Cannot create a hudi table that has a column starting with a digit #10553

Closed Mourya1319 closed 6 months ago

Mourya1319 commented 6 months ago

I am trying to create a Hudi table that has a column name starting with a digit, and I am getting the error below.

Steps to reproduce the behavior:

  1. create database if not exists db location 's3://<>'
  2. create table if not exists db.sample(id int, 360p string, 720p string) using hudi location 's3://<>'
  3. insert into db.sample values (1,'yes','yes'), (2,'no','no')
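
The failure surfaces at step 2, when the table is created, before any data is inserted (the stacktrace below goes through CreateHoodieTableCommand). A spark-shell sketch of the same repro (the S3 locations are placeholders, and the Hudi Spark bundle is assumed to be on the classpath):

```scala
import scala.util.Try

// Database creation succeeds; the column names are not involved yet.
spark.sql("create database if not exists db location 's3://<bucket>/db'")

// Fails with: Illegal initial character: 360p (full stacktrace below).
val created = Try(spark.sql(
  "create table if not exists db.sample(id int, 360p string, 720p string) " +
  "using hudi location 's3://<bucket>/sample'"))
```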

Expected behavior

Create a hudi table at specified location.


Stacktrace


```
org.apache.avro.SchemaParseException: Illegal initial character: 360p
  at org.apache.avro.Schema.validateName(Schema.java:1603)
  at org.apache.avro.Schema.access$400(Schema.java:92)
  at org.apache.avro.Schema$Field.<init>(Schema.java:556)
  at org.apache.avro.SchemaBuilder$FieldBuilder.completeField(SchemaBuilder.java:2258)
  at org.apache.avro.SchemaBuilder$FieldBuilder.completeField(SchemaBuilder.java:2254)
  at org.apache.avro.SchemaBuilder$FieldBuilder.access$5100(SchemaBuilder.java:2150)
  at org.apache.avro.SchemaBuilder$GenericDefault.noDefault(SchemaBuilder.java:2557)
  at org.apache.hudi.org.apache.spark.sql.avro.SchemaConverters$.$anonfun$toAvroType$2(SchemaConverters.scala:205)
  at scala.collection.Iterator.foreach(Iterator.scala:943)
  at scala.collection.Iterator.foreach$(Iterator.scala:943)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
  at org.apache.spark.sql.types.StructType.foreach(StructType.scala:105)
  at org.apache.hudi.org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:202)
  at org.apache.hudi.org.apache.spark.sql.avro.SchemaConverters$.$anonfun$toAvroType$2(SchemaConverters.scala:204)
  at scala.collection.Iterator.foreach(Iterator.scala:943)
  at scala.collection.Iterator.foreach$(Iterator.scala:943)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
  at org.apache.spark.sql.types.StructType.foreach(StructType.scala:105)
  at org.apache.hudi.org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:202)
  at org.apache.hudi.org.apache.spark.sql.avro.SchemaConverters$.$anonfun$toAvroType$2(SchemaConverters.scala:204)
  at scala.collection.Iterator.foreach(Iterator.scala:943)
  at scala.collection.Iterator.foreach$(Iterator.scala:943)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
  at org.apache.spark.sql.types.StructType.foreach(StructType.scala:105)
  at org.apache.hudi.org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:202)
  at org.apache.hudi.org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:186)
  at org.apache.hudi.org.apache.spark.sql.avro.SchemaConverters$.$anonfun$toAvroType$2(SchemaConverters.scala:204)
  at scala.collection.Iterator.foreach(Iterator.scala:943)
  at scala.collection.Iterator.foreach$(Iterator.scala:943)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
  at org.apache.spark.sql.types.StructType.foreach(StructType.scala:105)
  at org.apache.hudi.org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:202)
  at org.apache.spark.sql.catalyst.catalog.HoodieCatalogTable.initHoodieTable(HoodieCatalogTable.scala:217)
  at org.apache.spark.sql.hudi.command.CreateHoodieTableCommand.run(CreateHoodieTableCommand.scala:71)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:104)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
  at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:250)
  at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:123)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$9(SQLExecution.scala:160)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
  at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:250)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$8(SQLExecution.scala:160)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:271)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:159)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:69)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:101)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:554)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:107)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:554)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:530)
  at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:97)
  at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:84)
  at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:82)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:221)
  at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:101)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:98)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:640)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:630)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:671)
  ... 44 elided
```

My two cents about the problem:

I believe Hudi uses Avro as its data serialization framework: it writes data files as Parquet and metadata files as Avro. Avro schemas are defined in JSON and are validated during serialization and deserialization, and I suspect the problem is somehow related to that schema validation.

I wanted to know whether there is any Hudi configuration that would let us use Parquet for the metadata files as well. I have searched the Hudi documentation but couldn't find anything useful.
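
For reference, the check that fails in the stacktrace is Avro's own name validation: the Avro spec requires record and field names to start with a letter or underscore and to contain only letters, digits, and underscores afterwards, independent of anything JSON-specific. A minimal sketch (assuming only the plain org.apache.avro library on the classpath) that reproduces the same error outside of Hudi and Spark:

```scala
import org.apache.avro.SchemaBuilder

object AvroNameCheck extends App {
  // Field names starting with a letter or underscore pass Avro's validation.
  val ok = SchemaBuilder.record("sample").fields()
    .requiredInt("id")
    .requiredString("a360p")
    .endRecord()
  println(ok.toString(true))

  // A field name starting with a digit is rejected in Schema.validateName,
  // the same frame seen at the top of the stacktrace above:
  // org.apache.avro.SchemaParseException: Illegal initial character: 360p
  val bad = SchemaBuilder.record("sample").fields()
    .requiredString("360p")
    .endRecord()
}
```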
ad1happy2go commented 6 months ago

@Mourya1319 As of now, Hudi relies only on the Avro serialization framework. Not sure if this can be supported. cc @danny0405

Mourya1319 commented 6 months ago

Thanks for the reply! So there is no other way to have a column name starting with a digit in Hudi, @ad1happy2go?

danny0405 commented 6 months ago

It is feasible if the column data type is specified as numeric or string. What is the data type then?

Mourya1319 commented 6 months ago

It is string, @danny0405.

[Screenshot 2024-01-23 at 6:00:58 PM: the create table query failing with "Illegal initial character: 360p"]
danny0405 commented 6 months ago

Then it should be okay, as long as there are no unprintable characters.

Mourya1319 commented 6 months ago

@danny0405, in the above image, the query failed to create a Hudi table, and the error Illegal initial character: 360p occurs because the column name starts with the digit 3. This query: create table if not exists db3_hudi.sample(id int, a360p string, a720p string) using hudi, actually created a Hudi table without any error. So my understanding is that this query was able to create a table because its column names do not start with a digit.

I wanted to know whether I am doing something wrong, or whether there is any other way to create a Hudi table with column names starting with a digit. Kindly let me know.

Feel free to correct me if I am wrong.

ad1happy2go commented 6 months ago

@danny0405 Actually, the issue is not in the values but in the column name itself.

@Mourya1319 You are correct: Hudi doesn't support column names starting with a digit, because Avro schema validation fails for them.

[Screenshot: Avro schema validation failing for a field name that starts with a digit]
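
Until that changes in Avro, one possible workaround sketch (the letter-prefixed aliases p_360p and p_720p are hypothetical choices) is to store Avro-safe column names in the Hudi table and restore the digit-leading names only at read time; Spark DataFrame column names, unlike Avro field names, are allowed to start with a digit:

```scala
// From spark-shell: Avro-safe names in the table, original names on read.
// The S3 location is a placeholder, as in the original report.
spark.sql("""
  create table if not exists db.sample (
    id int,
    p_360p string,
    p_720p string
  ) using hudi location 's3://<bucket>/sample'
""")

spark.sql("insert into db.sample values (1, 'yes', 'yes'), (2, 'no', 'no')")

// DataFrame column names are not subject to Avro's naming rules, so the
// digit-leading names can be exposed to downstream consumers.
val renamed = spark.table("db.sample")
  .withColumnRenamed("p_360p", "360p")
  .withColumnRenamed("p_720p", "720p")
renamed.show()
```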
Mourya1319 commented 6 months ago

Thanks @ad1happy2go @danny0405 for the support!

ad1happy2go commented 6 months ago

@Mourya1319 Closing this then. Please reopen in case of any concerns. Thanks.