Open junkri opened 5 months ago
Thanks @junkri for raising this. We will look into this.
@junkri This looks different from https://github.com/apache/hudi/issues/7717, as that one was fixed by a later Spark version whose Parquet dependency included this fix (https://issues.apache.org/jira/browse/PARQUET-1441).
It is still failing with Spark 3.4 and Hudi 0.14.1. Created a JIRA to track a fix for this: https://issues.apache.org/jira/browse/HUDI-7602
Thank you very much for creating a Jira issue on this. I also found out that the same error is triggered when we have a decimal and a struct field with the same name, so this also causes issues:
spark.sql(s"""
create table trick(
tricky struct<tricky decimal(10,2)>
)
using hudi
location '$location'
""")
spark.sql("""
insert into trick
values (
named_struct('tricky', 1.2)
)
""") // works fine
spark.sql("""
insert into trick
values (
named_struct('tricky', 3.4)
)
""") // org.apache.avro.SchemaParseException: Can't redefine: tricky
I suspect this also happens because the decimal is represented as a fixed type with an empty namespace during the parquet->avro conversion.
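If that suspicion is right, the collision can be reproduced with plain Avro, without Hudi or Spark at all. Below is a minimal sketch of my own (the record and field names are made up) where two fixed definitions share the name tricky in the empty namespace but have different sizes:

import org.apache.avro.Schema

// Two full definitions of a fixed type named "tricky" (empty namespace).
// Avro treats the second full definition as an illegal redefinition rather
// than a reference to the first one, so parsing fails.
val twoFixed =
  """{
    |  "type": "record", "name": "trick",
    |  "fields": [
    |    {"name": "a", "type": {"type": "fixed", "name": "tricky", "size": 5,
    |                           "logicalType": "decimal", "precision": 10, "scale": 2}},
    |    {"name": "b", "type": {"type": "fixed", "name": "tricky", "size": 9,
    |                           "logicalType": "decimal", "precision": 20, "scale": 4}}
    |  ]
    |}""".stripMargin

// throws org.apache.avro.SchemaParseException: Can't redefine: tricky
new Schema.Parser().parse(twoFixed)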
Describe the problem you faced
When using decimal types, I ran into a problem where Hudi cannot write into a non-empty table, getting an exception like:
Caused by: org.apache.hudi.exception.HoodieException: org.apache.avro.SchemaParseException: Can't redefine: <field>
I can trigger this error in 2 ways:
1. decimal fields with the same name but different precision/scale, in different struct fields (so that I can use the same field name; see the sketch below)
2. a decimal field and a struct with the same name

To Reproduce
I created a small runnable github project with 2 small examples to trigger this error: https://github.com/junkri/hudi-cant-redefine-field-demo
You can run the examples with maven or from any IDE as well.
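For illustration, here is a sketch of the first variant, in the spirit of the linked project but not copied from it (the table and field names are made up, and $location is the same placeholder as in the earlier snippet):

// Two decimal fields named "amount" with different precision/scale, nested
// under different structs. The first insert into the empty table succeeds;
// the second one fails when Hudi reads the existing data back.
spark.sql(s"""
  create table redefine_demo(
    first  struct<amount decimal(10,2)>,
    second struct<amount decimal(20,4)>
  )
  using hudi
  location '$location'
""")

spark.sql("""
  insert into redefine_demo
  values (named_struct('amount', 1.23), named_struct('amount', 4.5678))
""") // works fine

spark.sql("""
  insert into redefine_demo
  values (named_struct('amount', 2.34), named_struct('amount', 5.6789))
""") // expected: org.apache.avro.SchemaParseException: Can't redefine: amount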
Expected behavior
I expect that I can use decimal fields in different structures without an issue.

Environment Description
I use AWS EMR serverless mainly, so I chose the versions from the last EMR 6 environment:
Hudi version : 0.14.1
Spark version : 3.4.1
Hive version : --
Hadoop version : --
Storage (HDFS/S3/GCS..) : local filesystem, but happens with S3 as well
Running on Docker? (yes/no) : no
Additional context
I am aware of https://github.com/apache/hudi/issues/7717, but here I don't use very complex structures, and in my case decimal fields cause the issue. I tried to force an update of the parquet-avro library in my project, but it didn't help.

I tried to debug into Hudi, and I saw that when it reads data back from Parquet and converts it to Avro, the decimal fields are created as the fixed Avro type, which has an empty namespace attribute! I guess that means such a decimal field can only be defined once in the whole Avro schema and must be re-used later, but because of the different precision/scale settings of my decimal fields (which have the same name), the size attribute of the fixed field has to differ, and that is when a field can't be redefined.

In our projects we use Kafka as the input source, so we define decimal fields with the bytes Avro type (and not the fixed one), i.e. something like {"type": "bytes", "logicalType": "decimal", "precision": 19, "scale": 6}. Maybe the parquet-avro library should use that as well?
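To make the contrast concrete, here is another sketch of my own (plain Avro, no Hudi involved): because bytes is not a named type, two bytes-backed decimals with different precision/scale can coexist in one schema, while the fixed-backed equivalent cannot (see the earlier sketch).

import org.apache.avro.Schema

// bytes is an unnamed Avro type, so there is nothing to redefine: two
// decimal fields with different precision/scale parse without complaint.
val bytesDecimals =
  """{
    |  "type": "record", "name": "trick",
    |  "fields": [
    |    {"name": "a", "type": {"type": "bytes", "logicalType": "decimal", "precision": 10, "scale": 2}},
    |    {"name": "b", "type": {"type": "bytes", "logicalType": "decimal", "precision": 19, "scale": 6}}
    |  ]
    |}""".stripMargin

new Schema.Parser().parse(bytesDecimals) // parses fine, no SchemaParseException

Stacktrace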