apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Hudi 0.5.2 unable to save complex type with nullable = true #1550

Closed badion closed 4 years ago

badion commented 4 years ago

Currently we are working with Hudi 0.5.0 and AWS Glue; everything works fine for .parquet and COW mode, with complex types in the data and different nullable options.

After switching to Hudi 0.5.2, we started facing issues related to:

https://github.com/apache/incubator-hudi/pull/1406

The Spark application fails while writing a DataFrame into a Hudi table when using complex types like:

{
   "city":[
      {
         "name":"some_name",
         "index":"some_index"
      }
   ]
}

with nullable = true for its fields. Up to the moment of saving, everything is fine, and we can see the complete dataframe:

+-------------------------+
|city                     |
+-------------------------+
|[[some_name, some_index]]|
+-------------------------+
root
 |-- city: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- index: string (nullable = true)

Note that all simple types save fine into the Hudi table, as do complex types with nullable = false.
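Since the nullable = false variants do save correctly, one possible stopgap (an untested sketch, not an official Hudi workaround; the helper name `force_non_nullable` is ours) is to flip the nullability flags in the schema before writing. Spark's `StructType` round-trips through a JSON dict (`schema.jsonValue()` / `StructType.fromJson(...)`), so the transform can be done on plain dicts:

```python
# Hypothetical helper: recursively force nullable/containsNull/valueContainsNull
# to False in a Spark schema expressed as JSON (the dict form produced by
# StructType.jsonValue()); round-trip with StructType.fromJson() afterwards.
def force_non_nullable(node):
    if isinstance(node, dict):
        out = {}
        for key, value in node.items():
            if key in ("nullable", "containsNull", "valueContainsNull"):
                out[key] = False
            else:
                out[key] = force_non_nullable(value)
        return out
    if isinstance(node, list):
        return [force_non_nullable(item) for item in node]
    return node

# JSON form of the schema from this issue (city: array<struct<name, index>>).
city_schema = {
    "type": "struct",
    "fields": [{
        "name": "city",
        "nullable": True,
        "metadata": {},
        "type": {
            "type": "array",
            "containsNull": True,
            "elementType": {
                "type": "struct",
                "fields": [
                    {"name": "name", "type": "string", "nullable": True, "metadata": {}},
                    {"name": "index", "type": "string", "nullable": True, "metadata": {}},
                ],
            },
        },
    }],
}

strict = force_non_nullable(city_schema)
```

Note that this only helps if the data genuinely contains no nulls; otherwise the write will fail on the non-nullable columns instead.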

Steps to reproduce the behavior:

from pathlib import Path

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

spark = SparkSession.builder \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.jars.packages",
            "org.apache.hudi:hudi-spark-bundle_2.11:0.5.2-incubating,org.apache.spark:spark-avro_2.11:2.4.4") \
    .appName('nested_type_hudi') \
    .enableHiveSupport() \
    .getOrCreate()

PROJECT_PATH = str(Path(__file__).parent)

input_data = """{"city":[{"name":"some_name","index":"some_index"}]}"""

schema = StructType([
    StructField('city', ArrayType(StructType([StructField('name', StringType(), True),
                                              StructField('index', StringType(), True)]), True), True)
])

options = {
    'hoodie.table.name': "nested_hierarchy_example",
    'hoodie.datasource.write.precombine.field': "object_ts",
    'hoodie.datasource.write.recordkey.field': "recordkey"
}


def write_table(df, options, mode, output_dir):
    df.write.format("org.apache.hudi").options(**options).mode(mode).save(output_dir)


nested_hierarchy_df = spark.read.schema(schema).json(spark.sparkContext.parallelize([input_data])) \
    .withColumn('object_ts', lit(123)) \
    .withColumn('recordkey', lit('abc'))

write_table(nested_hierarchy_df, options, 'append', f'file://{PROJECT_PATH}/test_data/nested_output')

Expected behavior: the Hudi table should be saved successfully in parquet format with complex-type fields that have nullable = true. Hudi 0.5.0 works fine with all varieties of complex types and nullable fields.

Local/AWS Glue 1.0:

Stacktrace

java.io.IOException: Could not create payload for class: org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
    at org.apache.hudi.DataSourceUtils.createPayload(DataSourceUtils.java:125)
    at org.apache.hudi.DataSourceUtils.createHoodieRecord(DataSourceUtils.java:178)
    at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$1.apply(HoodieSparkSqlWriter.scala:102)
    at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$1.apply(HoodieSparkSqlWriter.scala:99)

...

Caused by: org.apache.avro.UnresolvedUnionException: Not in union [{"type":"record","name":"city","namespace":"hoodie.nested_hierarchy_example.nested_hierarchy_example_record","fields":[{"name":"name","type":["string","null"]},{"name":"index","type":["string","null"]}]},"null"]: {"name": "some_name", "index": "some_index"}
    at org.apache.avro.generic.GenericData.resolveUnion(GenericData.java:740)
    at org.apache.avro.generic.GenericDatumWriter.resolveUnion(GenericDatumWriter.java:205)
    at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:123)

...

Caused by: org.apache.hudi.exception.HoodieException: Unable to instantiate class 
    at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:80)
    at org.apache.hudi.DataSourceUtils.createPayload(DataSourceUtils.java:122)
    ... 28 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
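For context on the UnresolvedUnionException above: Avro's GenericData.resolveUnion has to match each datum against exactly one branch of the union schema (here a record branch and "null"), and throws when no branch accepts it. The following is a simplified, illustrative model of that resolution step (not Avro's actual implementation):

```python
# Illustrative model of Avro-style union resolution (not real Avro code):
# a datum must match some branch of the union, or resolution fails with
# the equivalent of UnresolvedUnionException.
def resolve_union(branches, datum):
    for index, branch in enumerate(branches):
        if branch == "null":
            if datum is None:
                return index
        elif isinstance(branch, dict) and branch.get("type") == "record":
            declared = {field["name"] for field in branch["fields"]}
            # Real Avro also checks field types and the datum's runtime class;
            # a mismatch there is what surfaces as UnresolvedUnionException.
            if isinstance(datum, dict) and set(datum) == declared:
                return index
    raise ValueError(f"Not in union {branches}: {datum}")

# The union from the stack trace: [record city, "null"].
city_union = [
    {"type": "record", "name": "city",
     "fields": [{"name": "name", "type": ["string", "null"]},
                {"name": "index", "type": ["string", "null"]}]},
    "null",
]
```

In the failing case, the record datum converted from the Spark Row evidently did not pass the record branch's checks, so resolution fell through to the exception.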

Is this already a known issue for Hudi versions greater than 0.5.0? Is there a workaround that would allow us to upgrade to 0.5.2?

badion commented 4 years ago

As a note, our Hudi 0.5.2 jar was packaged from master one day ago.

vinothchandar commented 4 years ago

@badion This does seem directly related to the complex types issue fixed recently.. In 0.5.1/0.5.2 we moved from databricks-avro to spark-avro, and this seems like a miss.

Are you interested in a custom patch for this on top of 0.5.2? Not sure I follow the last sentence.. Please clarify, happy to get this moving along for you..

cc @umehrot2 @zhedoubushishi as well to chime in

badion commented 4 years ago

@vinothchandar Seems like the issue is gone after building the .jar from the merge commit ce0a4c64d07d6eea926d1bfb92b69ae387b88f50, which apparently landed after the Hudi 0.5.2 release. One other thing: we tried the Hudi jar from Maven Central, and it seems it doesn't have the Avro fix yet.

I think we will wait for the next release, which will include those changes.

umehrot2 commented 4 years ago

@badion yeah, the fix for this did not make it into 0.5.2. You can either build a custom Hudi with this patch applied on top of 0.5.2 or wait until the next release.

bvaradar commented 4 years ago

Closing this issue as it will be resolved in the next release.

rolandjohann commented 4 years ago

First, thanks for the great lib; it massively reduces the complexity of our ETL pipelines!

Is the next release date in the near future? I'm asking because the latest release contains this critical bug that simply makes the library unusable. I'm currently evaluating Hudi as an alternative to Delta Lake and hit this issue pretty quickly. Would it be possible to release a hotfix so that new users can start working with the lib by following the getting-started section and then move on to more complex data models?

vinothchandar commented 4 years ago

@rolandjohann Thanks for the feedback.. We are trying to bundle a few more such fixes and release 0.6.0 later this month... backporting some fixes onto 0.5.2 and doing a 0.5.3 may make sense though.. Let me bring this up with the community and see how everyone feels..

vinothchandar commented 4 years ago

You can follow this here btw https://lists.apache.org/thread.html/r1fb5ad5547f55f40b20306dac90a711c9c0e29f6855f63b6b2118987%40%3Cdev.hudi.apache.org%3E

nikitap95 commented 4 years ago

Hi, any updates on when would this be released and rolled out?

vinothchandar commented 4 years ago

@nsivabalan is driving the release.. We are planning to do a 0.5.3 this week, right Siva? This release will have the fix.. @nikitap95 if interested, you can join the mailing list and help validate the release candidate :)

nikitap95 commented 4 years ago

@vinothchandar Thanks for your prompt response. Will wait for the release in that case rather than using the patch. Sure, I'll get myself added to it, would be great to be a part of it!

nsivabalan commented 4 years ago

yes, I should have a candidate up for voting by today or tomorrow.