BCriswell opened this issue 1 year ago
Guess it is translated into a byte array with fixed precision and scale by Avro: https://avro.apache.org/docs/1.10.2/spec.html#schema_complex, see the fixed type part.
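For reference, a minimal sketch of that encoding in plain Python (no Avro library needed): Avro's decimal logical type stores the unscaled integer value as big-endian two's complement bytes. The 16-byte width and scale of 18 are assumptions matching a Decimal(38, 18) column.

```python
from decimal import Decimal

def avro_fixed_decimal_bytes(value: Decimal, scale: int = 18, size: int = 16) -> bytes:
    """Encode a decimal the way Avro's decimal logical type does:
    the unscaled integer value as big-endian two's complement bytes."""
    unscaled = int(value.scaleb(scale))  # shift the decimal point right by `scale`
    return unscaled.to_bytes(size, byteorder="big", signed=True)

print(list(avro_fixed_decimal_bytes(Decimal("208.000000000000000000"))))
# -> [0, 0, 0, 0, 0, 0, 0, 11, 70, 148, 113, 248, 1, 64, 0, 0]
# Java prints bytes signed, so 148 -> -108 and 248 -> -8, matching the array in the report.
```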
Hi @danny0405, is there any news on this issue or any plan to solve it?
We're planning to use the CDC format to handle some complex incremental processing use cases like the ones presented in this blog: https://www.onehouse.ai/blog/getting-started-incrementally-process-data-with-apache-hudi. However, with decimal values not returned correctly, we can't make use of the CDC format.
Sure, @phamvinh1712, would you mind firing a fix? It might be a minor fix for the Avro and JSON type conversion, I guess.
@danny0405 yep, let me take this up some time next week. I just found where the issue is.
@phamvinh1712 Thanks so much, I would be glad to review the PR.
I've noticed an issue with the data_before_after CDC mode not converting Spark DecimalType correctly. The decimals are converted to an array in the before and after JSON strings when the CDC data is saved, which then results in null values when converting back to a Row using F.from_json() with the original schema, because Spark can't cast the array to a valid DecimalType. Example:

Querying the Hudi table normally:
gljeln=Decimal('208.000000000000000000')
Querying using the CDC format + incremental options:
Row(op='i', ts_ms='20230425193451991', before='null', after='{"gljeln": [0, 0, 0, 0, 0, 0, 0, 11, 70, -108, 113, -8, 1, 64, 0, 0]...
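Decoding that array as big-endian two's complement and applying scale 18 gives back exactly 208, which confirms it is the Avro fixed encoding of the unscaled decimal value. A quick check in plain Python:

```python
from decimal import Decimal

# Signed bytes copied from the `after` JSON string above.
raw = [0, 0, 0, 0, 0, 0, 0, 11, 70, -108, 113, -8, 1, 64, 0, 0]
unscaled = int.from_bytes(bytes(b & 0xFF for b in raw), byteorder="big", signed=True)
print(Decimal(unscaled).scaleb(-18))  # -> 208.000000000000000000
```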
Steps to reproduce the behavior: see the example script under Additional context below.
Expected behavior
The decimal value should be serialized to an appropriate type (probably a string) that can be deserialized without corrupting the data.
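For illustration, a minimal PySpark sketch of the desired round trip, assuming the decimal were serialized as a JSON string (the gljeln column name and Decimal(38, 18) type are taken from the example above):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, DecimalType

spark = SparkSession.builder.getOrCreate()
schema = StructType([StructField("gljeln", DecimalType(38, 18))])

# If `after` carried the decimal as a string, from_json could cast it cleanly.
df = spark.createDataFrame([('{"gljeln": "208.000000000000000000"}',)], ["after"])
df.select(F.from_json("after", schema).alias("row")).select("row.gljeln").show(truncate=False)
# Expected output, roughly:
# +----------------------+
# |gljeln                |
# +----------------------+
# |208.000000000000000000|
# +----------------------+
```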
Environment Description
Hudi version : hudi-spark3.3-bundle_2.12-0.13.0.jar
Spark version : 3.3.1
Hive version : N/A
Hadoop version : N/A
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : Yes
Additional context
Example script to reproduce, and results: see the sketch below, which writes a table with CDC enabled and then runs the incremental CDC query; the result is a Row like the one shown above, with the decimal rendered as a byte array inside the after JSON string.
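A minimal reproduction sketch, assuming Hudi 0.13 on Spark 3.3 with the Hudi bundle on the classpath; the table name, base path, and gljeln column are illustrative, and the options used are the documented hoodie.table.cdc.* and incremental read configs:

```python
from decimal import Decimal
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DecimalType

spark = SparkSession.builder.getOrCreate()

base_path = "s3://bucket/tmp/hudi_cdc_decimal_test"  # illustrative path
schema = StructType([
    StructField("id", StringType()),
    StructField("gljeln", DecimalType(38, 18)),
])
df = spark.createDataFrame([("1", Decimal("208.000000000000000000"))], schema)

# Initialize the table with the CDC feature enabled (Hudi 0.13).
(df.write.format("hudi")
    .option("hoodie.table.name", "hudi_cdc_decimal_test")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.precombine.field", "id")
    .option("hoodie.table.cdc.enabled", "true")
    .option("hoodie.table.cdc.supplemental.logging.mode", "data_before_after")
    .mode("overwrite")
    .save(base_path))

# Incremental CDC query.
cdc = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.query.incremental.format", "cdc")
    .option("hoodie.datasource.read.begin.instanttime", "0")
    .load(base_path))
cdc.show(truncate=False)
# The `after` column comes back with gljeln rendered as a byte array, as in the
# Row shown above, and F.from_json on it with the original schema yields null.
```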