apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[FEATURE REQUEST] PartialUpdateAvroPayload Overwrite Existing Value With NULL #10091

Closed Hans-Raintree closed 11 months ago

Hans-Raintree commented 11 months ago

Hello,

I have a use case where I'm combining two tables into a single table. The tables might not be updated at the same time, so I'm using PartialUpdateAvroPayload to keep the data from the other table as is when only one of the tables is updated. This works great, but sometimes I need to actually update a value to NULL. I always know beforehand which records need this, so an option to specify a special value that results in the column being set to NULL would fix my issue.

For example:

'hoodie.datasource.write.null_value': '@@to_null'

I would set the columns that I actually want to overwrite with NULL to that value, while columns that are NULL would work as before.

This might be more complicated for integer/timestamp columns though.
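
To make the request concrete, here is a minimal sketch of the proposed merge semantics using plain Python dicts. It does not use Hudi's API and ignores ordering fields; the function name partial_update and its null_value parameter are hypothetical, and '@@to_null' is just the example value from the config line above.

```python
from typing import Any, Dict, Optional

# Hypothetical sentinel, copied from the example config value above.
TO_NULL = "@@to_null"


def partial_update(
    current: Dict[str, Any],
    incoming: Dict[str, Any],
    null_value: Optional[str] = TO_NULL,
) -> Dict[str, Any]:
    """Merge incoming into current, field by field.

    - incoming value is None       -> keep the existing value (partial update)
    - incoming value == null_value -> overwrite the existing value with None
    - anything else                -> overwrite with the incoming value
    """
    merged = dict(current)
    for field, value in incoming.items():
        if value is None:
            continue  # untouched column: keep what is already stored
        if null_value is not None and value == null_value:
            merged[field] = None  # explicit request to null the column
        else:
            merged[field] = value
    return merged


existing = {"id": 1, "name": "a", "email": "a@example.org", "score": 10}
update = {"id": 1, "name": None, "email": TO_NULL, "score": None}
result = partial_update(existing, update)
```

In this sketch, name and score keep their stored values and only email becomes None. The sentinel is a string, so it only fits string columns in a typed record, which is the integer/timestamp complication mentioned above.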

I'm aware of the custom payload functionality, but since I'm running Hudi in AWS EMR Serverless, there seem to be compatibility issues every time I try to submit my own jars instead of using the AWS Hudi jars.

ad1happy2go commented 11 months ago

@Hans-Raintree You can write your own custom payload to match your requirements and configure the columns the way you want. Refer to this example: https://gist.github.com/bhasudha/7ea07f2bb9abc5c6eb86dbd914eec4c6

Let me know in case of any doubts. Thanks.

ad1happy2go commented 11 months ago

But I guess it makes sense to have something like this in our PartialUpdateAvroPayload. We could have a column list that is forcefully updated even if the incoming value is null.

JIRA - https://issues.apache.org/jira/browse/HUDI-7091
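
A rough sketch of what the column-list idea above could look like, again with plain Python dicts rather than Hudi's payload classes. The function name and the force_columns parameter are hypothetical and not an existing Hudi option.

```python
from typing import Any, Dict, Iterable


def partial_update_with_forced_columns(
    current: Dict[str, Any],
    incoming: Dict[str, Any],
    force_columns: Iterable[str] = (),
) -> Dict[str, Any]:
    """Partial update where listed columns are always overwritten.

    Columns in force_columns take the incoming value even when it is None;
    every other column keeps the stored value when the incoming one is None.
    """
    forced = set(force_columns)
    merged = dict(current)
    for field, value in incoming.items():
        if value is not None or field in forced:
            merged[field] = value
    return merged


stored = {"id": 1, "name": "a", "email": "a@example.org"}
incoming_row = {"id": 1, "name": None, "email": None}
forced_result = partial_update_with_forced_columns(
    stored, incoming_row, force_columns=["email"]
)
```

Here email is set to None because it is in the forced list, while name keeps its stored value. Because the list applies to a whole write rather than to individual records, each source table would need its own upsert with its own column list.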

Hans-Raintree commented 11 months ago

Hey @ad1happy2go,

That would work as well. I would have to do two upserts, one for each table I'm merging into the single table.

Thanks for putting in the ticket!

The issue with the custom payload, as I understand it, is that I would have to compile my own jars, but I'm using AWS EMR, which uses custom jars, e.g. "hudi-spark3-bundle_2.12-0.13.1-amzn-0.jar". When I try to run with the standard jars I run into compatibility issues. It would be nice if there were a way to add custom payloads/commit callbacks separately without having to build the entire packages.

ad1happy2go commented 11 months ago

@Hans-Raintree You can build your own custom jar with the Hudi OSS dependency scope set to provided. That way your jar includes only the new class, and you can add it as an extra dependency with --jars.
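
As a sketch of the write side of this approach, the options below point a PySpark Hudi write at a custom payload class. The class name com.example.PartialUpdateWithNullsPayload, the table name, and the field names are hypothetical placeholders; the hoodie.* keys are standard Hudi datasource write options.

```python
from typing import Dict


def hudi_write_options(
    table_name: str,
    payload_class: str,
    record_key: str,
    precombine_field: str,
) -> Dict[str, str]:
    """Build Hudi datasource options that select a custom payload class."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        # The class must be on the classpath, e.g. from the small jar
        # passed to spark-submit with --jars.
        "hoodie.datasource.write.payload.class": payload_class,
    }


options = hudi_write_options(
    table_name="combined_table",  # hypothetical
    payload_class="com.example.PartialUpdateWithNullsPayload",  # hypothetical
    record_key="id",  # hypothetical
    precombine_field="updated_at",  # hypothetical
)

# With a SparkSession and a DataFrame df available, the write would look like:
# df.write.format("hudi").options(**options).mode("append").save(base_path)
```

The jar that holds only the custom class is supplied separately at submit time, so the vendor-provided Hudi bundle stays in place.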

ad1happy2go commented 11 months ago

@Hans-Raintree Let us know if you face any issues implementing the custom payload. If all is good, feel free to close this issue.

Hans-Raintree commented 11 months ago

Thanks @ad1happy2go, I went the route of implementing my own custom payload and it seems to work!

nsivabalan commented 11 months ago

Hey @Hans-Raintree: would you mind contributing the custom payload to the community? We could call it PartialUpdateWithOptionalNullsAvroPayload, maybe.

ad1happy2go commented 8 months ago

@Hans-Raintree If possible, could you please share the implementation? Someone else in the community was asking for the same thing.