apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Requesting Support for insert_overwrite in Delta Streamer #10896

Closed soumilshah1995 closed 3 months ago

soumilshah1995 commented 4 months ago

Hello,

I recently created a video tutorial on backfilling with Hudi, and during my experimentation, I encountered a challenge regarding the insert_overwrite method while using the Delta Streamer. I've been primarily working with PySpark for these tasks.

Video Link: Video on Backfilling with Hudi

Code Base: GitHub Repository

In my workflow, I intended to perform an insert_overwrite on an entire partition. However, when attempting to execute the insert_overwrite job, I encountered an error indicating that the method was not found.

I have successfully executed similar tasks using PySpark. My question is: does Delta Streamer support the insert_overwrite operation? If it does not currently support this operation, I would like to request that it be added to Delta Streamer.
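For reference, the PySpark path mentioned above can be sketched roughly as follows. This is a minimal sketch, not the code from the linked repository: the helper names and the table and field names (`orders`, `order_id`, `order_date`, `updated_at`) are hypothetical, and it assumes a Spark session that already has the Hudi bundle on its classpath. The option keys are the standard Hudi datasource write options.

```python
from typing import Dict


def insert_overwrite_options(
    table_name: str,
    record_key: str,
    partition_field: str,
    precombine_field: str,
) -> Dict[str, str]:
    """Build Hudi datasource write options for an insert_overwrite job."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
        # Replace the partitions present in the incoming data.
        "hoodie.datasource.write.operation": "insert_overwrite",
    }


def write_insert_overwrite(df, base_path: str, options: Dict[str, str]) -> None:
    """Write a Spark DataFrame to a Hudi table using the given options.

    `df` is assumed to be a pyspark.sql.DataFrame. The save mode stays
    "append"; the overwrite behaviour comes from the Hudi operation itself.
    """
    df.write.format("hudi").options(**options).mode("append").save(base_path)


# Hypothetical table and field names, for illustration only.
opts = insert_overwrite_options("orders", "order_id", "order_date", "updated_at")
```

With a DataFrame `df` holding the corrected rows for one partition, `write_insert_overwrite(df, base_path, opts)` would be the call that performs the backfill; `base_path` is whatever location the table lives at.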

I believe that supporting insert_overwrite would greatly benefit users who rely on Delta Streamer for data backfilling and other data management tasks.

Thank you for your attention to this matter. I look forward to hearing from the community regarding the feasibility of adding this feature.

Slack Thread https://apache-hudi.slack.com/archives/C4D716NPQ/p1710788154289249

ad1happy2go commented 4 months ago

Thanks @soumilshah1995 for the details. Can you share the full Hudi Streamer command/code you were using?

soumilshah1995 commented 4 months ago

It's here: https://github.com/soumilshah1995/DeltaHudiTransformations

soumilshah1995 commented 3 months ago

I tried --op insert_overwrite. I guess it's not natively supported; the docs also say --op supports insert | bulk_insert | upsert.

ad1happy2go commented 3 months ago

@soumilshah1995 I am wondering whether insert_overwrite even makes sense for streaming workloads. Do you have such a use case?

For sources like Kafka, it certainly doesn't make sense at all.

ad1happy2go commented 3 months ago

The only use case I can think of is a table maintenance activity, perhaps run in run_once mode.

soumilshah1995 commented 3 months ago

Could you confirm if DeltaStreamer supports "insert_overwrite"? If not, I'm interested in understanding why. The reason for this inquiry is that in scenarios where I'm utilizing SQLSource and need to rectify an entire partition from which I'm reading, I would prefer to use "insert_overwrite" as it facilitates index lookup, akin to what "upsert" would accomplish. Ideally, having support for "insert_overwrite" in DeltaStreamer would prove immensely beneficial.

soumilshah1995 commented 3 months ago

Adding insert_overwrite would also help with building a gold zone: I could read data from multiple Hudi tables and insert_overwrite into aggregated gold tables.

ad1happy2go commented 3 months ago

@soumilshah1995 This makes sense. Created a JIRA to track this as well: https://issues.apache.org/jira/browse/HUDI-7558

ad1happy2go commented 3 months ago

As Sudha suggested, can you also send a mail to the dev list and point to this conversation? It would be good to hear thoughts on this from others.

soumilshah1995 commented 3 months ago

Roger that

soumilshah1995 commented 3 months ago

Do you want me to close this?

soumilshah1995 commented 3 months ago

I'll send an email to dev@hudi.apache.org and close this thread.