Closed: soumilshah1995 closed this 3 months ago
Thanks @soumilshah1995 for the details. Can you share the full Hudi Streamer command/code you were using?
I tried --op insert_overwrite. I guess it's not natively supported; the docs also say it supports insert | bulk_insert | upsert.
@soumilshah1995 I am just wondering whether insert_overwrite even makes sense for streaming workloads. Do you have any such use case?
For sources like Kafka, it certainly doesn't make sense at all.
The only use case I can think of is table maintenance activity, maybe running it in run_once mode.
Could you confirm if DeltaStreamer supports "insert_overwrite"? If not, I'm interested in understanding why. The reason for this inquiry is that in scenarios where I'm utilizing SQLSource and need to rectify an entire partition from which I'm reading, I would prefer to use "insert_overwrite" as it facilitates index lookup, akin to what "upsert" would accomplish. Ideally, having support for "insert_overwrite" in DeltaStreamer would prove immensely beneficial.
Adding insert_overwrite can also help build a gold zone: I can read data from multiple Hudi tables and insert_overwrite into gold aggregated tables.
@soumilshah1995 This makes sense. Created a JIRA to track this as well - https://issues.apache.org/jira/browse/HUDI-7558
As Sudha suggested, can you also send a mail to the dev list and point to the conversation here? It would be good to hear thoughts on this from others.
Roger that
Do you want me to close this?
I'll send an email to dev@hudi.apache.org and then close this thread.
Hello,
I recently created a video tutorial on backfilling with Hudi, and during my experimentation, I encountered a challenge regarding the insert_overwrite method while using the Delta Streamer. I've been primarily working with PySpark for these tasks.
Video Link: Video on Backfilling with Hudi
Code Base: GitHub Repository
In my workflow, I intended to perform an insert_overwrite on an entire partition. However, when attempting to execute the insert_overwrite job, I encountered an error indicating that the method was not found.
I have successfully executed similar tasks using PySpark. My question is: does Delta Streamer support the insert_overwrite operation? If it does not currently support this operation, I would like to request adding this feature to enhance the functionality of Delta Streamer.
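For context, the PySpark route mentioned above can be sketched roughly as follows. This is only an illustration, not the code from the video or repository: the table name and field names in the usage note are hypothetical, and the helper functions are made up for this example. The `hoodie.*` keys are the standard Hudi Spark datasource write options, with the write operation set to `insert_overwrite`.

```python
from typing import Any, Dict


def insert_overwrite_options(
    table_name: str,
    record_key: str,
    partition_field: str,
    precombine_field: str,
) -> Dict[str, str]:
    """Build Hudi datasource write options for an insert_overwrite write.

    All argument values are supplied by the caller; nothing here is
    specific to the table discussed in this thread.
    """
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
        # Replace the partitions present in the incoming data
        # instead of merging into them.
        "hoodie.datasource.write.operation": "insert_overwrite",
    }


def overwrite_partitions(df: Any, base_path: str, options: Dict[str, str]) -> None:
    """Write a Spark DataFrame to a Hudi table using the given options.

    `df` is assumed to be a pyspark.sql.DataFrame created in a Spark
    session that has the Hudi Spark bundle on its classpath.
    """
    df.write.format("hudi").options(**options).mode("append").save(base_path)
```

With a DataFrame `df` holding the corrected rows for one partition, a call might look like `overwrite_partitions(df, base_path, insert_overwrite_options("orders", "order_id", "order_date", "updated_at"))`, where `base_path` is the table location. The save mode stays `append` because the overwrite behaviour comes from the Hudi operation, not from the Spark save mode.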
I believe that supporting insert_overwrite would greatly benefit users who rely on Delta Streamer for data backfilling and other data management tasks.
Thank you for your attention to this matter. I look forward to hearing from the community regarding the feasibility of adding this feature.
Slack thread: https://apache-hudi.slack.com/archives/C4D716NPQ/p1710788154289249