apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

Question, Add Support to Hudi datasets to spark structured streaming #1839

Closed rubenssoto closed 4 years ago

rubenssoto commented 4 years ago

Hi guys, how are you?

I have some use cases where I want to read from a Hudi dataset using Structured Streaming and write to another, grouped Hudi dataset. As a real-world example, I have a raw zone in my data lake and want to stream from the raw zone to the curated zone, but sometimes my curated Hudi dataset is grouped.

Spark Structured Streaming doesn't work with Hudi datasets as sources, so to make this use case work I need to treat the Hudi dataset like a normal Parquet dataset. But Hudi rewrites data on every write, and the new file contains the old data plus the new data. If my sink isn't grouped, that's only a deduplication problem, but my sink is grouped, so it isn't going to work.

I have no guarantee that all the data for a group is in the new file that Hudi writes.

I use PySpark to write my streaming jobs because it's easier for my team, so I think DeltaStreamer is not an option.

Do you have any idea how to solve this? And do you have plans to support Hudi datasets as a Spark streaming source?

Delta Lake has solved this problem with the ignoreChanges option: https://docs.databricks.com/delta/delta-streaming.html

vinothchandar commented 4 years ago

@rubenssoto Yes, we already support incremental queries using the Spark datasource. It seems like the only thing missing here is that you want the Spark Structured Streaming integration? (We can add that after 0.6.0.) https://hudi.apache.org/docs/querying_data.html#spark-incr-query
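For readers following along in PySpark, here is a minimal sketch of what such an incremental query could look like. The helper names `incremental_read_options` and `read_incremental` are made up for illustration, and the option keys are the ones documented for recent Hudi releases (older releases used different key names), so check them against the docs for your version.

```python
def incremental_read_options(begin_instant, end_instant=None):
    """Build Spark datasource options for a Hudi incremental query.

    Assumption: option keys as documented for recent Hudi releases.
    """
    opts = {
        "hoodie.datasource.query.type": "incremental",
        # Only records from commits after this instant are returned.
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }
    if end_instant is not None:
        # Optional upper bound on the commit range.
        opts["hoodie.datasource.read.end.instanttime"] = end_instant
    return opts


def read_incremental(spark, base_path, begin_instant, end_instant=None):
    """Return a DataFrame of records changed after begin_instant.

    `spark` is an existing SparkSession with the Hudi bundle on its
    classpath; `base_path` is the base path of the source Hudi dataset.
    """
    options = incremental_read_options(begin_instant, end_instant)
    return spark.read.format("org.apache.hudi").options(**options).load(base_path)
```

Usage would be something like `read_incremental(spark, raw_zone_path, "000")`, where `raw_zone_path` is a placeholder and `"000"` is the conventional "from the beginning" instant used in the Hudi quick start.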

https://www.youtube.com/watch?v=1w3IpavhSWA actually talks about a production use-case we built using an incremental query + some grouping on the sink side. Unlike Delta, Hudi actually has record-level metadata around arrival times and thus does not need anything like ignoreChanges.

I am not sure if I am missing something around your use-case, but it feels like you should be able to get this working incrementally end to end with what we have today (again, we can add Spark streaming read support.. if there are hands to help.. cc @garyli1019? :))
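A rough sketch of one way to wire this up end to end in PySpark follows. It is not necessarily what the talk describes. Since a batch incremental query does not remember its own progress, the job stores the last processed commit time itself, then recomputes only the groups touched by new commits. The function names, the JSON checkpoint file, and the `count` aggregate are all illustrative assumptions; `write_opts` must be supplied by the caller with the Hudi write options for the sink table.

```python
import json
import os


def load_checkpoint(path, default="000"):
    """Return the last processed commit instant, or `default` on first run."""
    if not os.path.exists(path):
        return default
    with open(path) as f:
        return json.load(f)["last_instant"]


def save_checkpoint(path, instant):
    """Persist the instant via a temp file so a crash cannot truncate it."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"last_instant": instant}, f)
    os.replace(tmp, path)


def run_once(spark, source_path, sink_path, checkpoint_path, group_col, write_opts):
    """One incremental pass: pull changes, regroup affected keys, upsert."""
    begin = load_checkpoint(checkpoint_path)
    changed = (
        spark.read.format("org.apache.hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.datasource.read.begin.instanttime", begin)
        .load(source_path)
    )
    # _hoodie_commit_time is the record-level metadata column Hudi maintains.
    latest = changed.agg({"_hoodie_commit_time": "max"}).first()[0]
    if latest is None:
        return begin  # nothing new since the last checkpoint
    # Recompute whole groups from the current snapshot, so each aggregate
    # sees old and new rows together. (Older Hudi releases may need a
    # partition glob appended to source_path for snapshot reads.)
    keys = changed.select(group_col).distinct()
    snapshot = spark.read.format("org.apache.hudi").load(source_path)
    grouped = snapshot.join(keys, on=group_col, how="inner").groupBy(group_col).count()
    grouped.write.format("org.apache.hudi").options(**write_opts).mode("append").save(sink_path)
    save_checkpoint(checkpoint_path, latest)
    return latest
```

Because the sink write is an upsert keyed on the group, re-running a pass after a failure should simply overwrite the same aggregates, which is why the checkpoint is saved only after the write succeeds.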

rubenssoto commented 4 years ago

Hi Vinoth, thank you for your answer.

I will watch your video. An incremental query will probably help me for now, but we want to use Spark Structured Streaming as the default for all our datasets, since it takes care of checkpointing and things like that.

If you could add Spark Structured Streaming integration in a future version, that would be great.

Thank you! :)

garyli1019 commented 4 years ago

This is an interesting feature. I created a ticket to track this. https://issues.apache.org/jira/browse/HUDI-1114.

bvaradar commented 4 years ago

Closing this ticket in favor of the Jira to track the feature request.