Closed bithw1 closed 5 years ago
Hi @bithw1 ,
As I understand it, the note says two things. The first is that you don't need to worry about watermarks. Watermarks exist to clean up the state kept in memory when two streaming sources are joined: one of them can arrive very late, and you probably don't want to accumulate several hours of data just to perform the join.
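To make the clean-up idea concrete, here is a minimal sketch in plain Python. It is not Spark's implementation: `BufferedSide`, `advance_watermark` and the numbers are hypothetical names and values chosen for illustration, and it assumes the watermark is simply the maximum event time seen minus an allowed lateness.

```python
from dataclasses import dataclass, field


@dataclass
class BufferedSide:
    """Hypothetical model of the rows one stream keeps in memory while waiting for a match."""
    rows: list = field(default_factory=list)  # (key, event_time) tuples

    def add(self, key, event_time):
        self.rows.append((key, event_time))

    def evict_older_than(self, watermark):
        # Keep only the rows whose event time is not behind the watermark.
        self.rows = [(k, t) for (k, t) in self.rows if t >= watermark]


def advance_watermark(max_event_time_seen, allowed_lateness):
    # Assumption: watermark = latest event time seen minus the tolerated delay.
    return max_event_time_seen - allowed_lateness


left = BufferedSide()
for key, t in [("a", 10), ("b", 20), ("c", 95), ("d", 100)]:
    left.add(key, t)

watermark = advance_watermark(max_event_time_seen=100, allowed_lateness=30)
left.evict_older_than(watermark)
print(watermark, left.rows)  # 70 [('c', 95), ('d', 100)]
```

Without the eviction step, the buffer would grow for as long as the query runs, which is the problem the watermark is meant to solve.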
A batch-streaming join is different because one side is a static dataset, which can never be late. The other side is a streaming dataset, which may be late, but not from the static Dataset's perspective.
The second point is that, since there is no state, the aggregation is computed only over the streaming window and not over the whole stream.
Long story short, a batch-stream join will (should) output the joined entries as soon as a match between the static and the streaming dataset is found, and only within the processing window. With stream-stream joins, data retention is controlled by the watermark.
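The "output as soon as a match is found" behavior can be sketched in plain Python as follows. This is only a simplified model, not Spark code: `STATIC_TABLE`, `join_micro_batch` and the sample rows are hypothetical, and each list stands in for one micro-batch of the stream.

```python
# Hypothetical static dataset: id -> name. It never changes and is never late.
STATIC_TABLE = {1: "alice", 2: "bob"}


def join_micro_batch(batch, static_table):
    """Inner-join one micro-batch against the static table.

    Nothing is remembered between calls: unmatched rows are simply dropped.
    """
    return [
        (row_id, value, static_table[row_id])
        for row_id, value in batch
        if row_id in static_table
    ]


# Each inner list models the rows that arrived in one micro-batch.
micro_batches = [
    [(1, "click"), (3, "view")],
    [(2, "click"), (1, "view")],
]

results = [join_micro_batch(batch, STATIC_TABLE) for batch in micro_batches]
for output in results:
    print(output)
# [(1, 'click', 'alice')]
# [(2, 'click', 'bob'), (1, 'view', 'alice')]
```

Because the function keeps no buffer of past rows, there is nothing for a watermark to clean up on this side of the join.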
That said, I'd like to see whether we can override this behavior and use watermarks on batch-streaming joins. Maybe I'll investigate it at the beginning of 2019.
Best regards, Bartosz.
Thanks @bartosz25 for the detailed explanation. Stream joins are still a bit difficult for me to fully understand.
Hi, @bartosz25
I am reading your posts about joins in Structured Streaming. In Spark's official documentation on stream-static joins, http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#stream-static-joins,
it says:
I don't understand what it means, so I wrote a simple test case that just runs
select t1.id, count(t2.id) from t1 join t2 on t1.id = t2.id group by t1.id
Full code is:
I think this is a stateful stream-static join (an aggregation after the join), so I don't understand what the above note means. Could you please take a look and explain? Thanks!