Closed scotts closed 4 years ago
Maybe the correct approach is the ability to have a TWindow where the contents are defined as a function (somehow), so it's application defined what is in the window. That's been a customer request in SPL as well, but might be easier to do here.
One option is to drop into the more complex Java primitive operator mode, so implement a Java primitive operator that performs the join, and then invoke it from the topology. Though that might have to wait until multiple ports are supported.
I was thinking it would make sense to have something like `TStream<T>.last(Predicate<T> pred)` to support some notion of delta-based windows. But that does not actually solve my current problem, because I need to define the contents for window A based on the tuples seen on window B. In order to do what I want to do, I need to have access to the tuples and windows for both A and B at the same time.
I had not considered a Java primitive operator. I did not look in detail at how to integrate one, but I did look a bit, and I could see that being a workaround. That also opens up using an SPL operator as a workaround, though. An unspoken rule I was adhering to was only using the Java Topology API.
What I think I want is something like `TStream<T>.join(TStream<U> other)`.

In LogWatch, I need to implement a deterministic join. That is, I want the results of my join to be independent of the tuple rates on either side of the join. The windows I need to maintain are also dependent not just on the data in that window, but on the data I see on the other stream. I cannot express this with the window clause in SPL. But, in SPL, because I can have arbitrary streams converge on different ports of an operator, I can implement my own ad-hoc windows and implement a deterministic join.
The SPL code for this is here: https://github.com/scotts/streamsx.demo.logwatch/blob/master/language/com.ibm.streamsx.demo.logwatch.language/DeterministicJoin.spl
What I'm doing there is not obvious. I'm maintaining partitioned windows for each stream (which is why I have a `map` that goes from an `rstring` to a `list` of tuples). When I receive a tuple on the `RealTime` side, I check to see if there is a match in `logins`, which is the window for the `Successes` side. If there is no match, then I add the `RealTime` tuple to `suspects`. When I receive a tuple on the `Successes` side, I check to see if there is a match in `suspects`. If there is no match, then I add the `Successes` tuple to `logins`, so that it can be matched when/if I receive a tuple on the `RealTime` side.

Note that I evict tuples from the window for the `RealTime` side based on tuples seen on the `Successes` stream, which is something I cannot express using SPL window clauses.

I tried implementing this in the Java API, and I thought I was able to do it, but @hildrum and I just talked it over at length and convinced ourselves it does not work. My attempt is here: https://github.com/scotts/streamsx.demo.logwatch/blob/master/topology/src/streamsx/demo/logwatch/topology/DeterministicJoin.java
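To make the two hand-managed windows described above concrete, here is a minimal plain-Java sketch. It is not the linked SPL or Topology code and uses no Streams API. The class name, the use of `String` for tuples, and in particular what happens on a match (emit one result per buffered tuple and evict the matched entries) are my assumptions for illustration; the linked SPL operator is the authority on the real behaviour.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only: two partitioned windows kept by hand. */
public class DeterministicJoinSketch {
    // Partitioned windows: partition key -> buffered tuples (plain Strings here).
    private final Map<String, List<String>> logins = new HashMap<>();   // Successes side
    private final Map<String, List<String>> suspects = new HashMap<>(); // RealTime side

    /** A tuple arrives on the RealTime side: look in logins, else buffer in suspects. */
    public List<String> onRealTime(String key, String tuple) {
        return match(key, tuple, logins, suspects, true);
    }

    /** A tuple arrives on the Successes side: look in suspects, else buffer in logins. */
    public List<String> onSuccess(String key, String tuple) {
        return match(key, tuple, suspects, logins, false);
    }

    private static List<String> match(String key, String tuple,
                                      Map<String, List<String>> other,
                                      Map<String, List<String>> own,
                                      boolean arrivedOnRealTime) {
        List<String> results = new ArrayList<>();
        // Assumption: matched tuples are evicted from the other side's window.
        List<String> matched = other.remove(key);
        if (matched == null) {
            // No match yet: remember this tuple so the other side can find it later.
            own.computeIfAbsent(key, k -> new ArrayList<>()).add(tuple);
            return results;
        }
        for (String m : matched) {
            // Results are always formatted as realTime|success, whichever arrived first.
            results.add(arrivedOnRealTime ? tuple + "|" + m : m + "|" + tuple);
        }
        return results;
    }
}
```

The point of the sketch is that both handlers read and write both maps, so the same pair is produced whichever side's tuple arrives first.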
The technique that we came up with was to do a `TStream.join` on both streams, and call `TStream.last` on the other stream. That way we had mirrored joins, and we did a `TStream.union` on the mirrored joins. That produces the correct results when I execute it, but we don't think it's guaranteed to do so.

We thoroughly confused ourselves (well, at least I was confused) by trying to reason about when tuples were received, and when exactly we would see tuples on the `TStream.last()` window.

What convinced me is that in the Java code, when I receive a `failure` tuple, I look for matches on the `success` side, but then I never store the `failure` tuple. And I can't, because the logic that needs to see that `failure` tuple lives in a completely different `BiFunction`, in a completely different join. Because I don't store that tuple, it's possible for me to miss matches.

My conclusion, then, is that I can't implement a deterministic join where I maintain my own windows. I think that an interface like `TStream<T>.join(TStream<U> other)` would allow me to do that. I'm not entirely sure how it would work - at first it seems like it would be okay for users to provide a `BiFunction` as before, but the problem is that interface does not provide a way for the user to know which side received the tuple. We could ask users to provide two `BiFunction`s, one for each side, but that may be going too much towards the SPL side. (Since they're essentially `onTuple Left` and `onTuple Right`.)

I know this may not be a high priority at the moment, as we're focusing on making the simple things simple, and this is not a simple thing.
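For discussion, here is one rough shape the "two `BiFunction`s, one for each side" idea could take. Everything in it is hypothetical: `TwoSidedJoinSketch`, its type parameters, and the shared-state argument are names I made up, and none of it exists in the Java Topology API. It is plain Java, and only meant to show that the two handlers need to share state so each side can store tuples the other side will look up.

```java
import java.util.List;
import java.util.function.BiFunction;

/**
 * Hypothetical shape only, not an existing API.
 * T = left tuple type, U = right tuple type, S = shared window state, R = result type.
 */
public class TwoSidedJoinSketch<T, U, S, R> {
    private final S state;                          // visible to both handlers
    private final BiFunction<T, S, List<R>> onLeft;  // called for tuples on the left stream
    private final BiFunction<U, S, List<R>> onRight; // called for tuples on the right stream

    public TwoSidedJoinSketch(S state,
                              BiFunction<T, S, List<R>> onLeft,
                              BiFunction<U, S, List<R>> onRight) {
        this.state = state;
        this.onLeft = onLeft;
        this.onRight = onRight;
    }

    /** Deliver a left-side tuple; the handler may read and update the shared state. */
    public List<R> left(T tuple) {
        return onLeft.apply(tuple, state);
    }

    /** Deliver a right-side tuple; the handler may read and update the shared state. */
    public List<R> right(U tuple) {
        return onRight.apply(tuple, state);
    }
}
```

Because each handler is tied to one side, the user always knows which stream a tuple came from, and because both receive the same state object, a `failure` tuple stored by one handler is visible to the other.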