confluentinc / ksql

The database purpose-built for stream processing applications.
https://ksqldb.io
Other
41 stars 1.04k forks source link

Force emit single-event session in Session window with "emit final" after grace period expires #9760

Open lightsailpro opened 1 year ago

lightsailpro commented 1 year ago

Is your feature request related to a problem? Please describe.

I have events that I need to sessionize using ksql "window session (15 minutes, grace period 10 minutes)" with "emit final". For those user sessions with only a single event, in version 0.28 of ksql, the complete session will not emit even after the grace period expires. It seems that ksql relies on a new event that belongs to another session to trigger the emit of previous session. But in reality, next session can be days later or does not exist at all.

Describe the solution you'd like

After grace period expires, for window session using "emit final" option, all sessions should be pushed out even those without new events.

Describe alternatives you've considered

"emit change" will emit intermediate result. But It is hard to handle the complete sessions downstream due to the volume. So "emit change" for "session window" use case is not very practical.

Additional context

TheDarkFlame commented 1 year ago

Thought this was related to an issue I had, but its likely a limitation of how things work with event-streaming.

In short. Ksqldb doesn't have a way to know time has passed other than the timestamp of the last message (because wall-time isn't always related to stream time, your stream might be working with data 20 minutes ago. This is also explained in the documentation In short, you're looking at event-time, and wanting it to progress with the ingestion-time.

One workaround that's popularly used is to generate regular heartbeat events with the sole purpose of progressing your stream.

Other libraries, eg flink, use the concept of a watermark to trigger processing of windows, allowing you to create watermarks which progress your stream without actually having events in your stream, however to my knowledge, no such thing exists in ksqldb. For further reading, the flink documentation covers this topic in a fair amount of detail

Still curious to see what the ksqldb team has to say about this, given that this is something I'm also going to have to deal with in some of my projects.

suhas-satish commented 1 year ago

@TheDarkFlame is right. This is not currently on the ksqlDB road map

pri-naik5 commented 1 year ago

This is a very helpful feature to have with EMIT FINAL especially since the windows will be grouped by some keys and it is not possible to generate heartbeat messages especially with kafka-connect ingested data

mmbfreitas commented 1 year ago

+1

BPerry24 commented 1 year ago

@lightsailpro Did you ever find a workaround here?

My use-case is very similar in that the next session, or window in general, could be days later or never even occur at all.

IMHO, the docs are really confusing because they make it seem like this is possible here, specifically:

However, some applications need to take action only on the final result of a windowed computation. Common examples include sending alerts ...

and

ksqlDB offers a clean way to define this logic: after defining your windowed aggregation, you can suppress the intermediate results, emitting the final count for each user when the window is closed.

It even says "when the window is closed", and makes no mention that a new window needs to be opened for the emission to occur.

How could I use this to "send alerts" as the documentation implies, if the next event with the same key doesn't occur regularly or at all?

Maybe this is "common knowledge" for those used to working in streaming applications, but not so much for people newer to the space, and it's not generally intuitive. Maybe the docs could be updated to reflect this nuance?

TheDarkFlame commented 1 year ago

Honestly your best bet would be to emit heartbeat events into the partitions in question.

I agree that this is somewhat specific to streaming data, and when you start thinking about the way kafka handles time (time is not set by the machines in the kafka cluster). It offers distinct benefits, namely that data can be replayed, processed late, and out of order, but will work... It becomes more of a feature for most use cases.

When you do need something external to the events themselves to drive the timing of the topic, you'd typically look to something like heartbeat events, although that does rely on kafka and your heartbeat application itself functioning as expected. Or you could go to an application/framework such as flink, which has a timing concept which is external to kafka. (in flink this is the watermark). Typically this would then mean processing events using the processing timestamp instead of event timestamp (which is usually fine, but given an edge case of your events being delayed by like 2 hours or whatever, the behaviour is a little odd).

In an application I had done for some practice, I had attempted to implement a watermark generator which used the timestamp of the last event as its base, and then incremented the generator timestamp based on the 'wall time' of the application.

All this is to say. You're unlikely to get this feature ever in ksqldb, because it is somewhat the anti-thesis of the design of ksqldb.

Finally, I agree that this might be better elaborated in the documentation, as well as perhaps an example of an alternative.

BPerry24 commented 1 year ago

@TheDarkFlame Thank you for the insight! I'll look into the potential workarounds you mentioned.