Open hazelnut-99 opened 4 months ago
Thanks for reporting. There are at least four possible definitions for "current time" in arroyo
Currently arroyo just implements the first one (via now()
), but others would be useful as well. The second one is easy to get in a UDF today, but the other two would need to be built in.
In Flink, now()
is the current clock time (2) in streaming mode, but query start time (1) in batch mode. We might want to adopt that behavior, but I think the current version of now() is also useful in some cases and is worth preserving. Adding functions for 3 and 4 would also be great (in Flink 4 is CURRENT_WATERMARK
which seems like a fine name).
Writing new functions is pretty straightforward so this could be a great first issue for someone.
I'm interested in obtaining the current event time of a row (i.e. 3 above). Tried to look quickly how this could be implemented, but couldn't find a similar function?
(~I actually was not able to find the implementation of any of the standard functions in the source code~ I just found extract_json_string
)
@mwylde - can you give me some guidance, where to look at?
Can something like row_time()
be implemented as an UDF? Is there a UDF type that has access to the row "attributes" like the event_time?
Or should the _timestamp
field, I see in the source code, just be somehow made accessible over SQL?
Hey @kpe — happy to help walk through how you'd implement this. row_time() (i.e., 3 in the list above) can't be implemented as a UDF. This is because the expression trees are planned from SQL with DataFusion, which doesn't know about the _timestamp field as its not actually part of the SQL schema. We add it in as part of various rewrites of the plan (you can find the various calls to add_timestamp_field) throughout arroyo-planner).
So instead we would need to define a placeholder UDF, like hop/tumble/session:
This would then need to be rewritten (by traversing through the logical plan and all expressions) into an expression that gets the _timestamp field.
Hi! I am trying to add a processing time timestamp to each row with the following query:
However, it seems that the now() function is executed only once, resulting in a static current_ts field that does not update over time.
Are there any keywords or functions that can generate a real-time timestamp for each record in the output?