kjw03 opened this issue 3 years ago
Hi, thanks for the question! We do not yet support structured streaming, but do you have a use case you're interested in? For example, streaming AS OF joins?
On Mon, Jun 7, 2021 at 11:37 PM kjw03 @.***> wrote:
Hi team, I just watched a talk from Ricardo and Tristan on Mar 16, 2021. At the end there was a question from the audience about support for streaming. It seemed the answer was not at this time, and Tempo was primarily intended for batch use cases at the time. Does Tempo now support structured streaming? If not, is it on the road map?
--
Ricardo Portilla
Lead Solutions Architect
Databricks Inc.
databricks.com
Hi Ricardo,
I'm looking at financial data rather than, for example, sensor data, so I'm not receiving data at regular intervals. However, I need to provide summaries and features at regular intervals. My first cut was to create a time_ticks DataFrame as shown below.
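(The original snippet did not survive in this thread. As a rough reconstruction, generating regularly spaced "time tick" timestamps amounts to something like the following; in the actual code this would presumably be a Spark DataFrame, e.g. built with `sequence()` + `explode()`, and the function and variable names here are illustrative only.)

```python
# Hypothetical sketch -- the real time_ticks snippet is missing from the thread.
from datetime import datetime, timedelta

def make_time_ticks(start, end, cadence_seconds):
    """Return tick timestamps from start to end (inclusive) at a fixed cadence."""
    ticks = []
    t = start
    while t <= end:
        ticks.append(t)
        t += timedelta(seconds=cadence_seconds)
    return ticks

# One tick per minute over a five-minute span -> six ticks
ticks = make_time_ticks(datetime(2021, 6, 1, 0, 0),
                        datetime(2021, 6, 1, 0, 5), 60)
```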
And then join the financial data in the analysis time window for each time tick as shown below.
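(This join code is also missing from the thread. A naive Python equivalent of its semantics, under the assumption that each tick is paired with every event in its trailing analysis window, would look like the sketch below; in Spark this becomes a non-equi range join between time_ticks and the trades data, which is why it performs so poorly at scale. Field names are illustrative.)

```python
# Hypothetical sketch of the tick-to-events range join described above.
from datetime import datetime, timedelta

def window_join(ticks, events, window_seconds):
    """For each tick, collect events with tick - window < event_timestamp <= tick."""
    w = timedelta(seconds=window_seconds)
    return {
        tick: [e for e in events if tick - w < e["event_timestamp"] <= tick]
        for tick in ticks
    }

events = [
    {"event_timestamp": datetime(2021, 6, 1, 0, 0, 30), "price": 10.0},
    {"event_timestamp": datetime(2021, 6, 1, 0, 1, 45), "price": 11.0},
]
ticks = [datetime(2021, 6, 1, 0, 1), datetime(2021, 6, 1, 0, 2)]
joined = window_join(ticks, events, window_seconds=60)
```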
As you may have guessed, this approach has terrible performance. To improve it, I thought I would (with trades data as an example):
1) From the irregularly spaced trades data, create lookback vectors for the event_timestamp and feature columns (e.g., price, volume, taker_side) using a RANGE frame over event_timestamp sized to the selected analysis time window
2) Create a new DataFrame with the selectively spaced time ticks of interest and do an AS OF-type join to get the columns holding the lookback vectors
3) Run my aggregations on the lookback arrays, in each case only processing those elements that are within analysis_time_window_seconds of the time tick, rather than of the original event I used to construct the vector
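(A minimal sketch of those three steps in plain Python, under the assumption that in Spark step 1 would use a RANGE window frame with collect_list and step 2 would use Tempo's asofJoin; the function and field names here are hypothetical.)

```python
# Hypothetical sketch of steps 1-3 above, in plain Python.
from bisect import bisect_right
from datetime import datetime, timedelta

def lookback_features(ticks, events, window_seconds):
    events = sorted(events, key=lambda e: e["ts"])
    ts_list = [e["ts"] for e in events]
    w = timedelta(seconds=window_seconds)
    # Step 1: per-event trailing lookback vector (RANGE-frame equivalent)
    for i, e in enumerate(events):
        e["lookback"] = [(x["ts"], x["price"])
                         for x in events[: i + 1] if e["ts"] - w <= x["ts"]]
    out = {}
    for tick in ticks:
        # Step 2: AS OF join -- latest event at or before the tick
        i = bisect_right(ts_list, tick) - 1
        if i < 0:
            out[tick] = []
            continue
        # Step 3: keep only elements still inside the window measured from
        # the tick (not from the event), then aggregate as needed
        out[tick] = [p for ts, p in events[i]["lookback"] if tick - w <= ts]
    return out

events = [
    {"ts": datetime(2021, 6, 1, 0, 0, 30), "price": 10.0},
    {"ts": datetime(2021, 6, 1, 0, 1, 45), "price": 11.0},
    {"ts": datetime(2021, 6, 1, 0, 2, 10), "price": 12.0},
]
ticks = [datetime(2021, 6, 1, 0, 2), datetime(2021, 6, 1, 0, 3)]
feats = lookback_features(ticks, events, window_seconds=120)
```

Note that step 3's re-filter matters because the vector is built relative to the event's own timestamp, so it can contain elements that are too old relative to the tick.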
This led me to Tempo, which I was pleasantly surprised to find. It seems two requests follow from here in order to support the use case described above.
1) Can we adjust Tempo to account not just for regularly spaced/sampled time series data like you might see in an IoT ecosystem, but also support creating statistics on a specified cadence using overlapping analysis windows of a specified size?
2) Can we extend Tempo to support structured streaming?
I'll keep looking into this, but I'm certainly curious what your thoughts are on how well aligned this type of work is with your vision for the library.
Just to clarify, in the above use case, I'm building a continuous application.
Thanks for the detailed explanation! We can help with this. Would you mind sending over your email so I can set up a meeting?
@Sonali-guleria, just for reference.
@Sonali-guleria @tnixon Is there any progress here?
When using resample and interpolate inside the foreachBatch function, I get an error message (see issue https://github.com/databrickslabs/tempo/issues/192#issuecomment-1160196626).
Using the AS OF join with streams also leads to an error message:
AttributeError Traceback (most recent call last)
<command-3872531940924524> in <module>
----> 1 stream_stream_asof_join()
<command-2772754795498460> in stream_stream_asof_join()
12 ts_col="logging_timestamp", partition_cols = ["logger_trip_counter"])
13
---> 14 joined_df = can_tsdf.asofJoin(analog_tsdf,
15 left_prefix="can",
16 right_prefix="analog").df
/local_disk0/.ephemeral_nfs/envs/pythonEnv-bff3f78d-b0ef-4734-b3ca-e83e3c63113f/lib/python3.8/site-packages/tempo/tsdf.py in asofJoin(self, right_tsdf, left_prefix, right_prefix, tsPartitionVal, fraction, skipNulls, sql_join_opt, suppress_null_warning)
388
389 spark = (SparkSession.builder.getOrCreate())
--> 390 left_bytes = self.__getBytesFromPlan(left_df, spark)
391 right_bytes = self.__getBytesFromPlan(right_df, spark)
392
/local_disk0/.ephemeral_nfs/envs/pythonEnv-bff3f78d-b0ef-4734-b3ca-e83e3c63113f/lib/python3.8/site-packages/tempo/tsdf.py in __getBytesFromPlan(self, df, spark)
348 import re
349
--> 350 result = re.search(r"sizeInBytes=.*(['\)])", plan, re.MULTILINE).group(0).replace(")", "")
351 size = result.split("=")[1].split(" ")[0]
352 units = result.split("=")[1].split(" ")[1]
AttributeError: 'NoneType' object has no attribute 'group'
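For what it's worth, the failure mode in that traceback is easy to reproduce in isolation: __getBytesFromPlan searches the query plan text for a sizeInBytes= estimate, and when the plan doesn't contain one (as appears to be the case for a streaming plan), re.search returns None and the subsequent .group(0) call raises exactly this AttributeError. The plan string below is a made-up stand-in, not real Spark output:

```python
import re

# A plan string with no size estimate, standing in for a streaming query plan
plan = "== Optimized Logical Plan ==\nStreamingRelation rate, [timestamp, value]"

# Same pattern tempo uses in __getBytesFromPlan
match = re.search(r"sizeInBytes=.*(['\)])", plan, re.MULTILINE)
# match is None because the plan text contains no sizeInBytes=
try:
    match.group(0)
except AttributeError as err:
    message = str(err)  # 'NoneType' object has no attribute 'group'
```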
Hi @sim-san: yes, we will be releasing streaming support for Tempo soon.
Hi, I had a call with our Databricks account team and SA today about our time series use case. We need to do time series 'temporal' joins for a big financial services use case, and since we are also big DLT users, we're looking to see how this can be done natively in DLT.
He pointed us at this repo as the best place to start. He also mentioned this issue: making it available in structured streaming (soon) and, by association, making it possible in DLT (when upgraded to 13.2) in some incremental mode.
What is the current status of this release, please? We'd also love to see a non-trivial example in DLT, as that would be super. Our use case joins 4 streams: 2 faster-moving intraday (every few hours), the other 2 slowly changing reference data (once a week). Is there a tentative release date for this issue? Thanks heaps!