databrickslabs / tempo

API for manipulating time series on top of Apache Spark: lagged time values, rolling statistics (mean, avg, sum, count, etc), AS OF joins, downsampling, and interpolation
https://pypi.org/project/dbl-tempo

nearest_of_join: Nearest timestamp join between two time-series tables #399

Open rotomer opened 7 months ago

rotomer commented 7 months ago

Motivation

It is often necessary to merge two time-series tables on the closest timestamp rather than with an AS OF join. This arises, for instance, when the data comes from two sensors that operate simultaneously and transmit at the same time interval. We would like to use Tempo to join the data so that points are matched by their smallest time delta, because in this case there is no guarantee that the timestamps of table A will always precede the timestamps of table B (or vice versa).

Example

Table A

event_ts  a_data
10        x
21        y
29        z

Table B

event_ts  b_data
10        i
20        ii
31        iii

table_a.nearest_of_join(table_b)

event_ts  a_data  b_data
10        x       i
20        y       ii
31        z       iii
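The matching rule in the example above can be sketched in plain Python. This is only an illustration of the intended semantics, not Tempo code: the function name nearest_join and the tuple-based row layout are hypothetical, and a real implementation would operate on Spark DataFrames. The sketch keeps both timestamps in the output so that it does not assume which side's event_ts the result should carry, and it assumes that ties are resolved in favour of the earlier row of table B.

```python
from bisect import bisect_left


def nearest_join(left, right):
    """Match every left row to the right row with the smallest time delta.

    Hypothetical helper for illustration only (not part of Tempo).
    Both inputs are lists of (event_ts, value) tuples.
    Returns a list of (left_ts, left_value, right_ts, right_value) tuples.
    """
    right_sorted = sorted(right, key=lambda row: row[0])
    right_ts = [row[0] for row in right_sorted]
    joined = []
    for ts, value in sorted(left, key=lambda row: row[0]):
        # Index of the first right row whose timestamp is >= ts.
        i = bisect_left(right_ts, ts)
        # Only the neighbours just before and at/after ts can be the nearest.
        candidates = right_sorted[max(i - 1, 0): i + 1]
        if not candidates:
            # The right table is empty: keep the left row unmatched.
            joined.append((ts, value, None, None))
            continue
        # min() returns the first minimum, so ties go to the earlier right row
        # (an assumption; the issue does not specify tie-breaking).
        nearest_ts, nearest_value = min(candidates, key=lambda row: abs(row[0] - ts))
        joined.append((ts, value, nearest_ts, nearest_value))
    return joined


table_a = [(10, "x"), (21, "y"), (29, "z")]
table_b = [(10, "i"), (20, "ii"), (31, "iii")]

result = nearest_join(table_a, table_b)
for row in result:
    print(row)
```

With the tables from the example, x pairs with i, y with ii, and z with iii, which is the same pairing as in the expected output above.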

Edge cases and considerations