databrickslabs / tempo

API for manipulating time series on top of Apache Spark: lagged time values, rolling statistics (mean, avg, sum, count, etc), AS OF joins, downsampling, and interpolation
https://pypi.org/project/dbl-tempo

withRangeStats not recognizing Timestamp field #397

Closed justinkolpak-databricks closed 4 months ago

justinkolpak-databricks commented 7 months ago

Issue: When a Timestamp column is specified as the ts_col for a TSDF, it is not correctly recognized as a Timestamp field by the logic that handles rangeBackWindowSecs.

Root Cause: In tsdf.py, in def withRangeStats(), the following check never evaluates to True because the string representation of the dataType is TimestampType() (with parentheses), while the code compares against TimestampType. The line causing the error is 1105:

```python
if str(self.df.schema[self.ts_col].dataType) == "TimestampType":
```
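The pitfall can be reproduced without Spark. Below is a minimal sketch using a stand-in class whose repr mimics PySpark's TimestampType (an assumption; the real class lives in pyspark.sql.types). Comparing str(dataType) against a hard-coded name is fragile because the repr includes parentheses; an isinstance check is robust to repr changes:

```python
# Stand-in for pyspark.sql.types.TimestampType (assumption: same repr behavior,
# i.e. str() of an instance yields "TimestampType()" with parentheses).
class TimestampType:
    def __repr__(self):
        return "TimestampType()"

dt = TimestampType()

# Buggy check from tsdf.py: str() of the instance includes the parentheses,
# so this comparison is always False.
assert str(dt) != "TimestampType"
assert str(dt) == "TimestampType()"

# Robust alternative: compare by type rather than by string representation.
assert isinstance(dt, TimestampType)
```

This is why the Timestamp branch is silently skipped and the ts_col falls through to the BIGINT-based range frame, producing the type-mismatch error below.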

Setup:

```python
tsdf = tempo.TSDF(df, ts_col='<timestamp_column>')
tsdf_2 = tsdf.withRangeStats("SIDE", rangeBackWindowSecs=300).df
```

Error:

```
Cannot resolve "(PARTITION BY <partition_col> ORDER BY DATE_TIME ASC NULLS FIRST RANGE BETWEEN -300 FOLLOWING AND CURRENT ROW)" due to data type mismatch: The data type "TIMESTAMP" used in the order specification does not match the data type "BIGINT" which is used in the range frame. SQLSTATE: 42K09;
```

tnixon commented 7 months ago

How very annoying! This is definitely not the right way to test the dataType anyway. I'll see about updating this ASAP!

justinkolpak-databricks commented 7 months ago

Thanks!

tnixon commented 4 months ago

Closed as per PR #400