databrickslabs / tempo

API for manipulating time series on top of Apache Spark: lagged time values, rolling statistics (mean, avg, sum, count, etc), AS OF joins, downsampling, and interpolation
https://pypi.org/project/dbl-tempo

tsdf.show() or display(tsdf) only displays the first 5 rows #326

Open nikhilmakan02 opened 1 year ago

nikhilmakan02 commented 1 year ago

Could somebody tell me if I am missing something here? I am using some really simple code in a Databricks notebook, with Tempo version 0.1.23.

%pip install dbl-tempo

from tempo import *
import pandas as pd

df = pd.DataFrame({'ts': pd.date_range(start='2018-04-24 00:00:00', end='2018-04-25 12:00:00', freq='1H')}).assign(val=3)
dfs = spark.createDataFrame(df)
dfs.show()  # -----> This shows me 20 rows

test_tsdf = TSDF(dfs, ts_col="ts")
test_tsdf.show(20)  # -----> This only shows 5 rows.

display(test_tsdf)  # -----> This only shows 5 rows.

Any thoughts?

tnixon commented 1 year ago

Oh, that's strange. Yes, this is definitely inconsistent behavior @nikhilmakan02 - thanks for letting us know! I'll see if I can put together a quick fix for this so that the show and display methods are more consistent with the standard Spark DF show method.

nikhilmakan02 commented 1 year ago

Thanks @tnixon. I have renamed the issue to describe the problem more accurately, as I initially thought it was the conversion from a Spark DataFrame to a TSDF that returned only 5 rows.

The workaround is to call '.df.show()' on the TSDF object, which works fine.

Thanks.
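
For anyone landing here with the same problem, the workaround above can be wrapped in a small helper. This is only a sketch: `show_underlying` is a hypothetical name, not part of tempo, and it assumes the TSDF object exposes the wrapped Spark DataFrame as `.df`, which is what the workaround relies on.

```python
def show_underlying(tsdf, n=20, truncate=True):
    """Print the first `n` rows of the DataFrame wrapped by a TSDF-like object.

    Hypothetical helper, not part of tempo. Assumes `tsdf.df` is the
    underlying Spark DataFrame, as used in the workaround in this thread.
    """
    # Bypass the TSDF's own show() and call the standard Spark
    # DataFrame.show(n, truncate), which honours the requested row count.
    tsdf.df.show(n, truncate)
```

In the notebook from the original report this would be called as `show_underlying(test_tsdf, 20)`. Similarly, `display(test_tsdf.df)` should go through the plain Spark DataFrame instead of the TSDF wrapper.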

Melissari1997 commented 10 months ago

Hi @tnixon, I took a look at the repo for a personal project and I would like to try to contribute. Do you mind if I try to solve this problem?

tnixon commented 10 months ago

Hi @Melissari1997 - if you want to fix the issue, please go ahead. We welcome outside contributions. Just submit a PR and we'll take a look at it.