Time series for multiple stations - fill missing hours with NA?

jh-206 commented 8 months ago

When collecting data for multiple stations, I often get time series of different lengths. It seems that hours with missing data are just removed from the returned dataframe. This makes it incredibly difficult to line up observations in time for multiple spatial locations.

Is there an existing way to handle this in stations_timeseries? My desired behavior would be to return NaN for hours with missing data, and the timestamp would still correspond to a row of the dataframe.

Reproducible example:

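from datetime import datetime

from synoptic.services import stations_timeseries

params = dict(
    stid=["RFRC2", "JNSC2"],
    vars=["fuel_moisture"],
    start=datetime(2023, 6, 1),
    end=datetime(2023, 6, 30),
)

# With two station IDs, the result is indexed per station: a[0], a[1]
a = stations_timeseries(**params)

a[0].shape[0]  # 693 observations
a[1].shape[0]  # 696 observations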

blaylockbk commented 8 months ago

SynopticPy returns just the data it gets from the Synoptic API. I purposely don't do any data manipulation and leave it to the user to decide what to do with what's returned.

I think what you want to do is apply df.asfreq() to each of the returned dataframes. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.asfreq.html
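A rough sketch of the asfreq idea, using a small synthetic dataframe in place of real Synoptic output. It assumes each returned dataframe has a DatetimeIndex whose timestamps fall exactly on the hour. If the observations are reported at irregular minutes, they would need to be rounded or resampled first. The frame and its values below are made up for illustration.

```python
import pandas as pd

# Synthetic stand-in for one station's dataframe (not real Synoptic data):
# six hourly timestamps, with two of the hours then dropped.
full_index = pd.date_range("2023-06-01 00:00", periods=6, freq="h")
df = pd.DataFrame(
    {"fuel_moisture": [10.0, 10.5, 11.0, 11.5, 12.0, 12.5]},
    index=full_index,
)
df = df.drop(full_index[[2, 4]])  # simulate two missing hours

# asfreq rebuilds a regular hourly index between the first and last
# timestamp and inserts NaN rows for the hours that were absent.
filled = df.asfreq("h")

print(len(df), len(filled))
print(filled)
```

Note that asfreq only spans from the first to the last timestamp of each dataframe, so stations whose records start or end at different hours can still come back with different lengths.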

blaylockbk commented 8 months ago

Alternatively, you can do a merge or join to get each dataframe to have the same indexes (I don't know the syntax off the top of my head, but I have used this approach before).
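One possible way to write the join approach, sketched with two small synthetic dataframes standing in for the two stations. The values are invented, and the column suffixes are just an illustrative naming choice built from the station IDs in the example above. An outer join keeps every timestamp that appears in either dataframe and leaves NaN where a station has no observation.

```python
import pandas as pd

# Synthetic stand-ins for the two station dataframes (values are made up).
idx = pd.date_range("2023-06-01 00:00", periods=5, freq="h")
rfrc2 = pd.DataFrame({"fuel_moisture": [10.0, 11.0, 12.0, 13.0]}, index=idx.delete(1))
jnsc2 = pd.DataFrame({"fuel_moisture": [20.0, 21.0, 22.0]}, index=idx[:3])

# Outer join on the index: the result covers the union of both sets of
# timestamps; suffixes keep the identically named columns apart.
joined = rfrc2.join(jnsc2, how="outer", lsuffix="_RFRC2", rsuffix="_JNSC2")

print(joined)
```

An hour missing from every station would still be absent after the join, so this may need to be combined with asfreq or a reindex onto an explicit hourly range.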

jh-206 commented 8 months ago

Right, that makes sense if that's how Synoptic returns the data.

asfreq looks promising, thanks

blaylockbk / SynopticPy

Time series for multiple stations - fill missing hours with NA? #53