earthobservations / wetterdienst

Open weather data for humans.
https://wetterdienst.readthedocs.io/
MIT License
349 stars 54 forks source link

Is this the "best" way of getting the latest observations for all stations? #830

Closed guidocioni closed 1 year ago

guidocioni commented 1 year ago

Hiya, just went back to Wetterdienst today as I wanted to get the latest measured values for air temperature and all stations in the network.

I came up with the following snippet

stations = DwdObservationRequest(
    parameter=DwdObservationDataset.TEMPERATURE_AIR,
    resolution=Resolution.MINUTE_10,
    period=Period.NOW,
    start_date=pd.to_datetime("now", utc=True) - pd.to_timedelta("30 min"),
    end_date=pd.to_datetime("now", utc=True)
)

stations = stations.all().df
ids = stations.station_id.values

observations = DwdObservationRequest(
    parameter=DwdObservationDataset.TEMPERATURE_AIR,
    resolution=Resolution.MINUTE_10,
    period=Period.NOW
)

df = observations.filter_by_station_id(station_id=ids).values.all().df

df = df.merge(stations[['station_id','height','latitude','longitude','name','state']], left_on='station_id', right_on='station_id')
df = df.groupby("station_id").apply(lambda x: x.sort_values(by='date').tail(1)).drop(columns=['station_id']).reset_index()

Which gives as expected

station_id level_1 dataset date quality pressure_air_site temperature_air_mean_200 temperature_air_mean_005 humidity temperature_dew_point_mean_200 height latitude longitude name state
00044 41 temperature_air 2022-12-19 06:50:00+00:00 2.0 NaN 2.5 1.9 96.9 2.1 44.0 52.9336 8.2370 Großenkneten Niedersachsen

This, however, seems a little bit too complicated to only get the latest measured values. Is there a better and more concise way? :)

P.S. For some reason the table of parameters in https://github.com/earthobservations/wetterdienst/edit/main/docs/data/coverage/dwd/observation.rst does not render

gutzbenj commented 1 year ago

Dear @guidocioni ,

thanks for coming back to us!

I propose the following workflow:

import pandas as pd
import duckdb

from wetterdienst import Resolution, Period
from wetterdienst.provider.dwd.observation import DwdObservationRequest, DwdObservationDataset

stations = DwdObservationRequest(
    parameter=DwdObservationDataset.TEMPERATURE_AIR,
    resolution=Resolution.MINUTE_10,
    period=Period.NOW,
    start_date=pd.to_datetime("now", utc=True) - pd.to_timedelta("120 min"),
    end_date=pd.to_datetime("now", utc=True)
).all()

values = stations.values.all().df

sql = """
    SELECT
        *
    FROM 
        df 
    WHERE 
        value IS NOT NULL
    QUALIFY
        ROW_NUMBER() OVER(PARTITION BY station_id, parameter ORDER BY date DESC) = 1
"""

values_latest = duckdb.query_df(values, "df", sql).df()

Notes:

Cheers, Benjamin

PS: I'm once again surprise how fast and easy this query runs, although much of it is done in Python!

amotl commented 1 year ago

Wow. What's that? ;]

QUALIFY
    ROW_NUMBER() OVER(PARTITION BY station_id, parameter ORDER BY date DESC) = 1

In any case, if that would be the most convenient procedure to get the latest observations for all stations, it would be sweet to provide an even more convenient accessor method for that.

I don't quite know where the right place is for that, but I am sure you will find the right spot, @gutzbenj. What about wrapping the DuckDB query into something like stations.values.all().not_null().first(), and wrapping the start_date and end_date parameters into a synthetic Period.LATEST?

This proposal might shorten the procedure to that snippet.

stations = DwdObservationRequest(
    parameter=DwdObservationDataset.TEMPERATURE_AIR,
    resolution=Resolution.MINUTE_10,
    period=Period.LATEST
).all()

values_latest = stations.values.all().not_null().first().df

Filtering data within pandas feels nasty for this large amount so we use DuckDB to query the data sql-a-like.

Maybe reconsider this and keep using pandas-based filtering and mangling instead?

guidocioni commented 1 year ago

Wow. What's that? ;]

QUALIFY
    ROW_NUMBER() OVER(PARTITION BY station_id, parameter ORDER BY date DESC) = 1

In any case, if that would be the most convenient procedure to get the latest observations for all stations, it would be sweet to provide an even more convenient accessor method for that.

I don't quite know where the right place is for that, but I am sure you will find the right spot, @gutzbenj. What about wrapping the DuckDB query into something like stations.values.all().not_null().first(), and wrapping the start_date and end_date parameters into a synthetic Period.LATEST?

This proposal might shorten the procedure to that snippet.

stations = DwdObservationRequest(
    parameter=DwdObservationDataset.TEMPERATURE_AIR,
    resolution=Resolution.MINUTE_10,
    period=Period.LATEST
).all()

values_latest = stations.values.all().not_null().first().df

Filtering data within pandas feels nasty for this large amount so we use DuckDB to query the data sql-a-like.

Maybe reconsider this and keep using pandas-based filtering and mangling instead?

Thanks, what I wanted to get rid of was the part where I was using the ids from the stations to do the second query: I forgot that we can just use the all part. I second the suggestion to have a dedicated endpoint...in the end getting the latest, most updated values for all stations might be a pretty common use case.