Deltares / ddlpy

API to Dutch Rijkswaterstaat archive (DDL, waterinfo.rws.nl) of monitoring water data
https://deltares.github.io/ddlpy/
GNU General Public License v3.0

create to_xarray function #85

Closed veenstrajelmer closed 5 months ago

veenstrajelmer commented 6 months ago

measurements.to_xarray() is possible, but it gives a huge dataset, which is inconvenient. ddlpy.simplify_dataframe(measurements).to_xarray() is better, but it still includes every code/omschrijving column separately. For BAALHK 1980-2024 this gives a dataset of over 1 GB, mainly due to the WaardeBepalingsmethode.Omschrijving variable.

Data variables:
    Status                                            (time) <U15 156MB ...
    WaarnemingMetadata.OpdrachtgevendeInstantieLijst  (time) <U12 125MB ...
    QC                                                (time) <U2 21MB ...
    AquoMetadata_MessageID                            (time) int64 21MB ...
    WaardeBepalingsmethode.Code                       (time) <U10 104MB ...
    WaardeBepalingsmethode.Omschrijving               (time) <U65 675MB ...
    waterlevel                                        (time) float64 21MB ...
    station_name                                      |S64 64B ...
    station_id                                        |S64 64B ...
    station_x_coordinate                              float64 8B ...
    station_y_coordinate                              float64 8B ...
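
One possible way to shrink such a dataset (a sketch, not existing ddlpy API) is to drop the long, repetitive Omschrijving column before converting and keep the code-to-omschrijving mapping as attrs on the code variable. The dataframe below is a toy stand-in with illustrative values, not real DDL output:

```python
import pandas as pd

# Toy stand-in for a simplified ddlpy measurements dataframe
# (column names match DDL, values are illustrative only).
df = pd.DataFrame({
    "waterlevel": [0.10, 0.20, 0.30, 0.20],
    "WaardeBepalingsmethode.Code": ["other:F037", "other:F037",
                                    "other:F001", "other:F001"],
    "WaardeBepalingsmethode.Omschrijving": [
        "Rekenkundig gemiddelde waarde over vorige 10 minuten",
        "Rekenkundig gemiddelde waarde over vorige 10 minuten",
        "Handmatig bepaald",
        "Handmatig bepaald",
    ],
})

# Build a code -> omschrijving mapping from the unique pairs.
cols = ["WaardeBepalingsmethode.Code", "WaardeBepalingsmethode.Omschrijving"]
mapping = (df[cols].drop_duplicates()
           .set_index(cols[0])[cols[1]].to_dict())

# Drop the omschrijving column and attach the mapping as attrs;
# attrs do not count towards Dataset.nbytes.
ds = df.drop(columns=[cols[1]]).to_xarray()
ds["WaardeBepalingsmethode.Code"].attrs.update(mapping)

print(ds.nbytes < df.to_xarray().nbytes)
```

The per-timestep omschrijving can still be recovered by looking up each code in the attrs, so no information is lost.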

Todo:

Example code that retrieves VLISSGN data with multiple Waardebepalingsmethode values (the method changed on 07-09-1993):

import ddlpy
locations = ddlpy.locations()
bool_hoedanigheid = locations['Hoedanigheid.Code'].isin(['NAP'])
bool_stations = locations.index.isin(['VLISSGN'])
bool_grootheid = locations['Grootheid.Code'].isin(['WATHTE'])
bool_groepering = locations['Groepering.Code'].isin(['NVT'])
selected = locations.loc[bool_grootheid & bool_hoedanigheid & bool_groepering & bool_stations]

tstart_dt = "1993-08-25 09:47:00" #VLISSGN got new Waardebepalingsmethode in this year
tstop_dt = "1994-11-30 09:50:00"

measurements = ddlpy.measurements(selected.iloc[0], tstart_dt, tstop_dt)
simple = ddlpy.simplify_dataframe(measurements)
list_cols = ['WaardeBepalingsmethode.Code','WaardeBepalingsmethode.Omschrijving']
print(measurements[list_cols].drop_duplicates())

Alternative example (already works):

import ddlpy
locations = ddlpy.locations()
bool_hoedanigheid = locations['Hoedanigheid.Code'].isin(['NAP'])
bool_stations = locations.index.isin(['BAALHK'])
bool_grootheid = locations['Grootheid.Code'].isin(['WATHTE'])
bool_groepering = locations['Groepering.Code'].isin(['NVT'])
selected = locations.loc[bool_grootheid & bool_hoedanigheid & bool_groepering & bool_stations]

date_start = "1990-02-15"
date_end = "1990-03-15"
measurements = ddlpy.measurements(selected.iloc[0], date_start, date_end)
print(measurements['WaardeBepalingsmethode.Code'].drop_duplicates())

simple = ddlpy.simplify_dataframe(measurements)
ds1 = measurements.to_xarray()
ds2 = simple.to_xarray()
print(f'full [MB]: {ds1.nbytes/1024**2:.3f}')
print(f'simple [MB]: {ds2.nbytes/1024**2:.3f}')

Prints:

time
1990-02-27 01:00:00+01:00    other:F039
1990-03-01 00:00:00+01:00    other:F027
Name: WaardeBepalingsmethode.Code, dtype: object

full [MB]: 1.801
simple [MB]: 0.267

Update 27-3-2023: with the catalog_filter argument added to ddlpy.locations() and the private ddlpy.ddlpy.catalog() function in https://github.com/Deltares/ddlpy/pull/87, retrieving the extended catalog is easy. We can also subset the catalog dataframe directly:

import pandas as pd
import ddlpy

catalog_filter = ['Compartimenten','Eenheden','Grootheden',
                  'Hoedanigheden','Groeperingen','MeetApparaten',
                  'Typeringen','WaardeBepalingsmethoden','WaardeBewerkingsmethoden','Parameters']

if True:  # use the private catalog(); set to False to use locations() instead
    result = ddlpy.ddlpy.catalog(catalog_filter=catalog_filter)
    cat_aquometadatalijst = pd.json_normalize(result["AquoMetadataLijst"])
else:
    locations = ddlpy.locations(catalog_filter=catalog_filter)
    cat_aquometadatalijst = locations
    # bool_grootheid = locations["Grootheid.Code"].isin(["WATHTE"])
    # bool_groepering = locations["Groepering.Code"].isin(["NVT"])
    # selected = locations.loc[bool_grootheid & bool_groepering]
    # cat_aquometadatalijst = selected

key_list_full = [x.replace(".Code","") for x in cat_aquometadatalijst.columns if x.endswith(".Code")]
# >> ['Compartiment', 'Eenheid', 'Grootheid', 'Hoedanigheid', 'MeetApparaat', 'Parameter', 'Typering', 'Groepering', 'WaardeBepalingsmethode']
for key in key_list_full:
    df_meta_unique = cat_aquometadatalijst[[f"{key}.Code",f"{key}.Omschrijving"]].drop_duplicates()
    print(f'\n{len(df_meta_unique)} unique {key} available in requested subset')
    print(df_meta_unique)

    dict_meta_unique = df_meta_unique.set_index(f"{key}.Code")[f"{key}.Omschrijving"].to_dict()

This can be used to add metadata attrs, but the full sets of WaardeBepalingsmethode and Parameter are far too large to add:

7 unique Compartiment available in requested subset
51 unique Eenheid available in requested subset
104 unique Grootheid available in requested subset
71 unique Hoedanigheid available in requested subset
34 unique MeetApparaat available in requested subset
895 unique Parameter available in requested subset
6 unique Typering available in requested subset
5 unique Groepering available in requested subset
510 unique WaardeBepalingsmethode available in requested subset
6 unique WaardeBewerkingsmethode available in requested subset

Instead, the unique values from the subsetted extended locations dataframe can be used. This is better than using the unique values from the measurements dataframe, since those differ per station and time range.
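
A sketch of how that could look (collect_meta_attrs is a hypothetical helper, and the dataframe is a toy stand-in for a subsetted extended locations dataframe; in practice it would come from ddlpy.locations(catalog_filter=...) filtered on Grootheid.Code etc.):

```python
import pandas as pd
import xarray as xr

# Toy stand-in for a subsetted extended locations dataframe
# (column names match DDL, values are illustrative only).
selected = pd.DataFrame({
    "Grootheid.Code": ["WATHTE", "WATHTE"],
    "Grootheid.Omschrijving": ["Waterhoogte", "Waterhoogte"],
    "WaardeBepalingsmethode.Code": ["other:F037", "other:F001"],
    "WaardeBepalingsmethode.Omschrijving": ["Gemiddelde", "Handmatig"],
})

def collect_meta_attrs(df):
    # For every ".Code"/".Omschrijving" column pair, collect the
    # unique code -> omschrijving mapping in the subset.
    attrs = {}
    keys = [c.replace(".Code", "") for c in df.columns if c.endswith(".Code")]
    for key in keys:
        pairs = df[[f"{key}.Code", f"{key}.Omschrijving"]].drop_duplicates()
        attrs[key] = pairs.set_index(f"{key}.Code")[f"{key}.Omschrijving"].to_dict()
    return attrs

ds = xr.Dataset()  # stand-in for the measurements dataset
for key, mapping in collect_meta_attrs(selected).items():
    # netCDF attrs must be scalars/strings, so store the mapping serialized
    ds.attrs[key] = str(mapping)

print(ds.attrs["Grootheid"])
```

Because the mapping is built from the locations subset rather than from one measurements request, the same attrs apply regardless of which station or time range is retrieved.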