Rijkswaterstaat / wm-ws-dl

wm-ws-dl documentation
https://rijkswaterstaatdata.nl/waterdata

Missing/duplicated values in WATHTE timeseries and extremes #39

Open veenstrajelmer opened 3 months ago

veenstrajelmer commented 3 months ago

https://github.com/Rijkswaterstaat/wm-ws-dl/issues/38 describes missing extremes for some stations. Within the kenmerkendewaarden project a more extensive analysis was carried out (station_list_tk, period 1870-01-01 to 2024-01-01). It is based purely on the OphalenAantalWaarnemingen feature of the DDL, and the code is available on GitHub. The data was retrieved on 9-4-2024, so after the corrections to the DDL data on 22-02-2024.

For extremes, we expect 4*365=1460 values per year for the four daily low/high extremes. More values could indicate duplicates or aggers; fewer indicate missing values. The same holds for timeseries, where we expect between 24*365=8760 and 6*24*365=52560 values per year for 60min and 10min intervals respectively. All stations with more values have duplicated data. Some stations cannot be retrieved per year, since the maximum number of values allowed by the DDL is 157681. Therefore, all stations have to be retrieved per month instead, which is often quite inefficient.
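As a rough illustration of the arithmetic above, the sketch below computes the expected yearly counts and checks them against the DDL limit. The helper name `fits_in_one_request` is hypothetical and not part of the DDL or any existing package, and the 1-minute case is only an assumed example of a series dense enough to exceed the limit within one year.

```python
# Minimal sketch; only the limit and the 1460/8760/52560 counts come from the text above.
DDL_MAX_VALUES = 157681  # maximum number of values per request, as stated above

# Expected number of values in a (non-leap) year
expected_ext = 4 * 365            # four daily high/low extremes
expected_ts_60min = 24 * 365      # hourly timeseries
expected_ts_10min = 6 * 24 * 365  # 10-minute timeseries
expected_ts_1min = 60 * 24 * 365  # assumed example: a 1-minute timeseries


def fits_in_one_request(n_values: int) -> bool:
    """Hypothetical helper: True if n_values does not exceed the DDL limit."""
    return n_values <= DDL_MAX_VALUES


for name, n in [
    ("extremes", expected_ext),
    ("60min", expected_ts_60min),
    ("10min", expected_ts_10min),
    ("1min", expected_ts_1min),
]:
    print(f"{name}: {n} values/year, fits in one yearly request: {fits_in_one_request(n)}")
```

Under these assumptions a single 10-minute series fits comfortably in a yearly request, so a year that exceeds the limit must contain considerably more values than one such series.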

These amounts are visualized as a relative coverage per year.

For extremes: image

Cases with more than 100% coverage, like HOEKVHLD, can probably be explained by the presence of aggers. However, for SCHEVNGN and STELLDBTN this is not consistent over time. There are also stations with no extremes present at all (A12, AWGPFM, BAALHK, GATVBSLE, D15, F16, F3PFM, J6, K14PFM, L9PFM, MAASMSMPL, NORTHCMRT, OVLVHWT, Q1, SINTANLHVSGR, WALSODN). Is this expected?

For timeseries: image

Raw data CSVs with the number of measurements per year and per station: data_amount_ext.csv, data_amount_ts.csv
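The relative coverage used in the figures can be sketched as the yearly count divided by the expected count. The snippet below is only an illustration: the function names are hypothetical, the layout of the CSV files is not assumed, and the example counts are made up rather than taken from data_amount_ext.csv.

```python
def relative_coverage(count: int, expected: int) -> float:
    """Coverage in percent: 100 means exactly the expected number of values."""
    return 100.0 * count / expected


def classify(count: int, expected: int) -> str:
    """Label a yearly count using the interpretation given in this issue."""
    if count == 0:
        return "no data"
    if count > expected:
        return "more than expected (duplicates or aggers)"
    if count < expected:
        return "fewer than expected (missing values)"
    return "complete"


EXPECTED_EXT = 4 * 365  # expected yearly number of extremes used in this issue

# Made-up yearly counts, for illustration only
example_counts = {"year_a": 1460, "year_b": 2100, "year_c": 700, "year_d": 0}

for year, count in example_counts.items():
    pct = relative_coverage(count, EXPECTED_EXT)
    print(f"{year}: {pct:.1f}% -> {classify(count, EXPECTED_EXT)}")
```

Keeping the expected count as a parameter makes it easy to rerun the classification with a different baseline per station or per quantity.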

Additional issue: the stations D15, J6 and NES only show very recent data, because for these stations the realtime station was accidentally selected instead of the historic one. https://github.com/Rijkswaterstaat/wm-ws-dl/issues/20 would solve this issue.

TvLoon-RWS commented 1 month ago

This issue will be checked with the long-term data storage.

KDoekes-RWS commented 1 month ago

This is all known and explainable. The number of HW/LWs in a normal year for a non-agger location is 1410 or 1411. The series of course didn't all start on January 1st of a year, and there are also cases where, for long periods, only HWs were available for a tide gauge (e.g. Cadzand), or only daylight HW/LWs (e.g. Oudeschild). Equidistant time series with too much or too little data cannot be stored in DONAR. There are some cases where more than one series is stored for the same period with different time steps, such as 10 minutes and 1 minute for the 'basisstations' (Vlissingen, Hoek van Holland, IJmuiden buitenhaven, Harlingen and Delfzijl) during the last years, and cases where more than one series for the same period with different manager codes (Dutch: instantiecodes) was stored, mainly in the Delta area. The data of a number of relatively new offshore gauges are not validated, and no HW/LW data are computed for them. The gauges in the Delta area mentioned for which no extremes were computed at all are not considered part of the MWTL program, either.
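The figure of 1410 or 1411 HW/LWs per year can be reproduced with a back-of-the-envelope calculation, assuming a semidiurnal tide whose high and low waters follow the period of the principal lunar constituent M2 (about 12.42 hours). This assumption and the function name below are illustrative and not taken from the comment above.

```python
# Assumption: one high and one low water per M2 cycle (no aggers, no gaps).
M2_PERIOD_HOURS = 12.4206012  # period of the principal lunar semidiurnal constituent


def extremes_per_year(days: int = 365) -> float:
    """Approximate number of HW + LW in a year of the given length."""
    cycles = days * 24 / M2_PERIOD_HOURS
    return 2 * cycles


n = extremes_per_year()
print(f"approx. {n:.2f} extremes per 365-day year")
print(f"relative to 4*365=1460: {100 * n / 1460:.1f}%")
```

Since the result lies between 1410 and 1411, a complete year at a non-agger station would show slightly under 100% coverage when 4*365=1460 is used as the expected count.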