Deltares-research / kenmerkendewaarden

Derive indicators from waterlevel measurements
https://deltares-research.github.io/kenmerkendewaarden/

`compute_expected_counts` gives incorrect values for HANSWT ext #88

Open veenstrajelmer opened 1 month ago

veenstrajelmer commented 1 month ago

The expected counts are higher than the actual counts for the HANSWT extremes, even though there are no nans or missing timesteps. This happens because the frequency in the expected-counts computation is derived from the median time difference instead of the mean. For HOEKVHLD this goes fine since the expected counts stay below the actual counts, but for HANSWT the median-based frequency results in expected counts that exceed the actual number of values.
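
To isolate the effect, here is a minimal sketch with plain pandas (not the kenmerkendewaarden internals) on a toy series with skewed spacing: the median timestep is smaller than the mean timestep, so dividing the period by the median overestimates the count, while the mean-based estimate stays close to the actual count.

import pandas as pd

# toy spacing: two short gaps followed by one long gap, loosely mimicking
# asymmetric spacing between tidal extremes (the values are arbitrary)
gaps = pd.Series(pd.to_timedelta(["05:50:00", "05:50:00", "07:05:00"] * 480))
index = pd.DatetimeIndex(pd.Timestamp("2020-01-01", tz="UTC+01:00") + gaps.cumsum())
toy = pd.Series(1.0, index=index)

timesteps = toy.index.to_series().diff().dropna()
period = toy.index[-1] - toy.index[0]

print("actual count:", len(toy))
print("expected from median timestep:", period / timesteps.median())  # too high
print("expected from mean timestep:", period / timesteps.mean())      # close to actual

The script below reproduces the issue with the actual HANSWT extremes: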

import os
import pandas as pd
import hatyan
import kenmerkendewaarden as kw
from kenmerkendewaarden.tidalindicators import compute_actual_counts, compute_expected_counts

# set logging level to INFO to get log messages
import logging
logging.getLogger("kenmerkendewaarden").setLevel(level="INFO")

tstop_dt = pd.Timestamp(2021,1,1, tz="UTC+01:00")

dir_base = r'p:\11210325-005-kenmerkende-waarden\work'
dir_meas = os.path.join(dir_base,'measurements_wl_18700101_20240101')
# TODO: move to full data folder (otherwise overschrijding and slotgemiddelde are completely wrong)
# dir_meas = os.path.join(dir_base,'measurements_wl_20101201_20220201')
dir_meas = r"c:\Users\veenstra\Downloads\measurements_wl_18700101_20240101"
current_station = 'HANSWT'

print("loading meas")
data_pd_HWLW_all = kw.read_measurements(dir_output=dir_meas, station=current_station, extremes=True)
data_pd_HWLW_all_12 = hatyan.calc_HWLW12345to12(data_pd_HWLW_all)  # convert 12345 to 12 by taking the minimum of 345 as 2 (lowest low water)

# computing counts
act_count_peryear = compute_actual_counts(data_pd_HWLW_all_12, freq="Y")
exp_count_peryear = compute_expected_counts(data_pd_HWLW_all_12, freq="Y")

# the max timediff is 9 hours and there are no nans, so there are no gaps
print("num nans:", data_pd_HWLW_all_12["values"].isnull().sum())
print("\nmax timediff:", (data_pd_HWLW_all_12.index[1:] - data_pd_HWLW_all_12.index[:-1]).max())

# however, the expected counts are higher because we compute the median frequency, not the mean (a mean-based sketch follows below the output)
print("\nactual counts")
print(act_count_peryear)

print("\nexpected counts")
print(exp_count_peryear)

Gives:

num nans: 0

max timediff: 0 days 09:00:00

actual counts
time
1880     711
1881    1410
1882    1411
1883    1411
1884    1414
...
2019    1410
2020    1416
2021    1410
2022    1411
2023       3
Freq: A-DEC, Name: values, Length: 144, dtype: int64

expected counts
time
1880    1424.432432
1881    1420.540541
1882    1420.540541
1883    1420.540541
1884    1424.432432
...
2019    1420.540541
2020    1424.432432
2021    1420.540541
2022    1420.540541
2023    1401.600000
Freq: A-DEC, Length: 144, dtype: float64
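
For comparison, a mean-based per-year expected count stays close to the actual counts as long as there are no gaps, because the timesteps within a year sum to (roughly) the year length. This is a hypothetical sketch with plain pandas, not the actual compute_expected_counts implementation; the function name expected_counts_mean_based is made up for illustration, and boundary years that are only partially covered are not treated specially here.

import pandas as pd

def expected_counts_mean_based(ser):
    # expected number of values per calendar year, derived from the mean
    # timestep within each year instead of the median
    timesteps = ser.index.to_series().diff()
    mean_step_per_year = timesteps.groupby(ser.index.year).mean()
    # length of each calendar year (365 or 366 days)
    year_length = pd.Series(
        [pd.Timestamp(year=y + 1, month=1, day=1) - pd.Timestamp(year=y, month=1, day=1)
         for y in mean_step_per_year.index],
        index=mean_step_per_year.index,
    )
    return year_length / mean_step_per_year

# e.g. with the series from the reproduction script above:
# exp_count_mean_peryear = expected_counts_mean_based(data_pd_HWLW_all_12["values"])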

Todo: