The expected counts are higher than the actual counts for HANSWT extremes, even though there are no nans or missing timesteps. This is because the frequency in the expected counts computation is based on the median instead of the mean. For HOEKVHLD this goes fine since the expected values are lower than the actual, but for HANSWT the median freq results in expected values higher than the actual number of values.
import os
import pandas as pd
import hatyan
import kenmerkendewaarden as kw
from kenmerkendewaarden.tidalindicators import compute_actual_counts, compute_expected_counts
# set logging level to INFO to get log messages
import logging
logging.getLogger("kenmerkendewaarden").setLevel(level="INFO")
tstop_dt = pd.Timestamp(2021,1,1, tz="UTC+01:00")
dir_base = r'p:\11210325-005-kenmerkende-waarden\work'
dir_meas = os.path.join(dir_base,'measurements_wl_18700101_20240101')
# TODO: move to full data folder (otherwise overschrijding and slotgemiddelde is completely wrong)
# dir_meas = os.path.join(dir_base,'measurements_wl_20101201_20220201')
dir_meas = r"c:\Users\veenstra\Downloads\measurements_wl_18700101_20240101"
current_station = 'HANSWT'
print("loading meas")
data_pd_HWLW_all = kw.read_measurements(dir_output=dir_meas, station=current_station, extremes=True)
data_pd_HWLW_all_12 = hatyan.calc_HWLW12345to12(data_pd_HWLW_all) #convert 12345 to 12 by taking minimum of 345 as 2 (laagste laagwater)
# computing counts
act_count_peryear = compute_actual_counts(data_pd_HWLW_all_12, freq="Y")
exp_count_peryear = compute_expected_counts(data_pd_HWLW_all_12, freq="Y")
# the max timediff is 9 hours and there are no nans, so there are no gaps
print("num nans:", data_pd_HWLW_all_12["values"].isnull().sum())
print("\nmax timediff:", (data_pd_HWLW_all_12.index[1:] - data_pd_HWLW_all_12.index[:-1]).max())
# however, the expected counts are higher because we compute the median frequency, not the mean
print("\nactual counts")
print(act_count_peryear)
print("\nexpected counts")
print(exp_count_peryear)
[x] consider using scipy.stats.trim_mean, which exludes x percentiles of date before taking the mean. This probably gives the desired behaviour for frequency estimation. >> This still gives values that are slightly too high, so does not give the desired behaviour.
[ ] consider only counting HW's in case of extremes, the tidal periods might be more constant than duur stijging/daling. For extremes with aggers the current approach would not work also, but this alternative might. Unfortunately, there are still 12 years that result in higher expected counts than actual counts, even after using .floor() on the expected counts
[ ] add testcase for this edgecase. Or add a testcase for hoekvhld extremes, since expected counts are too low there (because of same reason, but toolittle did not cause a problem so was not seen)
The expected counts are higher than the actual counts for HANSWT extremes, even though there are no nans or missing timesteps. This is because the frequency in the expected counts computation is based on the median instead of the mean. For HOEKVHLD this goes fine since the expected values are lower than the actual, but for HANSWT the median freq results in expected values higher than the actual number of values.
Gives:
Todo:
.floor()
on the expected counts