`compute_expected_counts` gives incorrect values for HANSWT ext

The expected counts are higher than the actual counts for HANSWT extremes, even though there are no nans or missing timesteps. This is because the frequency in the expected counts computation is based on the median instead of the mean. For HOEKVHLD this goes fine since the expected values are lower than the actual, but for HANSWT the median freq results in expected values higher than the actual number of values.

import os
import pandas as pd
import hatyan
import kenmerkendewaarden as kw
from kenmerkendewaarden.tidalindicators import compute_actual_counts, compute_expected_counts

# set logging level to INFO to get log messages
import logging
logging.getLogger("kenmerkendewaarden").setLevel(level="INFO")

tstop_dt = pd.Timestamp(2021,1,1, tz="UTC+01:00")

dir_base = r'p:\11210325-005-kenmerkende-waarden\work'
dir_meas = os.path.join(dir_base,'measurements_wl_18700101_20240101')
# TODO: move to full data folder (otherwise overschrijding and slotgemiddelde is completely wrong)
# dir_meas = os.path.join(dir_base,'measurements_wl_20101201_20220201')
dir_meas = r"c:\Users\veenstra\Downloads\measurements_wl_18700101_20240101"
current_station = 'HANSWT'

print("loading meas")
data_pd_HWLW_all = kw.read_measurements(dir_output=dir_meas, station=current_station, extremes=True)
data_pd_HWLW_all_12 = hatyan.calc_HWLW12345to12(data_pd_HWLW_all) #convert 12345 to 12 by taking minimum of 345 as 2 (laagste laagwater)

# computing counts
act_count_peryear = compute_actual_counts(data_pd_HWLW_all_12, freq="Y")
exp_count_peryear = compute_expected_counts(data_pd_HWLW_all_12, freq="Y")

# the max timediff is 9 hours and there are no nans, so there are no gaps
print("num nans:", data_pd_HWLW_all_12["values"].isnull().sum())
print("\nmax timediff:", (data_pd_HWLW_all_12.index[1:] - data_pd_HWLW_all_12.index[:-1]).max())

# however, the expected counts are higher because we compute the median frequency, not the mean
print("\nactual counts")
print(act_count_peryear)

print("\nexpected counts")
print(exp_count_peryear)

Gives:

num nans: 0

max timediff: 0 days 09:00:00

actual counts
time
1880     711
1881    1410
1882    1411
1883    1411
1884    1414

2019    1410
2020    1416
2021    1410
2022    1411
2023       3
Freq: A-DEC, Name: values, Length: 144, dtype: int64

expected counts
time
1880    1424.432432
1881    1420.540541
1882    1420.540541
1883    1420.540541
1884    1424.432432

2019    1420.540541
2020    1424.432432
2021    1420.540541
2022    1420.540541
2023    1401.600000
Freq: A-DEC, Length: 144, dtype: float64

Todo:

[x] consider using scipy.stats.trim_mean, which exludes x percentiles of date before taking the mean. This probably gives the desired behaviour for frequency estimation. >> This still gives values that are slightly too high, so does not give the desired behaviour.
[ ] consider only counting HW's in case of extremes, the tidal periods might be more constant than duur stijging/daling. For extremes with aggers the current approach would not work also, but this alternative might. Unfortunately, there are still 12 years that result in higher expected counts than actual counts, even after using .floor() on the expected counts
[ ] add testcase for this edgecase. Or add a testcase for hoekvhld extremes, since expected counts are too low there (because of same reason, but toolittle did not cause a problem so was not seen)

Deltares-research / kenmerkendewaarden

`compute_expected_counts` gives incorrect values for HANSWT ext #88