Closed hchagani-oasislmf closed 1 year ago
The issue lies in L1233-1242 in aggereports.cpp
:
for (auto s : items) {
lossvec2 &lpv = s.second;
std::sort(lpv.rbegin(), lpv.rend());
for (auto lp : lpv) {
lossval &l = mean_map[s.first.summary_id][lp.period_no-1];
l.period_no = lp.period_no;
l.period_weighting = lp.period_weighting;
l.value += lp.value;
}
}
The loss vector is sorted in descending order, then reorganised by period number, thus nullifying the sorting. This only works in the rare cases when the loss value always decreases with increasing period number.
In the following randomly generated data set, periods 1 to 3 have a weight of 0.1, and period 4 has a weight of 0.7. For sidx = 1
and sidx = 2
, periods 3 and 4 have the largest losses respectively. Because of the non-uniform period weights, the return period (rp
) values are not identical between samples:
period weight loss sidx cum_weight rp
0 3 0.1 997.513245 1 0.1 10.000000
1 1 0.1 843.988044 1 0.2 5.000000
2 2 0.1 622.320179 1 0.3 3.333333
3 4 0.7 430.553659 1 1.0 1.000000
4 4 0.7 985.415658 2 0.7 1.428571
5 2 0.1 877.047875 2 0.8 1.250000
6 1 0.1 860.664077 2 0.9 1.111111
7 3 0.1 823.060403 2 1.0 1.000000
Given the above, how should the mean be determined for return periods that only occur when sidx = 1
(i.e. return periods 10.0, 5.0 and 3.33)? Should the losses be divided by the total number of samples (in this case 2) or should they be divided by the number of samples where the return period lies between the minimum and maximum return periods of that sample (1 in this example)?
After discussions with @johcarter, we have concluded that it would make most sense to adopt the latter method. We should not interpolate beyond the range. The return periods should be aligned, and the mean can be taken across them.
Adopting this method, we get the following:
sidx rp loss
0 1 10.000000 997.513245
1 1 5.000000 843.988044
2 1 3.333333 622.320179
3 1 1.428571 465.776081
4 1 1.250000 451.100072
5 1 1.111111 439.685398
6 1 1.000000 430.553659
7 2 1.428571 985.415658
8 2 1.250000 877.047875
9 2 1.111111 860.664077
10 2 1.000000 823.060403
The per sample means are:
rp loss
0 10.000000 997.513245
1 5.000000 843.988044
2 3.333333 622.320179
3 1.428571 725.595870
4 1.250000 664.073974
5 1.111111 650.174738
6 1.000000 626.807030
This does result in return period 1.429 having a larger loss than return period 3.333. This may just be a quirk of the randomly generated data set, and we think it is unlikely to see this situation in real data.
As the return periods are dependent on the period weights, the aforementioned solution would require that the data be traversed twice: once to determine the return periods; and the second time to fill them. However, if the return periods are known in advance, i.e. when the user supplies a return period file, the first iteration is unnecessary.
To our knowledge, there are no users who are interested in the per sample mean with period weights but no return periods file, and it does not appear to be a very useful metric. Unless there is an explicit desire from the user base, we have decided it would be better not to dedicate too much development time to this. This shall be kept under review, and can be revisited if this causes a problem in the future. The Tail Value at Risk (TVaR) is a casualty of this decision, as it is only meaningful if the losses for all possible return periods are generated.
Therefore, it has been decided that support for the per sample mean with period weights will only be provided if a return periods file is present, and the TVaR will not be calculated.
Issue Description
When using period weights, the Wheatsheaf/per sample mean output from
leccalc
andordleccalc
is incorrect.Steps to Reproduce (Bugs only)
ktools
repository and compile as usual.bash runtests.sh
from withinktools/ktest/
directory to generate test results inktools/ktest/testout/
directory. Rename this directory toktools/ktest/testout_control/
for example so it is not overwritten in the next steps.ktools/examples/input/
directory, containing 10,000 periods, each with weight 0.0001 (= 1/10000).bash runtests.sh
from withinktools/ktest/
directory.testout/
andtestout_control/
directories. Most discrepancies can be explained by floating point errors. However, results from Wheatsheaf/per sample mean cannot.Version / Environment information
Tested in latest ktools v3.9.6. The
aggreports
class was refactored in ktools v3.6.0 (see PR https://github.com/OasisLMF/ktools/pull/195), but these failures can also be seen prior to this change (i.e. ktools <= v3.5.1).Example data / logs