OasisLMF / ktools

In-memory simulation kernel for loss modelling.
BSD 3-Clause "New" or "Revised" License

Incorrect Values from Wheatsheaf/Per Sample Mean with Period Weights in leccalc/ordleccalc #344

Closed · hchagani-oasislmf closed this issue 1 year ago

hchagani-oasislmf commented 1 year ago

Issue Description

When using period weights, the Wheatsheaf/per sample mean output from leccalc and ordleccalc is incorrect.

Steps to Reproduce (Bugs only)

  1. Clone ktools repository and compile as usual.
  2. Run bash runtests.sh from within the ktools/ktest/ directory to generate test results in the ktools/ktest/testout/ directory. Rename this directory (to ktools/ktest/testout_control/, for example) so that it is not overwritten in the next steps.
  3. Generate a period weights file in the ktools/examples/input/ directory, containing 10,000 periods, each with weight 0.0001 (= 1/10,000); a sketch for generating this file follows the list.
  4. Run bash runtests.sh from within ktools/ktest/ directory.
  5. A total of 52 tests fail. The failing files can be compared between the testout/ and testout_control/ directories. Most discrepancies can be explained by floating point error; however, the Wheatsheaf/per sample mean results cannot.
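
For step 3, here is a minimal sketch of a generator for the period weights file. The period_no,weighting CSV layout and the periods.csv file name are assumptions here, not taken from the ktools source; the CSV would then be converted to the binary periods file with the ktools conversion tooling (check the ktools documentation for the exact format):

  #include <cstdio>

  int main() {
    // Write 10,000 periods, each with weight 0.0001 (= 1/10,000), as
    // period_no,weighting pairs; layout and file name are assumptions
    FILE *f = std::fopen("periods.csv", "w");
    if (!f) return 1;
    for (int period_no = 1; period_no <= 10000; period_no++)
      std::fprintf(f, "%d,%.4f\n", period_no, 1.0 / 10000);
    std::fclose(f);
    return 0;
  }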

Version / Environment information

Tested with the latest release, ktools v3.9.6. The aggreports class was refactored in ktools v3.6.0 (see PR https://github.com/OasisLMF/ktools/pull/195), but these failures can also be seen prior to this change (i.e. in ktools <= v3.5.1).

Example data / logs

$ diff testout_control/gul_wheatsheaf_mean_aep_2_r.csv testout/gul_wheatsheaf_mean_aep_2_r.csv 
16,24c16,24
< 1,2,10000.000000,352552.093750
< 1,2,7500.000000,324293.812500
< 1,2,5000.000000,296035.562500
< 1,2,1000.000000,173939.984375
< 1,2,500.000000,127654.179688
< 1,2,250.000000,85973.617188
< 1,2,200.000000,73565.835938
< 1,2,100.000000,33744.691406
< 1,2,50.000000,0.000000
---
> 1,2,10000.000000,226476.906250
> 1,2,7500.000000,210211.343750
> 1,2,5000.000000,193945.796875
> 1,2,1000.000000,149813.546875
> 1,2,500.000000,112040.937500
> 1,2,250.000000,73787.015625
> 1,2,200.000000,67999.468750
> 1,2,100.000000,45883.308594
> 1,2,50.000000,1366.349976
hchagani-oasislmf commented 1 year ago

The issue lies in L1233-1242 of aggreports.cpp:

  for (auto s : items) {
    lossvec2 &lpv = s.second;
    // Sort each sample's losses in descending order...
    std::sort(lpv.rbegin(), lpv.rend());
    for (auto lp : lpv) {
      // ...then accumulate them into a map indexed by period number,
      // which immediately discards the ordering just established
      lossval &l = mean_map[s.first.summary_id][lp.period_no-1];
      l.period_no = lp.period_no;
      l.period_weighting = lp.period_weighting;
      l.value += lp.value;
    }
  }

The loss vector is sorted in descending order and then reorganised by period number, which nullifies the sort. The result is only correct in the rare case where loss values happen to decrease monotonically with increasing period number.
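
A standalone sketch of the same pattern (with simplified stand-ins for the ktools types, not the actual ktools source) makes the problem explicit:

  #include <algorithm>
  #include <cstdio>
  #include <vector>

  // Simplified stand-in for the lossval records in aggreports.cpp
  struct lossval { int period_no = 0; double value = 0.0; };
  bool operator<(const lossval &a, const lossval &b) { return a.value < b.value; }

  int main() {
    // Losses deliberately not monotonic in period number
    std::vector<lossval> lpv = {{1, 843.99}, {2, 622.32}, {3, 997.51}, {4, 430.55}};

    std::sort(lpv.rbegin(), lpv.rend());   // descending by loss: 997.51, 843.99, ...

    // Scatter back into a vector indexed by period number, as the loop
    // above does: the descending order is immediately lost
    std::vector<lossval> by_period(lpv.size());
    for (const lossval &lp : lpv) by_period[lp.period_no - 1] = lp;

    for (const lossval &l : by_period)
      std::printf("period %d loss %.2f\n", l.period_no, l.value);
    // Prints the losses back in period order: 843.99, 622.32, 997.51, 430.55
    return 0;
  }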

In the following randomly generated data set, periods 1 to 3 each have a weight of 0.1, and period 4 has a weight of 0.7. For sidx = 1 and sidx = 2, the largest losses occur in periods 3 and 4 respectively. Within each sample, losses are ranked in descending order, the period weights are accumulated down the ranking, and the return period (rp) is the reciprocal of the cumulative weight. Because the period weights are non-uniform, the rp values are not identical between samples:

   period  weight        loss  sidx  cum_weight         rp
0       3     0.1  997.513245     1         0.1  10.000000
1       1     0.1  843.988044     1         0.2   5.000000
2       2     0.1  622.320179     1         0.3   3.333333
3       4     0.7  430.553659     1         1.0   1.000000
4       4     0.7  985.415658     2         0.7   1.428571
5       2     0.1  877.047875     2         0.8   1.250000
6       1     0.1  860.664077     2         0.9   1.111111
7       3     0.1  823.060403     2         1.0   1.000000
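
The return periods above follow from ranking each sample's losses and accumulating the period weights. A short illustrative sketch (not ktools source) reproduces the sidx = 1 rows of the table:

  #include <algorithm>
  #include <cstdio>
  #include <vector>

  struct PeriodLoss { int period_no; double weight; double loss; };

  int main() {
    // Sample 1 (sidx = 1) from the table above
    std::vector<PeriodLoss> sample = {
      {1, 0.1, 843.988044}, {2, 0.1, 622.320179},
      {3, 0.1, 997.513245}, {4, 0.7, 430.553659}};

    // Rank losses in descending order
    std::sort(sample.begin(), sample.end(),
              [](const PeriodLoss &a, const PeriodLoss &b) { return a.loss > b.loss; });

    // Accumulate weights down the ranking; the return period is the
    // reciprocal of the cumulative weight
    double cum_weight = 0.0;
    for (const PeriodLoss &pl : sample) {
      cum_weight += pl.weight;
      std::printf("period %d loss %f rp %f\n", pl.period_no, pl.loss, 1.0 / cum_weight);
    }
    return 0;
  }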

Given the above, how should the mean be determined for return periods that only occur when sidx = 1 (i.e. return periods 10.0, 5.0 and 3.33)? Should the losses be divided by the total number of samples (in this case 2) or should they be divided by the number of samples where the return period lies between the minimum and maximum return periods of that sample (1 in this example)?

After discussions with @johcarter, we have concluded that it makes most sense to adopt the latter method: we should not extrapolate beyond a sample's range of return periods. The return periods should be aligned across samples, losses interpolated within each sample's range, and the mean taken across the samples that contribute at each return period.
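
A minimal sketch of this scheme (illustrative code, not ktools source; linear interpolation in return-period space is assumed, which reproduces the tables below):

  #include <cstdio>
  #include <vector>

  // (return period, loss) points for one sample, sorted by descending rp
  using Curve = std::vector<std::pair<double, double>>;

  // Linearly interpolate the loss at rp; returns false outside the
  // sample's own return period range instead of extrapolating
  bool lossAtRp(const Curve &c, double rp, double &loss) {
    if (rp > c.front().first || rp < c.back().first) return false;
    for (size_t i = 1; i < c.size(); i++) {
      if (rp >= c[i].first) {   // rp lies between points i-1 and i
        double frac = (rp - c[i].first) / (c[i-1].first - c[i].first);
        loss = c[i].second + frac * (c[i-1].second - c[i].second);
        return true;
      }
    }
    return false;
  }

  int main() {
    // Per sample (rp, loss) curves from the example data set
    Curve s1 = {{10.0, 997.513245}, {5.0, 843.988044},
                {10.0 / 3.0, 622.320179}, {1.0, 430.553659}};
    Curve s2 = {{10.0 / 7.0, 985.415658}, {1.25, 877.047875},
                {10.0 / 9.0, 860.664077}, {1.0, 823.060403}};
    std::vector<Curve> samples = {s1, s2};

    // Aligned return periods: the union of both samples' return periods
    std::vector<double> rps = {10.0, 5.0, 10.0 / 3.0, 10.0 / 7.0,
                               1.25, 10.0 / 9.0, 1.0};

    for (double rp : rps) {
      double sum = 0.0;
      int n = 0;   // number of samples whose range covers this rp
      for (const Curve &c : samples) {
        double loss;
        if (lossAtRp(c, rp, loss)) { sum += loss; n++; }
      }
      if (n) std::printf("rp %f mean loss %f\n", rp, sum / n);
    }
    return 0;
  }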

Adopting this method, we get the following:

    sidx         rp        loss
0      1  10.000000  997.513245
1      1   5.000000  843.988044
2      1   3.333333  622.320179
3      1   1.428571  465.776081
4      1   1.250000  451.100072
5      1   1.111111  439.685398
6      1   1.000000  430.553659
7      2   1.428571  985.415658
8      2   1.250000  877.047875
9      2   1.111111  860.664077
10     2   1.000000  823.060403

The per sample means are:

          rp        loss
0  10.000000  997.513245
1   5.000000  843.988044
2   3.333333  622.320179
3   1.428571  725.595870
4   1.250000  664.073974
5   1.111111  650.174738
6   1.000000  626.807030

This does result in return period 1.429 having a larger loss than return period 3.333. This may just be a quirk of the randomly generated data set, and we think this situation is unlikely to arise in real data.

hchagani-oasislmf commented 1 year ago

As the return periods depend on the period weights, the aforementioned solution requires that the data be traversed twice: once to determine the return periods, and a second time to fill them. However, if the return periods are known in advance, i.e. when the user supplies a return periods file, the first pass is unnecessary.

To our knowledge, there are no users who are interested in the per sample mean with period weights but no return periods file, and it does not appear to be a very useful metric. Unless there is an explicit desire from the user base, we have decided not to dedicate too much development time to this. The decision will be kept under review and can be revisited if it causes problems in the future. The Tail Value at Risk (TVaR) is a casualty of this decision, as it is only meaningful if the losses for all possible return periods are generated.

Therefore, it has been decided that support for the per sample mean with period weights will only be provided if a return periods file is present, and the TVaR will not be calculated.