OasisLMF / ktools

In-memory simulation kernel for loss modelling.
BSD 3-Clause "New" or "Revised" License

ALCT estimated standard error of AAL overstates observed sampling error #304

Closed · johcarter closed this issue 12 months ago

johcarter commented 2 years ago

Issue Description

Advice from stats gurus would be very welcome on this problem.

The standard error of the AAL estimate in the new report seems to overstate the observed sampling error for a given sample size. Using PiWind with 10 locations, a bootstrap of the AAL calculated 100 times with 10 samples produces a standard deviation of 0.6%, versus an estimated standard error of 7.8%. While this is great news for the user, it means the ALCT (Average Loss Convergence Table) report is pretty useless as a predictive tool for AAL convergence.

I think the issue is a violation of the i.i.d. assumption, in particular the identically distributed part. Each annual loss observation comes from a particular period containing particular events, and those events have different loss variation: the bigger the event, the bigger the variation in loss. At the other end of the spectrum, around 2/3 of periods have no events and therefore zero loss variation. This is a case of extreme heteroscedasticity.

With a bit of googling I have found some methods that correct for model misspecification / i.i.d. violation:
https://stat-analysis.netlify.app/the-iid-violation-and-robust-standard-errors.html

Further investigation is needed to improve the estimated standard error and make this report useful.

Steps to Reproduce (Bugs only)

  1. Run PiWind with 1000 samples and ORD output, including ALCT output via the analysis settings flag "alct_convergence": true.
  2. Using the gul_S1_splt output, calculate the AAL for each 10-sample subset, producing 100 AAL estimates (see the sketch after these steps).
  3. Find the 0.975 and 0.025 quantiles of the 100 AAL observations, corresponding to a 95% confidence interval.
  4. Take the standard deviation of the AAL estimates.
  5. Compare this value with the standard error from the 10-sample run, which can be found in the new gul_S1_alct report.
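
A minimal sketch of steps 2-4 in Python, assuming gul_S1_splt.csv has Period, SampleId and Loss columns per the ORD Sample Period Loss Table; the file path and period count are placeholders for the run described above:

```python
import numpy as np
import pandas as pd

# Sample Period Loss Table from the 1000-sample PiWind run (assumed columns:
# Period, SampleId, Loss; zero-loss period/sample pairs are simply absent).
splt = pd.read_csv("gul_S1_splt.csv")

# Total number of periods I; in practice take this from the occurrence
# file rather than the SPLT, which only lists periods with losses.
n_periods = 10_000  # placeholder

# AAL for each disjoint 10-sample subset = subset loss total / (I * 10),
# which is correct because the absent rows contribute zero loss.
aal_estimates = []
for start in range(0, 1000, 10):
    subset = splt[splt["SampleId"].between(start + 1, start + 10)]
    aal_estimates.append(subset["Loss"].sum() / (n_periods * 10))

aal_estimates = np.array(aal_estimates)
lo, hi = np.quantile(aal_estimates, [0.025, 0.975])  # empirical 95% CI
print("bootstrap SD of AAL:", aal_estimates.std(ddof=1))
print("95% interval:", lo, hi)
```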

Version / Environment information

1.26

Example data / logs

johcarter commented 1 year ago

The mathematical model for partitioning the variance is a random effects model, which requires the hazard and vulnerability factors to be random. However, in the Oasis framework the hazard element is fixed, not random (event occurrences are assigned to years in a fixed timeline). This means that the hazard element of the AAL variance, using the formula in the attached paper, does not reduce with increasing samples, so the formula does not accurately predict the overall variance in the AAL estimate, which does reduce in proportion to the number of samples under the CLT.
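
To make the shape of the problem concrete, an illustrative decomposition (a sketch, not the exact formula from the attached paper; $\sigma_h^2$ and $\sigma_v^2$ are my labels for the between-period hazard variance and within-period sampling variance) for $I$ periods and $M$ samples per period is:

$$\operatorname{Var}\big(\widehat{\mathrm{AAL}}\big) \approx \frac{\sigma_h^2}{I} + \frac{\sigma_v^2}{IM}$$

Increasing $M$ only shrinks the second term, and with a fixed event timeline the $\sigma_h^2/I$ term is not sampling error at all, so an estimator that includes it overstates the error actually observed across sample subsets.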

The random effects model described in anova_technique_methodology may be suitable for other cat loss modelling calculation frameworks where the hazard element is also random, so it is attached here for future reference.

Thank you to Radek @OasisLMF/impactforecasting for getting to the bottom of this.

In terms of the convergence report, I propose dropping the ANOVA fields and estimating the standard error of the AAL using the standard deviation calculated from all annual loss samples. This is s / sqrt(IM), where s is the sample standard deviation of the annual losses for i = 1, 2, ..., IM (I being the total number of periods and M being the number of samples; see the sketch below). Updated proposed reports are attached.
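
A minimal sketch of the proposed estimator, under the same assumed SPLT columns as the earlier sketch (I and M are placeholders for the run's period and sample counts):

```python
import numpy as np
import pandas as pd

splt = pd.read_csv("gul_S1_splt.csv")
I, M = 10_000, 10  # placeholder period and sample counts

# Rebuild the full vector of I*M annual losses, reinstating the zero
# rows that the SPLT omits, then take the sample standard deviation.
annual = np.zeros(I * M)
idx = (splt["Period"] - 1) * M + (splt["SampleId"] - 1)
np.add.at(annual, idx.to_numpy(), splt["Loss"].to_numpy())

s = annual.std(ddof=1)          # sample std dev of the annual losses
se_aal = s / np.sqrt(I * M)     # proposed standard error of the AAL
print("estimated SE of AAL:", se_aal)
```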

ORD_convergence_tables_v6.xlsx

anova_technique_methodology_v1.pdf

FYI @hchagani-oasislmf

benhayes21 commented 1 year ago

Drop ANOVA fields from the output report.