@genedan can you share the dataset? If you have it readily available?
So I figured out that the data shown in the paper is the raa dataset. The sample calculation is shown on page 46. However, I don't think the package is following the math here; I will investigate more.
@genedan if you have anything else that might be helpful to me, such as the calculation done in excel? Please post them here. :)
Hey @kennethshsu, sorry for being so late on this. I had uploaded a csv file called 'mack97' that I put in the data folder, but I just realized it's the same as raa, so perhaps we can delete mack97.
The package should follow the calculations. If you could let me know where you found an inconsistency, perhaps I can help out. The image I pasted from page 45 of the paper is a bit confusing because it contains two triangles mashed together.
The triangle produced by the internal variable m1 should match up with the columns with the r_ij headers. Two ways to produce this are to either include the nan_policy='omit' argument in the xp.apply_along_axis() call or to downgrade scipy below 1.10.
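To illustrate the first option, here is a minimal sketch outside the package. It assumes SciPy 1.10 or later and plain NumPy in place of the xp backend, and the link ratios are made-up numbers, not the raa data. It only shows how extra keyword arguments reach rankdata through apply_along_axis; it is not the package's actual code.

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical link ratios; NaN marks cells that do not exist yet.
link_ratios = np.array([
    [1.5, 1.2, 1.1],
    [1.7, 1.3, np.nan],
    [1.6, np.nan, np.nan],
])

# Extra keyword arguments are forwarded to rankdata for each column,
# so every column is ranked over its non-NaN entries only.
ranks = np.apply_along_axis(rankdata, 0, link_ratios, nan_policy="omit")
print(ranks)
```

Each column should come back ranked among its populated cells, with NaN left in the empty ones, which is the shape of the r_ij columns in the paper.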
Yap, the dataset is totally the raa, we can remove mack97 later.
It's been a while but I think maybe there are two problems with .ValuationCorrelation().
First, is the issue that you described, related to the ranking of the factors.
Second, I think there is a bug, which still exists on 0.8.18.
import chainladder as cl
import pandas as pd
xyz = cl.load_sample("xyz")
xyz["Incurred"].valuation_correlation(p_critical=.1, total=False).z_critical
This gets you an error: ValueError: Shape of passed values is (1, 10), indices imply (1, 9). I started debugging the code, and that's when I found out that I don't really think the calculation in the package follows Mack's text.
Is this what you see?
Here is my dev branch.
Describe the bug
The most recent version of scipy produces NaNs when ranking link ratios in the initialization of a ValuationCorrelation object. The Mack valuation correlation test requires link ratios to be ranked for each pair of development periods, but when using the package, only the first column gets ranked because it's the only one fully populated.
The following output uses the Mack 97 data set, and is an intermediate calculation from the __init__() function represented by the m1 variable:
Input triangle:
Ranks:
I believe the culprit is a change to the signature of scipy.stats.rankdata(). There is a new parameter called nan_policy which specifies how NaNs should be handled if they appear in the inputs:
Prior to this change, there was no parameter, so I would assume the default way of handling the ranking was to omit the NaNs.
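A small sketch of the two policies on a single column, assuming SciPy 1.10 or later; the values are invented for illustration. With the default 'propagate', one NaN in the input turns the whole result into NaN, while 'omit' ranks the remaining values and leaves NaN only where the input had it.

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical column of link ratios with one missing cell.
column = np.array([1.2, 1.5, 1.1, np.nan])

# Default policy: any NaN in the input makes every rank NaN.
propagated = rankdata(column, nan_policy="propagate")

# 'omit': NaNs are skipped when ranking and stay NaN in the output.
omitted = rankdata(column, nan_policy="omit")

print(propagated)
print(omitted)
```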
To Reproduce
You will see the following warning, but upon further inspection you will see that the intermediate variable m1 has an incorrect triangle of link ratio ranks:
Expected behavior
We should see all link ratio periods ranked (the ones labeled "r" in the image from the Mack 97 paper):
Adding the argument nan_policy='omit' seems to solve the problem: