casact / chainladder-python

Actuarial reserving in Python
https://chainladder-python.readthedocs.io/en/latest/
Mozilla Public License 2.0

[BUG] ValuationCorrelation produces NaNs in ranking link ratios #444

Closed by genedan 6 months ago

genedan commented 1 year ago

Describe the bug The most recent version of scipy produces NaNs when ranking link ratios in the initialization of a ValuationCorrelation object. The Mack valuation correlation test requires link ratios to be ranked for each pair of development periods, but when using the package, only the first column gets ranked because it's the only one fully populated.

The following output uses the Mack 97 data set and shows an intermediate calculation from the init() function, stored in the m1 variable:

Input triangle:

          12-24     24-36     36-48     48-60     60-72     72-84     84-96    96-108   108-120
1991   1.649840  1.319023  1.082332  1.146887  1.195140  1.112972  1.033261  1.002902  1.009217
1992  40.424528  1.259277  1.976649  1.292143  1.131839  0.993397  1.043431  1.033088       NaN
1993   2.636950  1.542816  1.163483  1.160709  1.185695  1.029216  1.026374       NaN       NaN
1994   2.043324  1.364431  1.348852  1.101524  1.113469  1.037726       NaN       NaN       NaN
1995   8.759158  1.655619  1.399912  1.170779  1.008669       NaN       NaN       NaN       NaN
1996   4.259749  1.815671  1.105367  1.225512       NaN       NaN       NaN       NaN       NaN
1997   7.217235  2.722886  1.124977       NaN       NaN       NaN       NaN       NaN       NaN
1998   5.142117  1.887433       NaN       NaN       NaN       NaN       NaN       NaN       NaN
1999   1.721992       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN

Ranks:

array([[[[ 1., nan, nan, nan, nan, nan, nan, nan, nan],
         [ 9., nan, nan, nan, nan, nan, nan, nan, nan],
         [ 4., nan, nan, nan, nan, nan, nan, nan, nan],
         [ 3., nan, nan, nan, nan, nan, nan, nan, nan],
         [ 8., nan, nan, nan, nan, nan, nan, nan, nan],
         [ 5., nan, nan, nan, nan, nan, nan, nan, nan],
         [ 7., nan, nan, nan, nan, nan, nan, nan, nan],
         [ 6., nan, nan, nan, nan, nan, nan, nan, nan],
         [ 2., nan, nan, nan, nan, nan, nan, nan, nan]]]])

I believe the culprit is a change to the signature of scipy.stats.rankdata(). There is a new parameter called nan_policy which specifies how NaNs should be handled if they appear in the inputs:

[image: scipy.stats.rankdata documentation showing the new nan_policy parameter]

Prior to this change there was no such parameter, so I assume NaNs were effectively omitted when ranking.
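
To illustrate with the 24-36 column from the triangle above (a standalone sketch, not code from the package), the new default of nan_policy='propagate' turns the whole slice into NaNs as soon as a single NaN is present:

import numpy as np
from scipy.stats import rankdata

# 24-36 link ratios from the triangle above; the latest origin year is still missing
col = np.array([1.319023, 1.259277, 1.542816, 1.364431,
                1.655619, 1.815671, 2.722886, 1.887433, np.nan])

rankdata(col)                     # scipy >= 1.10 default ('propagate'): all NaN
rankdata(col, nan_policy='omit')  # [2., 1., 4., 3., 5., 6., 8., 7., nan]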

To Reproduce

import chainladder as cl
import pandas as pd

df = pd.read_csv('mack_1997.csv')

mack97 = cl.Triangle(
    data=df,
    origin='Accident Year',
    development='Calendar Year',
    columns=['Case Incurred'],
    cumulative=True
)

mack97.valuation_correlation(p_critical=.1, total=False).z_critical.values

You will see the following warning, and upon further inspection the intermediate variable m1 holds an incorrect triangle of link ratio ranks:

RuntimeWarning: All-NaN slice encountered
  r, k = function_base._ureduce(a, func=_nanmedian, axis=axis, out=out,
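
The warning itself comes from taking a median over a slice that is entirely NaN, which is exactly what the broken ranks above feed into the rest of the calculation (the traceback points at _nanmedian). A standalone way to trigger just the warning:

import numpy as np

ranks = np.array([np.nan] * 9)  # an all-NaN column of ranks, as produced above
np.nanmedian(ranks)             # RuntimeWarning: All-NaN slice encountered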

Expected behavior

We should see all link ratio periods ranked (the ones labeled "r" in the image from the Mack 97 paper):

[image: table from the Mack 1997 paper showing the ranked link ratios (columns labeled r)]

Adding the argument nan_policy='omit' seems to solve the problem:

m1 = xp.apply_along_axis(rankdata, 2, lr.values, nan_policy='omit') * (lr.values * 0 + 1)
m1
Out[30]: 
array([[[[ 1.,  2.,  1.,  2.,  5.,  4.,  2.,  1.,  1.],
         [ 9.,  1.,  7.,  6.,  3.,  1.,  3.,  2., nan],
         [ 4.,  4.,  4.,  3.,  4.,  2.,  1., nan, nan],
         [ 3.,  3.,  5.,  1.,  2.,  3., nan, nan, nan],
         [ 8.,  5.,  6.,  4.,  1., nan, nan, nan, nan],
         [ 5.,  6.,  2.,  5., nan, nan, nan, nan, nan],
         [ 7.,  8.,  3., nan, nan, nan, nan, nan, nan],
         [ 6.,  7., nan, nan, nan, nan, nan, nan, nan],
         [ 2., nan, nan, nan, nan, nan, nan, nan, nan]]]])
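
For context, the trailing * (lr.values * 0 + 1) factor just re-applies the triangle's NaN mask, so any rank assigned to a missing cell is set back to NaN. A small standalone sketch of that trick:

import numpy as np

vals = np.array([1.2, np.nan, 0.9])
mask = vals * 0 + 1               # [1., nan, 1.]: 1 where data exists, NaN elsewhere
ranks = np.array([2., 3., 1.])    # hypothetical ranks where the NaN got ranked anyway
ranks * mask                      # [2., nan, 1.]: the missing cell is NaN again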


kennethshsu commented 1 year ago

@genedan can you share the dataset, if you have it readily available?

kennethshsu commented 1 year ago

So I figured out that the data shown in the paper is the raa dataset. The sample calculation is shown on page 46. However, I don't think the package is following the math here; I'll investigate more.

@genedan if you have anything else that might be helpful to me, such as the calculation done in Excel, please post it here. :)

genedan commented 11 months ago

Hey @kennethshsu, sorry for being so late on this. I had uploaded a CSV file called 'mack97' to the data folder, but I just realized it's the same as raa, so perhaps we can delete mack97.

The package should follow the calculations; if you could let me know where you found an inconsistency, perhaps I can help out. The image I pasted from page 45 of the paper is a bit confusing because it contains two triangles mashed together.

The triangle produced by the internal variable m1 should match up with the columns with the r_ij headers. Two ways to produce this are to pass the nan_policy='omit' argument through the xp.apply_along_axis() call or to downgrade scipy below 1.10.
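
If we want the fix to work on both old and new scipy at once, something along these lines might do it (just a sketch; rank_along_axis is a hypothetical helper, not an existing function in the package):

import inspect
import numpy as np
from scipy.stats import rankdata

def rank_along_axis(values, axis=2):
    """Rank link ratios along `axis`, ignoring NaNs, on old and new scipy alike."""
    kwargs = {}
    if "nan_policy" in inspect.signature(rankdata).parameters:
        # scipy >= 1.10: without this, any NaN turns the whole slice into NaNs
        kwargs["nan_policy"] = "omit"
    ranked = np.apply_along_axis(rankdata, axis, values, **kwargs)
    # re-apply the triangle's NaN mask so missing cells stay NaN
    return ranked * (values * 0 + 1)

This keeps the existing masking behavior and only passes nan_policy where scipy actually accepts it.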

kennethshsu commented 11 months ago

Yep, the dataset is indeed the raa one; we can remove mack97 later.

It's been a while, but I think there may be two problems with .ValuationCorrelation().

First is the issue you described, related to the ranking of the factors.

Second, I think there is a bug that still exists in 0.8.18:

import chainladder as cl
import pandas as pd
xyz = cl.load_sample("xyz")
xyz["Incurred"].valuation_correlation(p_critical=.1, total=False).z_critical

This raises ValueError: Shape of passed values is (1, 10), indices imply (1, 9). I started debugging the code, and that's when I came to think that the calculation in the package doesn't really follow Mack's text.

Is this what you see?

kennethshsu commented 11 months ago

Here is my dev branch.