casact / chainladder-python

Actuarial reserving in Python
https://chainladder-python.readthedocs.io/en/latest/
Mozilla Public License 2.0

cl.Triangle legitimate zeros converted to NaN #181

Open goduckie opened 3 years ago

goduckie commented 3 years ago

Legitimate zero entries in triangle data (e.g. an origin period that had zero loss experience) are converted to NaN. These entries should remain as zeros, since a zero is distinct from missing data.

This conversion to NaN causes subsequent issues (see below) as well as others, e.g. applying the Chainladder projection method results in a NaN ultimate rather than zero for that origin period (see the sketch at the end of this example).

import chainladder as cl
raa = cl.load_sample('raa')
df = raa.dev_to_val().to_frame(keepdims=True)

# set 1988 origin period to zero, e.g. a volatile class with zero loss experience in that year
df.loc[df['origin']=='1988-01-01', 'values'] = 0

raa_adj = cl.Triangle(data=df, origin='origin', development='valuation', columns='values', cumulative=True)

This results in the following; I would expect to see 0 in the first three entries of the 1988 origin period:

raa_adj 

         12       24       36       48       60       72       84       96       108      120
1981  5012.0   8269.0  10907.0  11805.0  13539.0  16181.0  18009.0  18608.0  18662.0  18834.0
1982   106.0   4285.0   5396.0  10666.0  13782.0  15599.0  15496.0  16169.0  16704.0      NaN
1983  3410.0   8992.0  13873.0  16141.0  18735.0  22214.0  22863.0  23466.0      NaN      NaN
1984  5655.0  11555.0  15766.0  21266.0  23425.0  26083.0  27067.0      NaN      NaN      NaN
1985  1092.0   9565.0  15836.0  22169.0  25955.0  26180.0      NaN      NaN      NaN      NaN
1986  1513.0   6445.0  11702.0  12935.0  15852.0      NaN      NaN      NaN      NaN      NaN
1987   557.0   4020.0  10946.0  12314.0      NaN      NaN      NaN      NaN      NaN      NaN
1988     NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN
1989  3133.0   5395.0      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN
1990  2063.0      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN

The 1988 entries are removed from the triangle data:

df_adj = raa_adj.to_frame(keepdims=True)
df_adj.loc[df_adj['origin']=='1988-01-01']
Empty DataFrame
Columns: [origin, development, values]
Index: []
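
For reference, fitting a basic development model on raa_adj (a sketch, continuing the snippet above) shows the downstream effect mentioned earlier:

cl.Chainladder().fit(raa_adj).ultimate_
# the 1988 ultimate comes back as NaN rather than 0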
jbogaardt commented 3 years ago

You are correct in recognizing that within the library 0 and nan are treated the same, specifically that all zeros are coerced to nan. You're also correct that in general 0 and nan are not the same thing, and that under certain situations having a 0 distinct from nan actually matters.

However, those situations (at least in my experience) are rare enough that the benefit of coercing the two to a common value outweighs them. Take, for instance, a larger triangle. When I say larger, I'm referring to a triangle with high cardinality in its index or values:

import chainladder as cl
prism = cl.load_sample('prism')
prism.values

[screenshot: prism.values displayed as a sparse array]

Such a triangle is inherently sparse. This one in particular is at a claim level, and any single claim will only have values for one origin period. Using a sparse array behind the scenes allows us to cram this triangle, the equivalent of 130K+ unique triangles, into a memory footprint of 4.6MB. Pretty amazing! Unfortunately, all sparse representations of data allow for one and only one fill_value, so my choices are as follows (see the sketch after the list):

  1. Support both nan and 0, which would severely impede the memory benefits obtainable on large sparse triangles.
  2. Coerce them to a common value, in this case nan, to support substantially larger throughput of data through chainladder.
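
A minimal sketch of the fill_value constraint, assuming the pydata/sparse package that chainladder uses behind the scenes (the toy array is illustrative only):

import numpy as np
import sparse

dense = np.array([[5012., 0., np.nan],
                  [3410., np.nan, np.nan]])

# Coerce zeros to nan and use nan as the single fill_value: only the
# genuinely non-empty cells need to be stored.
coerced = np.where(dense == 0, np.nan, dense)
coo = sparse.COO.from_numpy(coerced, fill_value=np.nan)
print(coo.nnz)  # 2 stored cells

# Keeping 0 distinct from nan forces the 0 to be stored explicitly,
# growing memory on large, mostly-empty triangles.
coo_with_zero = sparse.COO.from_numpy(dense, fill_value=np.nan)
print(coo_with_zero.nnz)  # 3 stored cells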

This is not to dismiss your suggestion; I think your assertion is correct. But there is a trade-off in doing so, and it is one that I think would relegate the library to being useful only on small/toy datasets.

goduckie commented 3 years ago

@jbogaardt, whether this situation is rare depends very much on what type of risk you are analyzing. For those working with aggregated triangles on excess lines / volatile lines this is not an unusual feature of the dataset, and certainly one would hope valid data is not discarded / overwritten.

Also, I don't think it needs to be one or the other; isn't there an additional choice:

  1. Add an argument to the Triangle class that lets the user specify whether zeros left of the diagonal (i.e. before the valuation date) should be overwritten with NaN. Right of the diagonal, the current NaN behaviour would be maintained.

Wouldn't this address the issue, whilst not limiting the performance of the package for large (but sparse) datasets?

Regarding "Such a triangle is inherently sparse": isn't this only because each claim is 'broadcast' to a shape that is consistent across the dataset? I'm guessing this is core to the package, but why is it necessary to do this? If a particular claim has an origin of 2010, why is it necessary to add empty origins after this?

jbogaardt commented 3 years ago

To an end user, all of this can be abstracted away in the simple way you're describing, but that's easier said than done. The code base is built on the assumption that zeros are coerced to nans, and that assumption is relied on throughout. The same is true for the assumption of a common grain. So the amount of rewriting and testing this would require is substantial. Maybe that's worth it if there are major obstacles, but I'm not seeing them yet. Help me understand the case you're running into where coercing 0 to nan causes a real problem and not just a theoretical one.

goduckie commented 3 years ago

The real, practical problem is that when multiple triangles are loaded into a single Triangle class, the end user has no way of telling from the cl.Triangle object which cells were genuinely zero and which are simply missing (for example, when a line of business started, or which of its years were loss free).

This is because the shape of the original data is standardized AND zeros are coerced to np.nan in the triangle class. Combined, this manipulation results in a loss of information.

An alternative is to create a separate Triangle instance for each original triangle, but initializing separate instances is quite a lot slower, and other issues persist, e.g. with tri.loc['b'].to_frame(keepdims=True) the NaN cells are not returned (so the original data cannot be extracted / reconstituted).

Example loss of info below:

import chainladder as cl
import pandas as pd

raa = cl.load_sample('raa')

# create lob a
df_a = raa.dev_to_val().to_frame(keepdims=True).reset_index()
df_a['Total'] = 'a'

# create lob b
df_b = df_a.copy()
df_b['Total'] = 'b'

# say, lob B started in 1984
df_b = df_b.loc[df_b['origin'] > '1984-01-01']

# lob b is volatile line with some loss free years, particularly at start owing to lower volume
zero_years = df_b['origin'].isin(['1985-01-01','1986-01-01','1989-01-01'])
df_b.loc[zero_years,'values'] = 0

#combine a & b
df = pd.concat([df_a, df_b])

tri = cl.Triangle(data=df, origin='origin', development='valuation', columns='values', index='Total', cumulative=True)

# cannot tell when this LOB started, or that first two years were loss free
tri.loc['b']
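
One way to keep track of the distinction outside of chainladder (a workaround sketch, not library functionality) is to record which cells of the source data were genuinely zero before constructing the Triangle:

# record the (lob, origin, valuation) cells that were truly zero in the source data
zero_cells = df.loc[df['values'] == 0, ['Total', 'origin', 'valuation']]
zero_cells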
jbogaardt commented 3 years ago

Thank you. This is an extremely clear example, and it is evident that you're getting knowledgeable about the library's nuances. But is this something that really needs a change in the code base, or is it just part of the actuary's workflow?

In terms of estimating unpaid claims, the projections would be unaffected. Under a Chainladder approach, the ultimate_ for those claim-free years will be nan, which can be considered 0. If you are going to apply an expected loss method like BornhuetterFerguson or CapeCod, you will have an exposure vector that tells you those loss-free years have some exposure, and your ultimate_ would be non-zero.
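
A generic sketch of that expected-loss workflow (using the clrd sample rather than the toy data above; the apriori value is illustrative only):

import chainladder as cl

# BornhuetterFerguson takes an exposure vector via sample_weight, so an
# origin with no reported loss can still pick up a non-zero ultimate from
# the apriori and its exposure.
clrd = cl.load_sample('clrd').groupby('LOB').sum()
bf_ult = cl.BornhuetterFerguson(apriori=0.65).fit(
    clrd['CumPaidLoss'],
    sample_weight=clrd['EarnedPremDIR'].latest_diagonal,
).ultimate_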

Alternatively, if you really want to hack things, you can just replace the zeros in your data with some value that is effectively zero. For example:

df_b.loc[zero_years,'values'] = 1e-100

Doing this will restore the "lost" information while being no less accurate than the floating point precision errors inherent in Python in general. This hack could probably be added to the Triangle constructor as an extra option, maybe with a coerce_zero_to_nan=False argument. It would then not need a wholesale rewrite of the library.

goduckie commented 3 years ago

Could you point me to where in the code base this becomes a problem? I'm struggling to understand why 0.0 should be problematic vs any other float. Is it divide-by-zero style problems?

Picking up on some points: "is this something that really needs a change in the code base or is it just part of the actuary's workflow?" It certainly causes workflow issues, but in itself this is a data handling problem.

" for those claim free years will be nan which can be considered 0" this is the whole issue / aim of the example, in some cases it is not possible to know which years are loss free and which ones are missing. As you point out this is true for Chainladder, I've also come across this when extracting the data from the triangle class using .to_frame the original triangle is lost, latest_diagonal also becomes ambiguous.

"Doing this will restore the "lost" information" Strictly, it introduces false information as well as new issues, for one age_to_age factors blow up....

jbogaardt commented 3 years ago

The main issue is the sparse support I mention in my first comment. Anywhere in the code base you see num_to_nan or nan_to_num, we're invoking the assumption that nan == 0. It's everywhere, and there is 0 chance... wait, there is nan chance (see what I did there 😄) that I will personally prioritize this issue. There are so many ways the library functionality can be expanded for the better, but this one gives a very low return on time investment. Again, you're right in your assessment of the issue, but I just value my time too much to do anything about it, let alone continue arguing about it. All this to say: you are welcome to make a PR to get the functionality you want without creating issues elsewhere. If you have no intention of making a PR, then I will close this issue as out of scope.
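
For anyone following along, a conceptual sketch of what those helpers amount to (illustrative only, not chainladder's actual implementation):

import numpy as np

def num_to_nan(arr):
    # zeros become nan so they drop out of sparse storage and of denominators
    return np.where(arr == 0, np.nan, arr)

def nan_to_num(arr):
    # nans become zeros when arithmetic needs finite values
    return np.nan_to_num(arr)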

goduckie commented 3 years ago

I see what has been done; the great Excel tradition of confusing zeros and nans has persevered :persevere:

Appreciate that the degree to which this matters to you depends on your use case; it's just unfortunate that this is embedded in the core data structure.

Please keep open for a while, I'll see if I can take a look.

malankriel commented 1 year ago

We are running into the same problem here, and have been using the df_b.loc[zero_years,'values'] = 1e-100 alternative for now. This fixes the issue for chainladder, but now results in very large (close to infinite) Mack Chainladder variances as the parameter error divides by 1e-100. @goduckie have you done any further work on a workaround for the NaN vs 0 issue?
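
For reference, a minimal sketch of how we observe this (tri_b stands in for a triangle built from the 1e-100-substituted data, as in the hack above):

import chainladder as cl

mack = cl.MackChainladder().fit(tri_b)
mack.mack_std_err_        # per-origin standard errors; these blow up on our data
mack.total_mack_std_err_  # overall standard error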