goduckie opened this issue 3 years ago
You are correct in recognizing that within the library `0` and `nan` are treated the same - specifically, that all zeros are coerced to `nan`. You're also correct that in general `0` and `nan` are not the same thing, and that under certain situations having a `0` distinct from `nan` actually matters.

However, those situations (at least in my experience) are rarer than the benefit of having the two coerced to a common value. Take for instance a larger triangle. When I say larger, I'm referring to a triangle with a high cardinality in its `index` or `values`:
```python
import chainladder as cl

prism = cl.load_sample('prism')
prism.values
```
Such a triangle is inherently sparse. This one in particular is at a claim level, and any single claim will only have values for one origin period. Using a sparse array behind the scenes allows us to cram this triangle, the equivalent of 130K+ unique triangles, into a memory footprint of 4.6 MB. Pretty amazing! Unfortunately, all sparse representations of data allow for one and only one `fill_value`, so my choices are:

- keep `nan` and `0` distinct, which will severely impede the memory benefits obtainable on sparse large triangles, or
- coerce `0` to `nan`, to support substantially larger throughput of data through `chainladder`.

This is not to dismiss your suggestion, because I think you are correct in your assertion. But there is a trade-off in doing so, and it is one that I think would relegate the library to being useful on small/toy datasets only.
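To make the trade-off concrete, here is a minimal sketch with the pydata `sparse` package (shapes and values invented for illustration); only cells equal to the single `fill_value` get compressed away:

```python
import numpy as np
import sparse  # pydata/sparse

# A mostly-missing array standing in for a claim-level triangle.
dense = np.full((100_000, 40), np.nan)
dense[::1000, 0] = 1.0   # a handful of real values
dense[::1000, 1] = 0.0   # a handful of legitimate zeros

# Only cells equal to the one fill_value are dropped; everything else is stored.
nan_fill = sparse.COO.from_numpy(dense, fill_value=np.nan)  # the zeros are stored explicitly
zero_fill = sparse.COO.from_numpy(dense, fill_value=0.0)    # every nan is stored explicitly

print(dense.nbytes)      # ~32 MB dense
print(nan_fill.nbytes)   # a few KB - only the 200 non-nan cells are stored
print(zero_fill.nbytes)  # larger than the dense array - nearly every cell is stored
```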
@jbogaardt, whether this situation is rare depends very much on what type of risk you are analyzing. For those working with aggregated triangles on excess lines / volatile lines this is not an unusual feature of the dataset, and certainly one would hope valid data is not discarded / overwritten.
Also, I do not think it needs to be an either/or choice. Isn't an additional option to keep legitimate zeros on or before the diagonal as `0` rather than converting them to `NaN`, while right of the diagonal the current `NaN` would be maintained as it is currently? Wouldn't this address the issue, whilst not limiting the performance of the package for large (but sparse) datasets?
Regarding "Such a triangle is inherently sparse" - isn't this only because each claim is 'broadcast' to a shape that is consistent across the dataset? I'm guessing this is core to the package, but why is it necessary to do this? If a particular claim has an origin of 2010, why is it necessary to add empty origins after it?
To an end-user, all this can be abstracted away in the simple way you're describing. Easier said than done. The code base is built on the assumption that zeros are coerced to nans, and that assumption is relied on throughout. The same is true for the assumption of a common grain. So the amount of rewrite and testing that would need to go into this would take a lot of time/energy. Maybe that's worth it if there are major obstacles, but I'm not seeing it yet. Help me understand the case you're running into where coercing `0` to `nan` causes a real problem and not just a theoretical one.
The real, practical problem is that when multiple triangles are loaded into a triangle class, the end user has no way of telling from the `cl.Triangle` class:

- when a given line of business actually started, or
- which origin periods were genuinely loss free versus simply missing.

This is because both the shape of the original data is standardized AND zeros are coerced to `np.nan` in the triangle class. Combined, this manipulation results in a loss of information.
An alternative is to create a separate triangle class for each original triangle, but initializing separate triangle classes is quite a lot slower, and other issues persist, e.g. with `tri.loc['b'].to_frame(keepdims=True)` the `nan` cells are not returned (so the original data cannot be extracted / reconstituted).

Example of the loss of information below:
```python
import chainladder as cl
import pandas as pd

raa = cl.load_sample('raa')

# create lob a
df_a = raa.dev_to_val().to_frame(keepdims=True).reset_index()
df_a['Total'] = 'a'

# create lob b
df_b = df_a.copy()
df_b['Total'] = 'b'

# say, lob b started in 1984
df_b = df_b.loc[df_b['origin'] > '1984-01-01']

# lob b is a volatile line with some loss-free years, particularly at the start owing to lower volume
zero_years = df_b['origin'].isin(['1985-01-01', '1986-01-01', '1989-01-01'])
df_b.loc[zero_years, 'values'] = 0

# combine a & b
df = pd.concat([df_a, df_b])
tri = cl.Triangle(data=df, origin='origin', development='valuation',
                  columns='values', index='Total', cumulative=True)

# cannot tell when this lob started, or that the first two years were loss free
tri.loc['b']
```
Thank you. This is an extremely clear example, and it is evident that you're becoming knowledgeable about the library's nuances. But is this something that really needs a change in the code base, or is it just part of the actuary's workflow?
In terms of estimating unpaid claims, the projections would be unaffected. Under a `Chainladder` approach, the `ultimate_` for those claim-free years will be `nan`, which can be considered `0`. If you are going to apply an expected loss method like `BornhuetterFerguson` or `CapeCod`, you will have an exposure vector that would tell you that those loss-free years have some exposure, and your `ultimate_` would be non-zero.
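For reference, a sketch of how the exposure vector enters such a method - the `clrd` sample and the arbitrary 65% apriori are used purely for illustration:

```python
import chainladder as cl

clrd = cl.load_sample('clrd').groupby('LOB').sum()
loss = clrd['CumPaidLoss']
premium = clrd['EarnedPremDIR'].latest_diagonal

# The exposure (premium) is supplied via sample_weight; this is what lets an
# expected-loss method attach a non-zero ultimate to an origin period that has
# shown no losses yet.
bf = cl.BornhuetterFerguson(apriori=0.65)
bf.fit(loss, sample_weight=premium)
bf.ultimate_
```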
Alternatively, if you really want to hack things, you can just replace the zeros in your data with some value that is effectively zero. For example:

```python
df_b.loc[zero_years, 'values'] = 1e-100
```

Doing this will restore the "lost" information, while being no less accurate than the floating point precision errors inherent in Python in general. This hack could probably be added to the `Triangle` constructor as an extra option, maybe with a `coerce_zero_to_nan=False` argument. It would then not need a wholesale rewrite of the library.
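Pending any change to the library, that hack can live in a thin wrapper along these lines; `make_triangle` and the `coerce_zero_to_nan` flag are illustrative names only, not an existing chainladder API:

```python
import chainladder as cl

def make_triangle(data, *, coerce_zero_to_nan=True, value_col='values', **kwargs):
    """Build a Triangle, optionally protecting exact zeros from the nan coercion."""
    if not coerce_zero_to_nan:
        data = data.copy()
        # Replace exact zeros with an 'effectively zero' float before loading.
        data.loc[data[value_col] == 0, value_col] = 1e-100
    return cl.Triangle(data=data, columns=value_col, **kwargs)

# Using the df assembled in the earlier example:
tri = make_triangle(df, coerce_zero_to_nan=False, origin='origin',
                    development='valuation', index='Total', cumulative=True)
```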
Could you point me to the place in the code base where this becomes a problem? I'm struggling to understand why `0.0` should be problematic vs any other float. Is it divide-by-zero style problems?
Picking up on some points:

"is this something that really needs a change in the code base or is it just part of the actuary's workflow?" - it certainly causes workflow issues, but in itself this is a data handling problem.

"for those claim free years will be nan which can be considered 0" - this is the whole issue / aim of the example: in some cases it is not possible to know which years are loss free and which ones are missing. As you point out, this is true for `Chainladder`. I've also come across it when extracting the data from the triangle class using `.to_frame` - the original triangle is lost, and `latest_diagonal` also becomes ambiguous.

"Doing this will restore the 'lost' information" - strictly, it introduces false information as well as new issues; for one, `age_to_age` factors blow up.
The main issue is the sparse support I mention in my first comment. Anywhere in the code base you see `num_to_nan` or `nan_to_num`, we're invoking the assumption that `nan == 0`. It's everywhere, and there is 0 chance... wait, there is `nan` chance (see what I did there 😄) that I will personally prioritize this issue. There are so many ways the library functionality can be expanded for the better, but this one gives a very low return on time investment. Again, you're right in your assessment of the issue, but I just value my time too much to do anything about it, let alone continue arguing about it. All this to say - you are welcome to make a PR to get the functionality you want without creating issues elsewhere. If you have no intention of making a PR, then I will close this issue as out of scope.
I see what has been done - the great Excel tradition of confusing zeros and nans has persevered :persevere:

I appreciate the degree to which this matters to you is dependent on your use case; it's just unfortunate that this is embedded in the core data structure.
Please keep this open for a while - I'll see if I can take a look.
We are running into the same problem here, and have been using the `df_b.loc[zero_years, 'values'] = 1e-100` alternative for now. This fixes the issue for chainladder, but now results in very large (close to infinite) Mack Chainladder variances, as the parameter error divides by `1e-100`.
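The mechanics of the blow-up in one line (numbers invented for illustration):

```python
# Any statistic that divides by the 'effectively zero' placeholder explodes: a
# development factor out of 1e-100, or a variance term with it in the
# denominator, lands somewhere around 1e+100.
placeholder, next_value = 1e-100, 150.0
print(next_value / placeholder)  # 1.5e+102
```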
@goduckie, have you done any further work on a workaround for the `NaN` vs `0` issue?
Legitimate zero entries in triangle data (e.g. an origin period had zero loss experience) are converted to `NaN`. These entries should remain as zeros, as a zero is distinct and different from missing.

This conversion to `NaN` causes subsequent issues (see below), as well as others, e.g. application of the Chainladder projection method results in a `NaN` ultimate rather than 'zero' for that origin period.

Results in the following - one would expect to see '0' in the first three entries of the 1988 origin period.

The 1988 entries are removed from the triangle data:
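A minimal reproduction along these lines (my own sketch, reusing the round-trip pattern from the example earlier in this thread, and making the whole 1988 origin loss free for simplicity):

```python
import chainladder as cl

raa_df = cl.load_sample('raa').dev_to_val().to_frame(keepdims=True).reset_index()
raa_df['Total'] = 'raa'

# Make 1988 a genuinely loss-free origin period.
loss_free = raa_df['origin'].isin(['1988-01-01'])
raa_df.loc[loss_free, 'values'] = 0

tri = cl.Triangle(data=raa_df, origin='origin', development='valuation',
                  columns='values', index='Total', cumulative=True)

# The zeros are coerced to nan on load, so 1988 looks like missing data: per the
# report above, its entries drop out of the triangle and the projected ultimate_
# is nan rather than the zero one would expect.
cl.Chainladder().fit(tri).ultimate_
```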