casact / chainladder-python

Actuarial reserving in Python
https://chainladder-python.readthedocs.io/en/latest/
Mozilla Public License 2.0
192 stars 71 forks source link

DevelopmentML possible inconsistencies between implementation and documentation #533

Open cdietrich215 opened 5 months ago

cdietrich215 commented 5 months ago

Thanks for this excellent implementation of using machine learning methods on reserving data. During review I think I've noticed a couple of things in the code that may be out of step with the documentation, but I may be missing something with many of these.

jbogaardt commented 4 months ago

Hi @cdietrich21, thanks for the feedback. This is great and your commentary makes sense. Definitely need to update the docs to not reference Munich and be a bit clearer on how it works. Being a lesser used estimator, this one hasn't received a whole lotta love. I think we're muddying what "response" means in the docs. The triangle data itself is both the independent variables and the response (lagged of course). I believe when we prep the independent variables, converting to incremental basis should impact the "response" as well. That said, your last bullet suggests that we're converting to incrementals regardless of the end-user selection, and I can confirm that the code is wrong in doing this.

I don't believe we have any test cases for non-incremental fits, but we probably should if we claim to support toggling it. I also agree it should clearer how it impacts the autoregressive parameter - which itself doesn't have test cases to ensure accuracy. That is- autoregressive may have rough edges even beyond the transparency you're looking for.

cdietrich215 commented 4 months ago

Thanks @jbogaardt. Seems to be a very useful/flexible implementation and a great feature of the package so I'm happy to help improve it.

It's possible my second and fourth bullet points are related if this is where it was attempted to convert to incremental. But I think 1) it's not enforcing that the last column is indeed the response and 2) it occurs regardless of the fit_incrementals argument.

Agreed that the response could potentially be the same as some of the independent variables (if we're modeling cumulative CumPaidLoss in the clrd dataset, it would be both the response and presumably also the most important predictor). However I think it could be good to be careful about how this is handled - for instance if we want to model incremental paid losses, then cumulative paid losses could still be an important predictor. I think this should always create a new column for the response instead of editing the old column in place (although its possible I'm misunderstanding what this model is doing).

If the user wanted to make predictor columns incremental - I think you could pretty easily update theautoregressive argument to be a more general argument that creates a new predictor column. The user could optionally add autoregression or decide if it's incremental. Then decide which predictors end up getting used in their patsy formula

jbogaardt commented 4 months ago

Seems to be a very useful/flexible implementation and a great feature of the package so I'm happy to help improve it.

You seem to have a pretty good sense for the issues and ways to improve it. I for one would absolutely love contributions on this. I think we have to just be careful that the dependent estimators (TweedieGLM, BarnettZehnwirth) which inherit from DevelopmentML continue to function as intended.

cdietrich215 commented 4 months ago

Happy to contribute on this and suggest updates to the code. Do I need to be added as a contributor to push a branch for this?

cdietrich215 commented 4 months ago

I had a chance to spend more time with this and get a better feel for what the code is doing. Going to restate my suggestions since I think I originally may have misunderstood some elements of what the code was doing. Happy to help suggest code once again

None of these would impact TweedieGLM or BarnettZehnwirth if implemented correctly