DevelopmentML possible inconsistencies between implementation and documentation

cdietrich215 commented 5 months ago

Thanks for this excellent implementation of using machine learning methods on reserving data. During review I think I've noticed a couple of things in the code that may be out of step with the documentation, but I may be missing something with many of these.

Just a note that the docstring for the fit method references the Munich adjustment (I think held over from the DevelopmentBase class).
The documentation states that the fit_incrementals argument controls "whether the response variable should be converted to an incremental basis for fitting", but I don't believe y_ml is ever converted to incremental values before fitting the ML model of choice in line 182, which I feel is out of line with what the documentation states
I notice that the fit_incrementals argument also controls whether or not autoregressive values are converted to incremental before fitting. While I agree it could sometimes make since to convert these, I don't think many users would infer this is happening from the documentation. Perhaps it'd be better to pass this alongside the autoregressive parameter?
On a side note I believe this merge is odd, it seems like it converts the last column to incremental regardless of if its the response or if fit_incrementals is used

jbogaardt commented 4 months ago

Hi @cdietrich21, thanks for the feedback. This is great and your commentary makes sense. Definitely need to update the docs to not reference Munich and be a bit clearer on how it works. Being a lesser used estimator, this one hasn't received a whole lotta love. I think we're muddying what "response" means in the docs. The triangle data itself is both the independent variables and the response (lagged of course). I believe when we prep the independent variables, converting to incremental basis should impact the "response" as well. That said, your last bullet suggests that we're converting to incrementals regardless of the end-user selection, and I can confirm that the code is wrong in doing this.

I don't believe we have any test cases for non-incremental fits, but we probably should if we claim to support toggling it. I also agree it should clearer how it impacts the autoregressive parameter - which itself doesn't have test cases to ensure accuracy. That is- autoregressive may have rough edges even beyond the transparency you're looking for.

cdietrich215 commented 4 months ago

Thanks @jbogaardt. Seems to be a very useful/flexible implementation and a great feature of the package so I'm happy to help improve it.

It's possible my second and fourth bullet points are related if this is where it was attempted to convert to incremental. But I think 1) it's not enforcing that the last column is indeed the response and 2) it occurs regardless of the fit_incrementals argument.

Agreed that the response could potentially be the same as some of the independent variables (if we're modeling cumulative CumPaidLoss in the clrd dataset, it would be both the response and presumably also the most important predictor). However I think it could be good to be careful about how this is handled - for instance if we want to model incremental paid losses, then cumulative paid losses could still be an important predictor. I think this should always create a new column for the response instead of editing the old column in place (although its possible I'm misunderstanding what this model is doing).

If the user wanted to make predictor columns incremental - I think you could pretty easily update theautoregressive argument to be a more general argument that creates a new predictor column. The user could optionally add autoregression or decide if it's incremental. Then decide which predictors end up getting used in their patsy formula

jbogaardt commented 4 months ago

Seems to be a very useful/flexible implementation and a great feature of the package so I'm happy to help improve it.

You seem to have a pretty good sense for the issues and ways to improve it. I for one would absolutely love contributions on this. I think we have to just be careful that the dependent estimators (TweedieGLM, BarnettZehnwirth) which inherit from DevelopmentML continue to function as intended.

cdietrich215 commented 4 months ago

Happy to contribute on this and suggest updates to the code. Do I need to be added as a contributor to push a branch for this?

cdietrich215 commented 4 months ago

I had a chance to spend more time with this and get a better feel for what the code is doing. Going to restate my suggestions since I think I originally may have misunderstood some elements of what the code was doing. Happy to help suggest code once again

I do think fit_incrementals is not correctly operating but this should be an easy fix in the _prep_X_ml method
Autregressors are definitely broken, I think perhaps both when they get defined in _prep_X_ml and when they are used in get_triangle_ml. Though I'm still trying to still think about if they're necessary to include or not - unsure if any papers use them in this way, where we project out the auto regressor which then is used in model prediction. Part of me think this is a useful idea but also that it could potentially make model variance a little crazy.
I'd suggest using sklearn's MultiOutputRegressor to wrap the model in cases when len(y_ml)> 1

None of these would impact TweedieGLM or BarnettZehnwirth if implemented correctly

casact / chainladder-python

DevelopmentML possible inconsistencies between implementation and documentation #533