gamlss-dev / gamlss

gamlss: Generalized Additive Models for Location Scale and Shape
https://CRAN.R-project.org/package=gamlss

Anonymize fitted model #9

Closed tmspvn closed 2 months ago

tmspvn commented 2 months ago

Dear Authors,

Currently, the predict function of a fitted model requires the original training data. For my project I would like to publicly share the model fitted on my data, but the data is sensitive and cannot be shared. Is there a way I could anonymize a fitted model? Or is there an alternative approach that doesn't require sharing the data at all, just an instance of the model/parameters?

Thanks a lot

zeileis commented 2 months ago

Thanks for your interest. My reading of the code of the predict() method is that both the old and the new data are required for computing predictions because the model is being refit, see:

https://github.com/gamlss-dev/gamlss/blob/main/R/predict.gamlss_23_12_21.R#L88-L138

I'm not sure whether there exists another possibility in gamlss, Mikis @mstasinopoulos might be able to say more.

But we are currently working on a streamlined re-implementation of gamlss under the working title gamlss2, led by Niki @freezenik. I think the chances for anonymizing are better here. And if it is not yet possible to get an anonymized version, Niki might be able to add this feature.

What I have tried:

## fit model
library("gamlss2")
data("cars", package = "datasets")
cars_original <- cars
m <- gamlss2(dist ~ s(speed) | s(speed), data = cars_original, family = NO)

## delete original data and eliminate certain elements from the model
rm(cars_original)
for(i in c("fitted.values", "fitted.linear", "y", "x", "results")) m[[i]] <- NULL

## new data, predictions still work
cars_new <- data.frame(speed = 1:5 * 5)
predict(m, newdata = cars_new)
##          mu     sigma
## 1  5.862649  7.407651
## 2 23.039123 10.062998
## 3 40.771961 13.659545
## 4 59.036068 18.510117
## 5 77.665437 25.069072

So far, so good. The only catch is that the $fitted.specials and $specials elements still implicitly contain "some" information about the original data set. I think these do not include the original response or the original regressors, only information about the corresponding fitted values. But maybe it is possible to clean this up as well, Niki?

tmspvn commented 2 months ago

Many thanks zeileis for the answer.

That would be a very useful feature. If I may ask, how far along is the development of gamlss2?

freezenik commented 2 months ago

If you set the argument light = TRUE in gamlss2(), all data and fitted.values are removed from the returned model:

m <- gamlss2(dist ~ s(speed) | s(speed), data = cars_original, family = NO, light = TRUE)

The fitted.specials do not contain fitted.values, only coefficients and some more information.
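
For sharing, the light = TRUE fit could be combined with base R's saveRDS() and readRDS(). The following is only a minimal sketch, assuming gamlss2 is installed and that predict() on a light model works as in the example earlier in this thread; the file name cars_model.rds is hypothetical.

```r
## a minimal sketch, assuming gamlss2 is installed
library("gamlss2")
data("cars", package = "datasets")

## fit without keeping data and fitted values in the model object
m <- gamlss2(dist ~ s(speed) | s(speed), data = cars, family = NO, light = TRUE)

## save only the fitted model (hypothetical file name, written to a temp dir here)
model_file <- file.path(tempdir(), "cars_model.rds")
saveRDS(m, model_file)

## recipient side: load gamlss2, restore the model, predict on new data
m_shared <- readRDS(model_file)
names(m_shared)  ## inspect which components are still stored
cars_new <- data.frame(speed = 1:5 * 5)
p <- predict(m_shared, newdata = cars_new)
p
```

Before publishing such a file, it is probably worth inspecting the restored object (e.g., with names() or str()), because serialized R objects can carry environments attached to formulas or functions. If the model was fitted inside a function, those environments might still reference the original data.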

freezenik commented 2 months ago

I think gamlss2 is already very stable. A bonus of using gamlss2 is that you can use all mgcv smooth term constructors, i.e., s(), te(), ti(). It would be nice if you could try it out and give us some feedback :). The best place to start is the manual page ?gamlss2.

zeileis commented 2 months ago

Aaah, nice! :tada: I thought that you must have an option for that somewhere.

tmspvn commented 2 months ago

Thanks a lot everyone!

I will give gamlss2 a try and make sure to report back!

zeileis commented 2 months ago

Great, thanks! I'm closing this issue now. Feel free to comment here nevertheless, or open a new issue at https://github.com/gamlss-dev/gamlss2/issues