drsimonj / twidlr

data.frame-based API for model and predict functions

Default to dropping data information if kept? #27

Open drsimonj opened 7 years ago

drsimonj commented 7 years ago

Some models retain copies of the data frame (and related objects) in the fitted model in order to make predictions. This can be convenient but has at least two downsides (described below). This issue proposes that, where such information is not needed, models that store data by default have it removed from the fitted model. E.g., by default, lm should set the arg model = FALSE (and we should look into all of lm's model, x, and y args).

Downsides of keeping the original data.frame by default

  1. It creates a memory overhead. E.g., for lm:
object.size(lm(mpg ~ ., mtcars))
#> 45768 bytes
object.size(lm(mpg ~ ., mtcars, model = FALSE))
#> 28152 bytes

Given that twidlr requires a data frame for predict, if the only reason this info is retained is to call predict, then it can be dropped.
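As a rough check of this claim, the sketch below fits the same lm with and without the stored model frame and confirms that predictions on an explicitly supplied data frame are unaffected. It uses only base R and the built-in mtcars data; the names fit_full and fit_slim are illustrative and not part of twidlr.

```r
# Fit the same model with and without the stored model frame.
fit_full <- lm(mpg ~ ., data = mtcars)
fit_slim <- lm(mpg ~ ., data = mtcars, model = FALSE)

# The slim fit carries no copy of the model frame.
is.null(fit_full$model)
#> [1] FALSE
is.null(fit_slim$model)
#> [1] TRUE

# lm() already leaves out the design matrix and response by default
# (x = FALSE, y = FALSE), so `model` is the component that matters here.
is.null(fit_full$x)
#> [1] TRUE

# Predictions with an explicit data frame do not depend on the stored copy.
all.equal(
  predict(fit_full, newdata = mtcars),
  predict(fit_slim, newdata = mtcars)
)
#> [1] TRUE
```

This only covers lm; whether other models can drop their stored data without breaking predict would need to be checked model by model.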

  2. It is inconsistent between models and thus misleading. For example, lm stores the original data by default, so predict works properly even when no new data is supplied. Other models do not store the data and instead point to the original data frame in the global environment, so their predictions change if that data frame is modified or removed after fitting. E.g., see examples here. The same behaviour can be reproduced with lm by setting model = FALSE.
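A minimal sketch of this inconsistency, using lm with model = FALSE to stand in for a model that does not store its data. The object names d, fit_kept, and fit_dropped are illustrative. When predict is called without new data, the fit without a stored model frame re-evaluates d from the environment where the formula was created, so it silently picks up whatever d happens to be at that point.

```r
d <- mtcars

fit_kept    <- lm(mpg ~ wt, data = d)                 # stores the model frame
fit_dropped <- lm(mpg ~ wt, data = d, model = FALSE)  # does not

# Overwrite the data frame after fitting.
d <- head(mtcars, 5)

# The fit with a stored model frame still predicts for the 32 original rows.
length(predict(fit_kept))
#> [1] 32

# The fit without it looks `d` up again and predicts for the 5 new rows.
length(predict(fit_dropped))
#> [1] 5
```

Passing the data explicitly, e.g. predict(fit_dropped, newdata = mtcars), avoids the lookup entirely, which is the calling convention twidlr already requires.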