bbolker / broom.mixed

tidy methods for mixed models in R
229 stars 23 forks source link

storing fitted objects for examples and tests #4

Closed bbolker closed 6 years ago

bbolker commented 6 years ago

It will be useful to store fitted models from a wide variety of packages so that they can be retrieved for examples and testing. This has two major benefits: (1) examples and tests can be run without packages installed; (2) saves lots of time in examples/testing (because fits don't actually have to be computed).

The major downside to this is that we therefore won't know when there are changes in the upstream packages that cause problems.

The way to resolve this is probably to set up an automated system for updating all of the packaged fits, and to do this frequently ...

bbolker commented 6 years ago

this is partly resolved in ea2f64b12e6de2b17c6f5fc9fef5a660900683dc

paul-buerkner commented 6 years ago

For brms models, I think storing a fitted model is the only way to test it, because compiling Stan models during testing takes a lot of time and used to fail on windows machines in the past.

dmenne commented 6 years ago

"It should not be used for other data files needed by the package, and the convention has grown up to use directory inst/extdata for such files."

Should we follow the suggestion or follow broom?

dmenne commented 6 years ago

@paul-buerkner I agree, but I am a bit afraid that storing models could lead to version problems. Is there a canonical method to force an update of the compiled models before push/CRAN?

paul-buerkner commented 6 years ago

In brms, I store such example models under /R/sysdata.rda. Before each release, I recompute all the models using a script within /tests that is listed in .Rbuildignore and thus not part of the release itself. See https://github.com/paul-buerkner/brms/blob/master/tests/brmsfit_examples.R.

Not sure if this is the most reasonable approach, but it has been working for me quite well for a long time.

dmenne commented 6 years ago

Sounds reasonable. Do you manually run that script, or does it run automatically when in /tests?

dmenne commented 6 years ago

devtools::use_data << thanks for pointing me to that one

paul-buerkner commented 6 years ago

I run it manually.

Am 09.01.2018 10:38 schrieb "Dieter Menne" notifications@github.com:

devtools::use_data << thanks for pointing me to that one

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/bbolker/broom.mixed/issues/4#issuecomment-356231088, or mute the thread https://github.com/notifications/unsubscribe-auth/AMVtABhwND7R5Vb5mB4c6eSOfrvBZgx7ks5tIzONgaJpZM4RPfOJ .

paul-buerkner commented 6 years ago

Basically it does not matter where the script lives.

Am 09.01.2018 10:47 schrieb "Paul Buerkner" paul.buerkner@gmail.com:

I run it manually.

Am 09.01.2018 10:38 schrieb "Dieter Menne" notifications@github.com:

devtools::use_data << thanks for pointing me to that one

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/bbolker/broom.mixed/issues/4#issuecomment-356231088, or mute the thread https://github.com/notifications/unsubscribe-auth/AMVtABhwND7R5Vb5mB4c6eSOfrvBZgx7ks5tIzONgaJpZM4RPfOJ .

dmenne commented 6 years ago

@bbolker If we follow Paul's suggestion - and Pope Hadley's preset - data would be stored in /data (yet another location), and in rda (not RDS as you are using). I am working on the mcmc-examples, where using pre-stored data really important to keep CRAN happy.

Correction: with internal = TRUE, would be stored in R/sysdata.

paul-buerkner commented 6 years ago

Please note the internal = TRUE argument in use_data storing data in R/sysdata.Rda. these objects are not exported but only live in the packages internal namespace. I don't think this causes any harm.

Am 09.01.2018 11:40 schrieb "Dieter Menne" notifications@github.com:

Cancel that. Data in /data written with use_data are data()-accessible, but we do not want to polute namespace with test data. Having a separate script to generate the test-data nevertheless sounds good

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/bbolker/broom.mixed/issues/4#issuecomment-356247236, or mute the thread https://github.com/notifications/unsubscribe-auth/AMVtABWnCKPa5EaqphY467iaogSB--0wks5tI0IygaJpZM4RPfOJ .

bbolker commented 6 years ago

I do already have the beginnings of such a script here. I'm happy to switch it over to rda instead of rds. Thinking about the merits of different places to store it ... I think I prefer inst/extdata (it's in inst/example_data right now ...)

dmenne commented 6 years ago

Ok, I will follow this pattern.

dmenne commented 6 years ago

Apologies for me going in circles. When Paul mentionend it, I first found use_data a clever method, and following it rda should have been the method of choice. Checking again, rds is a better choice because you can assign it. Nevertheless, switch to extdata as directory name.

paul-buerkner commented 6 years ago

Just to clarify what use_data(..., internal = TRUE) does so that we are on the same page: it writes all model objects into one .Rda file in R/ called sysdata.Rda. The R language understands this file and makes the contained objects available automatically within the packge. That is, if we store for example fit_test in sysdata.Rda, we can access it within the package or its tests simply by calling the object's name fit_test, which I think is pretty convenient. I don't want to argue that this is necessarily better than many separate .RDS files, but at least it uses a build in functionality of R itself without much additional stuff to do around it.

dmenne commented 6 years ago

It means that we have to take care not to use the same name twice in the samples. Fair enough. I will create a branch using this method as an alternative.

bbolker commented 6 years ago

if we do this we would need a naming convention such as lme4_fit_test, brms_fit_test, etc.

bbolker commented 6 years ago

I'm going to close this. We seem to have a reasonable storage protocol now (inst/extdata/<modeltype>_example.rd[as]); we might want to be a little more careful about whether <modeltype> is the class of the object or the package name (at the moment we have brmsfit (not brms) and lme4 (not glmer), which is inconsistent), but it seems OK.