femiguez / nlraa

Nonlinear Regression for Agricultural Applications
https://femiguez.github.io/nlraa-docs/index.html
22 stars 2 forks source link

Need data argument for bootstrapping #7

Closed billdenney closed 3 years ago

billdenney commented 3 years ago

Similar to #4, can a data argument be added to boot_nlme, boot_lme, and boot_nls?

Here is a reprex:

require(nlme)
#> Loading required package: nlme

fit_the_model <- function() {
  data(barley, package = "nlraa")
  barley2 <- subset(barley, year < 1974)
  fit.lp.gnls2 <- gnls(yield ~ linp(NF, a, b, xs), start=c(a=170, b=100, xs=5), data = barley2)
  barley2$year.f <- as.factor(barley2$year)
  cfs <- coef(fit.lp.gnls2)
  fit.lp.gnls3_inner <-
    update(fit.lp.gnls2,
           params = list(a + b + xs ~ year.f),
           start = c(cfs[1], 0, 0, 0, 
                     cfs[2], 0, 0, 0,
                     cfs[3], 0, 0, 0))
  fit.lp.gnls3_inner
}

fit.lp.gnls3 <- fit_the_model()
#> Error in linp(NF, a, b, xs): could not find function "linp"

sims <- simulate_nlme(fit.lp.gnls3, nsim = 3)
#> Error in simulate_nlme(fit.lp.gnls3, nsim = 3): could not find function "simulate_nlme"
boot <- boot_nlme(fit.lp.gnls3, nsim = 3)
#> Error in boot_nlme(fit.lp.gnls3, nsim = 3): could not find function "boot_nlme"

barley3 <- subset(barley, year < 1974)
barley3$year.f <- as.factor(barley3$year)

boot <- boot_nlme(fit.lp.gnls3, nsim = 3, data=barley3)
#> Error in boot_nlme(fit.lp.gnls3, nsim = 3, data = barley3): could not find function "boot_nlme"

Created on 2021-08-01 by the reprex package (v2.0.0)

femiguez commented 3 years ago

Thanks for reporting this! boot* will indeed fail if the data is not in scope. I'll add the data argument as I did for simulate*

femiguez commented 3 years ago

@billdenney notice that the data argument for boot_* will not solve the issue for your particular example. Objects created within a function only live inside that function's environment. For example:

f <- function() x <- 3
x
## Error: object 'x' not found

object barley2 only exists in the environment of fit_the_model. There are many ways of addressing this. The first one is normally discouraged (because of potential side effects), but I have used it occasionally if I think it won't cause problems. It involves using the <<- operator.

Option 1:

library(nlraa)
library(nlme)

fit_the_model <- function() {
  data(barley, package = "nlraa")
  barley2 <- subset(barley, year < 1974)
  fit.lp.gnls2 <- gnls(yield ~ linp(NF, a, b, xs), start=c(a=170, b=100, xs=5), data = barley2)
  barley2$year.f <- as.factor(barley2$year)
  barley2 <<- barley2 ## barley2 is now available in the parent environment
  cfs <- coef(fit.lp.gnls2)
  fit.lp.gnls3_inner <-
    update(fit.lp.gnls2,
           params = list(a + b + xs ~ year.f),
           start = c(cfs[1], 0, 0, 0, 
                     cfs[2], 0, 0, 0,
                     cfs[3], 0, 0, 0))
  fit.lp.gnls3_inner
}

fit.lp.gnls3 <- fit_the_model()

sims <- simulate_nlme(fit.lp.gnls3, nsim = 3, data = barley2)

I realize that you might be writing a function which creates or modifies data and then fits a model. In that case, I think it is better to return a list with the fitted model and the new/modified data (and other required objects). This, to me, is a better approach. Now, it is not necessary to pass the data to simulate or boot because it is available in the global environment, so that nlme::getData can find it.

Option 2:

fit_the_model <- function() {
  data(barley, package = "nlraa")
  barley2 <- subset(barley, year < 1974)
  fit.lp.gnls2 <- gnls(yield ~ linp(NF, a, b, xs), start=c(a=170, b=100, xs=5), data = barley2)
  barley2$year.f <- as.factor(barley2$year)
  cfs <- coef(fit.lp.gnls2)
  fit.lp.gnls3_inner <-
    update(fit.lp.gnls2,
           params = list(a + b + xs ~ year.f),
           start = c(cfs[1], 0, 0, 0, 
                     cfs[2], 0, 0, 0,
                     cfs[3], 0, 0, 0))
  ## cfs also needs to be available to re-fit the model
  ans <- list(model = fit.lp.gnls3_inner, data = barley2, cfs = cfs)
  ans
}

res <- fit_the_model()
fit.lp.gnls3 <- res$model
barley2 <- res$data
cfs <- res$cfs

sims <- simulate_nlme(fit.lp.gnls3, nsim = 3)

fit.lp.gnls3.bt <- boot_nlme(fit.lp.gnls3, R = 100)
billdenney commented 3 years ago

Sorry for the bad reprex. I was running through the process too quickly. And, thank you for the suggestion.

My goal is similar to option 2 in your example, but my code is a good bit more complex. (That's not the biggest deal in the world, but it does add complexity to everything that has to be kept track of which is what I was hoping to minimize.)

In option 2, the user has to remember to set barley2 <- res$data, and the barley2 variable must be set in the global environment. My use case is bootstrapping is happening within a function and within a drake workflow which restricts the user from assigning to the global environment via <<- due to reproducibility issues. In this better-defined scenario, I hope that a data argument to the boot_* functions could fix the issue.

So, a better reprex for my use case is:

library(nlme)
library(nlraa)

fit_the_model <- function() {
  data(barley, package = "nlraa")
  barley2 <- subset(barley, year < 1974)
  fit.lp.gnls2 <- gnls(yield ~ linp(NF, a, b, xs), start=c(a=170, b=100, xs=5), data = barley2)
  barley2$year.f <- as.factor(barley2$year)
  cfs <- coef(fit.lp.gnls2)
  fit.lp.gnls3_inner <-
    update(fit.lp.gnls2,
           params = list(a + b + xs ~ year.f),
           start = c(cfs[1], 0, 0, 0, 
                     cfs[2], 0, 0, 0,
                     cfs[3], 0, 0, 0))
  ## cfs also needs to be available to re-fit the model
  ans <- list(model = fit.lp.gnls3_inner, data = barley2, cfs = cfs)
  ans
}

bootstrap_the_model <- function(model_with_data) {
  fit.lp.gnls3 <- model_with_data$model
  barley2 <- model_with_data$data
  cfs <- model_with_data$cfs

  #sims <- simulate_nlme(fit.lp.gnls3, nsim = 3)

  fit.lp.gnls3.bt <- boot_nlme(fit.lp.gnls3, R = 100)
  fit.lp.gnls3.bt
}

res <- fit_the_model()
my_boot <- bootstrap_the_model(res)
#> Error in eval(mCall$data): object 'barley2' not found

Created on 2021-08-04 by the reprex package (v2.0.0)

femiguez commented 3 years ago

No need to apologize at all. I was trying to be as clear as possible in my explanation. The version of the package here in github now has the data argument for the boot_ functions. (If you try it let me know how it goes). However, one issue with your example is that you also need the cfs object, which cannot be found by boot_nlme (environments are nested too deep, I think). My simple fix for this is to replace the cfs object with just numbers in your example above. This is what I mean:

  fit.lp.gnls3_inner <-
    update(fit.lp.gnls2,
           params = list(a + b + xs ~ year.f),
           start = c(176, 0, 0, 0, 
                     27, 0, 0, 0,
                     6, 0, 0, 0))

If you can use the SS version of the function which can help with convergence. (SSlinp does not need starting values - this is irrelevant in the example above but it might be in your specific case).

femiguez commented 3 years ago

This is resolved in the current version of the code in github. Is it okay if I close this issue?

billdenney commented 3 years ago

Sure. Thanks for implementing it!