Figure out a way to avoid `models[[1]]`

joelnitta commented 4 months ago

maybe we could use a dataframe for storing models instead?

joelnitta commented 4 months ago

This is tricky, because non-standard evaluation is used to define models for lm(). We can't just provide a character vector of model specifications like "bill_depth_mm ~ bill_length_mm" and map over those (well, we could with another custom function, but that is asking a lot of the learners).
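For reference, the "another custom function" route could look roughly like the sketch below, where as.formula() turns each string into a formula before it reaches lm(). The helper name fit_from_string and the simulated toy_penguins data frame are made up for illustration; the real pipeline would pass penguins_data instead.

```r
# Toy stand-in for penguins_data (simulated values, for illustration only)
set.seed(1)
toy_penguins <- data.frame(
  bill_length_mm = rnorm(30, mean = 44, sd = 5),
  species = rep(c("Adelie", "Chinstrap", "Gentoo"), each = 10)
)
toy_penguins$bill_depth_mm <-
  17 + 0.1 * toy_penguins$bill_length_mm + rnorm(30)

# Hypothetical helper: convert a model specification string to a formula,
# then fit a linear model with it
fit_from_string <- function(spec, data) {
  lm(as.formula(spec), data = data)
}

# Model specifications as a named character vector
model_specs <- c(
  combined_model = "bill_depth_mm ~ bill_length_mm",
  species_model = "bill_depth_mm ~ bill_length_mm + species",
  interaction_model = "bill_depth_mm ~ bill_length_mm * species"
)

# Map over the character vector; the result is a named list of lm objects
models <- lapply(model_specs, fit_from_string, data = toy_penguins)
```

Note that the result is still a list, so pulling out a single model still needs `models[[1]]` or `models$combined_model`. This only moves the problem rather than solving it.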

Furthermore, the design of branching in {targets} nudges us to use dataframes (or tibbles) as targets. So when designing custom functions that will be used in branching, it helps to think of how the function will work on one row of input. We can write a custom function that looks clean in the final plan and produces clean output (a tidy dataframe), but the contents of the function are rather complicated since it has to work with a one-row dataframe as input. This will be tedious to explain to novices (and it still requires indexing with [[ anyways).

Finally, the approach of including models as a list-column in a dataframe is a rather advanced topic.

Anyway, here is a sketch that builds the models in a tibble, then branches over the rows of the tibble:

source("R/packages.R")
source("R/functions.R")

summarize_model <- function(model_tibble) {
  model_name <- model_tibble$model_name
  model <- model_tibble$model[[1]]
  glance(model) |>
    mutate(model_name = model_name) |>
    relocate(model_name, .before = 1)
}

tar_plan(
  # Load raw data
  tar_file_read(
    penguins_data_raw,
    path_to_file("penguins_raw.csv"),
    read_csv(!!.x, show_col_types = FALSE)
  ),
  # Clean data
  penguins_data = clean_penguin_data(penguins_data_raw),
  # Build models
  models = tibble(
    model_name = c("combined_model", "species_model", "interaction_model"),
    model = list(
      lm(bill_depth_mm ~ bill_length_mm, data = penguins_data),
      lm(bill_depth_mm ~ bill_length_mm + species, data = penguins_data),
      lm(bill_depth_mm ~ bill_length_mm * species, data = penguins_data)
    )
  ),
  # Get model summaries
  tar_target(
    model_summaries,
    summarize_model(models),
    pattern = map(models)
  )
)

I now realize a more natural introduction to branching would be to branch over different sets of input instead of different models.

@multimeric keen to hear your thoughts!

joelnitta commented 2 weeks ago

Update: just taught this workshop again, and this part is very difficult to teach since the details are so complicated. We should definitely use a simpler example for branching. Maybe not even use the models at all.

joelnitta commented 2 weeks ago

NEW IDEA: instead of branching over the list of models, split up the original data set by species using tar_group(), then build a model for each separately. It will then be much easier to reason about the subsequent steps of extracting model parameters and predictions using broom::glance() and broom::augment(). The downside of this approach is that it is technically not statistically sound (making a separate model for each species instead of a single model that includes species as a categorical predictor variable). But the point of the workshop is to teach how to use {targets}, not statistics, so I think that's OK.
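A minimal sketch of what that could look like in `_targets.R`, assuming the {targets}, {dplyr}, {broom}, and {palmerpenguins} packages are installed. The helper glance_species and the target names are hypothetical, and palmerpenguins::penguins stands in for the cleaned penguins_data target so the example is self-contained.

```r
library(targets)

# Hypothetical helper: fit one model to one species' rows (dataframe in)
# and return a one-row summary labelled with the species (dataframe out)
glance_species <- function(penguins_group) {
  model <- lm(bill_depth_mm ~ bill_length_mm, data = penguins_group)
  broom::glance(model) |>
    dplyr::mutate(
      species = as.character(unique(penguins_group$species)),
      .before = 1
    )
}

list(
  # Split the data by species; tar_group() marks the groups and
  # iteration = "group" tells downstream targets to branch over them
  tar_target(
    penguins_by_species,
    palmerpenguins::penguins |>
      dplyr::group_by(species) |>
      tar_group(),
    iteration = "group"
  ),
  # One branch per species; each branch receives that species' rows
  tar_target(
    species_summaries,
    glance_species(penguins_by_species),
    pattern = map(penguins_by_species)
  )
)
```

Since every branch returns a dataframe, species_summaries should come out as the branches row-bound into a single dataframe, with no `[[` indexing anywhere in the plan.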

multimeric commented 2 weeks ago

I think that would be better. Anything that avoids using a list is good. Even branching over a single vector would be better, because it avoids changing the data type.

joelnitta commented 2 weeks ago

Right... of course, the output of lm() is a list, so that makes it awkward to include directly in the pipeline. If we want to avoid branching over lists, we would have to build the model twice, once for broom::augment() and once for broom::glance(). Something like this (assuming penguins_data is coming in as a branch split up by species):

# Fit the model to one species' rows and return per-observation fitted values
augment_penguins <- function(penguins_data) {
  model <- lm(bill_length_mm ~ bill_depth_mm, data = penguins_data)
  augment(model) |>
    mutate(species = unique(penguins_data$species))
}

# Fit the same model again and return a one-row model summary
glance_penguins <- function(penguins_data) {
  model <- lm(bill_length_mm ~ bill_depth_mm, data = penguins_data)
  glance(model) |>
    mutate(species = unique(penguins_data$species))
}

That feels a little awkward because in a "production" situation you would only build the model once. But for teaching {targets} it's probably OK? It sure is easier to reason about with dataframe in and dataframe out.

carpentries-incubator / targets-workshop

Figure out a way to avoid `models[[1]]` #43