carpentries-incubator / targets-workshop

Pre-alpha {targets} workshop
https://carpentries-incubator.github.io/targets-workshop/

Figure out a way to avoid `models[[1]]` #43

Open joelnitta opened 2 months ago

joelnitta commented 2 months ago

Maybe we could use a dataframe for storing models instead?

joelnitta commented 2 months ago

This is tricky, because non-standard evaluation is used to define models for `lm()`. We can't just provide a character vector of model specifications like `"bill_depth_mm ~ bill_length_mm"` and map over those (well, we could with another custom function, but that is asking a lot of the learners).
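For reference, here is a rough sketch of what that extra custom function might look like. It is only an illustration: `fit_model_from_string()` is a hypothetical name, and I'm using `palmerpenguins::penguins` as a stand-in for the cleaned data (it has the same `bill_depth_mm`, `bill_length_mm`, and `species` columns used in the formulas).

```r
library(palmerpenguins) # assumption: stand-in for the cleaned penguins data

# Hypothetical helper: convert a model specification stored as a string
# into a formula, then fit a linear model with it
fit_model_from_string <- function(model_spec, data) {
  lm(as.formula(model_spec), data = data)
}

# Model specifications as a named character vector
model_specs <- c(
  combined_model = "bill_depth_mm ~ bill_length_mm",
  species_model = "bill_depth_mm ~ bill_length_mm + species",
  interaction_model = "bill_depth_mm ~ bill_length_mm * species"
)

# Map over the specifications; lapply() keeps the names
models <- lapply(model_specs, fit_model_from_string, data = penguins)
```

This works, but learners would need to understand `as.formula()` and why the string has to be converted, which is the extra burden mentioned above.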

Furthermore, the design of branching in {targets} nudges us to use dataframes (or tibbles) as targets. So when designing custom functions that will be used in branching, it helps to think of how the function will work on one row of input. We can write a custom function that looks clean in the final plan and produces clean output (a tidy dataframe), but the contents of the function are rather complicated, since it has to work with a one-row dataframe as input. This will be tedious to explain to novices (and it still requires indexing with `[[` anyway).

Finally, the approach of including models as a list-column in a dataframe is a rather advanced topic.

Anyway, here is a sketch that builds the models in a tibble, then branches over the rows of the tibble:

```r
source("R/packages.R")
source("R/functions.R")

# Summarize one model; expects a one-row tibble with columns
# `model_name` (character) and `model` (list-column holding an lm object)
summarize_model <- function(model_tibble) {
  model_name <- model_tibble$model_name
  # The model is stored in a list-column, so it still has to be extracted with [[
  model <- model_tibble$model[[1]]
  glance(model) |>
    mutate(model_name = model_name) |>
    relocate(model_name, .before = 1)
}

tar_plan(
  # Load raw data
  tar_file_read(
    penguins_data_raw,
    path_to_file("penguins_raw.csv"),
    read_csv(!!.x, show_col_types = FALSE)
  ),
  # Clean data
  penguins_data = clean_penguin_data(penguins_data_raw),
  # Build models
  models = tibble(
    model_name = c("combined_model", "species_model", "interaction_model"),
    model = list(
      lm(bill_depth_mm ~ bill_length_mm, data = penguins_data),
      lm(bill_depth_mm ~ bill_length_mm + species, data = penguins_data),
      lm(bill_depth_mm ~ bill_length_mm * species, data = penguins_data)
    )
  ),
  # Get model summaries (one branch per row of `models`)
  tar_target(
    model_summaries,
    summarize_model(models),
    pattern = map(models)
  )
)
```
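To make the "one row of input" point concrete, here is a small sketch that mimics outside of {targets} what each branch receives. It assumes {broom}, {dplyr}, {tibble}, and {palmerpenguins} are installed, and uses `palmerpenguins::penguins` as a stand-in for `penguins_data`. The final `bind_rows()` call only approximates how the branches get combined.

```r
library(broom)
library(dplyr)
library(tibble)
library(palmerpenguins) # assumption: stand-in for penguins_data

# Same function as in the plan, repeated so this runs on its own
summarize_model <- function(model_tibble) {
  model_name <- model_tibble$model_name
  model <- model_tibble$model[[1]]
  glance(model) |>
    mutate(model_name = model_name) |>
    relocate(model_name, .before = 1)
}

models <- tibble(
  model_name = c("combined_model", "species_model"),
  model = list(
    lm(bill_depth_mm ~ bill_length_mm, data = penguins),
    lm(bill_depth_mm ~ bill_length_mm + species, data = penguins)
  )
)

# Each branch sees a single row of the tibble
one_branch <- models[1, ]
summarize_model(one_branch)

# Roughly what the combined target looks like: one summary row per branch
result <- bind_rows(
  lapply(seq_len(nrow(models)), \(i) summarize_model(models[i, ]))
)
```

Note that `models[1, ]` is still a tibble whose `model` column is a list of length one, which is why the `[[1]]` cannot be dropped.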

I now realize a more natural introduction to branching would be to branch over different sets of input instead of different models.
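A possible sketch of that alternative, branching over groups of rows instead of over models. This is an assumption-laden illustration, not a tested lesson plan: `summarize_species_model()` is a hypothetical function, I'm feeding in `palmerpenguins::penguins` directly instead of the cleaned data, and the grouping uses `tar_group_by()` from {tarchetypes}.

```r
# _targets.R (sketch)
library(targets)
library(tarchetypes)

# Packages available to the targets when the pipeline runs
tar_option_set(packages = c("broom", "dplyr", "palmerpenguins"))

# Hypothetical function: fit one model to one subset of the data
# and return a one-row summary labeled with the species
summarize_species_model <- function(penguins_subset) {
  lm(bill_depth_mm ~ bill_length_mm, data = penguins_subset) |>
    broom::glance() |>
    dplyr::mutate(
      species = as.character(penguins_subset$species[[1]]),
      .before = 1
    )
}

tar_plan(
  # Split the data into one group per species
  tar_group_by(
    penguins_by_species,
    palmerpenguins::penguins,
    species
  ),
  # One branch per species; each branch receives that species' rows
  tar_target(
    species_summaries,
    summarize_species_model(penguins_by_species),
    pattern = map(penguins_by_species)
  )
)
```

Here the function takes an ordinary dataframe and there is no list-column, so no `[[` indexing of models is needed.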

@multimeric keen to hear your thoughts!