multi-block selection in recipes

topepo commented 3 months ago

At useR, @abichat mentioned a difficulty with recipes that I believe is related to selecting groups of columns for different classes of analytes.

Can you describe the problem a little more so that @EmilHvitfeldt and I can think about it?

abichat commented 3 months ago

Thanks for opening this issue!

This is about multi-block, multi-view or multi-omics datasets, where the explanatory matrix X = [X_1, ..., X_K] is the horizontal concatenation of K blocks of variables X_k with different numbers of columns (but the same number of rows) (notation from Mangamana et al., 2019).

Let's begin with a small use case. Imagine that you want to perform a PCA on a multi-block dataset (where each variable is already centred and scaled). You could (should?) normalise each block X_k to make the blocks comparable. This could be done by dividing each variable by either the number of variables in its block or the first eigenvalue of the block. You could then apply a classical step_pca() on the concatenated and preprocessed matrix X.
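To make the use case concrete, here is a minimal base R sketch of the two block normalisations described above, followed by a PCA on the concatenated matrix. The toy blocks, their sizes, and the use of prcomp() in place of step_pca() are assumptions for illustration only; variants that divide by the square root of the block size or of the first eigenvalue would be a one-line change.

```r
set.seed(1)
n <- 20

# Two toy blocks of different widths, centred and scaled column-wise
blocks <- list(
  a = scale(matrix(rnorm(n * 3), n, 3)),
  b = scale(matrix(rnorm(n * 10), n, 10))
)

# Option 1: divide every variable by the number of variables in its block
by_size <- lapply(blocks, function(B) B / ncol(B))

# Option 2: divide every variable by the first eigenvalue of its block
# (largest eigenvalue of the block's covariance matrix)
by_eig <- lapply(blocks, function(B) {
  lambda1 <- eigen(cov(B), symmetric = TRUE, only.values = TRUE)$values[1]
  B / lambda1
})

# Concatenate the normalised blocks and run a plain PCA on the result
X_norm <- do.call(cbind, by_eig)
pca <- prcomp(X_norm, center = FALSE, scale. = FALSE)
```

Note that both options need to know which columns belong to which block before the PCA can run, which is exactly the information a recipe step would have to be given.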

Here are various examples of variables and blocks specifications for multi-omics functions in an unsupervised setting:

MFA for dimension reduction

MFA() takes base, the concatenated matrix X, and group, the number of variables in each group. It can also take name.group to name the blocks, and type to specify whether each block is categorical or quantitative.

RGCCA for dimension reduction

rgcca() takes blocks, a (named) list of the X_k. It can also take connection, a matrix that specifies the relationships between blocks.

iClusterPlus for clustering

iClusterPlus() takes up to 4 blocks with its dt1, dt2, dt3 and dt4 arguments.

In my opinion, none of these specifications is tidyselect-friendly, because they require knowing in advance, and spelling out, the names of the variables in each block or the size of each block.

One option (and it's a personal point of view that needs to be challenged) is to allow the ... in step functions to accept a named list of selectors, like this:
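As an illustration of that bookkeeping, the sketch below derives the block sizes (the shape that the group argument of MFA() expects) and a named list of blocks (the shape that the blocks argument of rgcca() expects) from column-name prefixes, using base R only. The toy data and the prefixes are made up for the example, and neither function is actually called here.

```r
set.seed(1)

# Toy data: each column name carries a (made-up) block prefix
X <- data.frame(
  expr_1 = rnorm(5), expr_2 = rnorm(5),
  mut_1  = rbinom(5, 1, 0.5),
  prot_1 = rnorm(5), prot_2 = rnorm(5), prot_3 = rnorm(5)
)

prefixes <- c(expression = "expr_", mutations = "mut_", proteomic = "prot_")

# Column positions of each block, found by prefix
block_cols <- lapply(prefixes, function(p) which(startsWith(names(X), p)))

# Block sizes, in block order; this only describes X correctly if the
# columns of each block are contiguous and the blocks appear in this order
group <- lengths(block_cols)

# Named list of the X_k
blocks <- lapply(block_cols, function(j) X[, j, drop = FALSE])
```

All of this has to be done by hand before the modelling call, whereas a selector such as starts_with("expr_") would express the same thing declaratively.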

step_xxx(rec, list(expression = starts_with("expr_"),   # ~20k continuous variables
                   mutations  = starts_with("mut_"),    # ~20k binary variables
                   proteomic  = starts_with("prot_")))  # ~30k continuous variables

Although this kind of data emerges naturally in the omics field, it is not restricted to it, and maybe this development could be done in a dedicated recipes extension.

Thanks!

EmilHvitfeldt commented 3 months ago

Alright, I have spent some time on this. I agree that it is a worthwhile type of data, and that we should have a way of supporting it.

I don't think this is very tidyselect-like, but you could technically support your proposed syntax if you are willing to play around with some expressions:

multi_eval_select <- function(x, data, needed = c("expression", "mutations", "proteomic")) {
  # Capture the unevaluated `list(...)` call together with its environment
  quo <- rlang::enquo(x)
  x <- rlang::quo_get_expr(quo)
  env <- rlang::quo_get_env(quo)

  # Also rejects inputs that are not calls at all, e.g. a bare symbol
  if (!rlang::is_call(x, "list")) {
    cli::cli_abort("{.arg x} must be a list.")
  }

  list_names <- names(x)

  # Position of each required block in the call (NA if absent)
  matches <- match(needed, list_names)

  if (any(is.na(matches))) {
    cli::cli_abort("{needed[is.na(matches)]} is missing from {.arg x}.")
  }

  out <- vector("list", length = length(needed))
  names(out) <- needed

  # Evaluate each selector separately against `data`
  for (i in seq_along(needed)) {
    out[[i]] <- tidyselect::eval_select(x[[matches[i]]], data, env = env)
  }

  out
}

mtcars$expr_1 <- 3 
mtcars$expr_2 <- 3 
mtcars$expr_3 <- 3 

library(tidyselect)

multi_eval_select(
  list(expression = starts_with("expr_"),
       mutations  = starts_with("d"),
       proteomic  = contains("a")
  ),
  mtcars
)
#> $expression
#> expr_1 expr_2 expr_3 
#>     12     13     14 
#> 
#> $mutations
#> disp drat 
#>    3    5 
#> 
#> $proteomic
#> drat   am gear carb 
#>    5    9   10   11

But it is a little messy, and it is going to be hard to get right in terms of error handling in case people do the wrong thing. It is also non-ideal because it differs from the way all other recipes steps work.

I would suggest that you use something akin to the following syntax and use recipes_eval_select() for each argument.

step_xxx(rec,
         starts_with("expr_"),                # ~20k continuous variables, through `...`
         mutations = starts_with("mut_"),     # ~20k binary variables
         proteomic = starts_with("prot_"))    # ~30k continuous variables
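A rough sketch of how each argument could be resolved with recipes_eval_select() is given below. The helper select_blocks() and the toy data are hypothetical and only illustrate the one-selector-per-argument idea; in a real step, the quosures would be captured in the step function and resolved in its prep() method against the training data and the variable info passed to it.

```r
library(recipes)

# Hypothetical helper, not part of recipes: resolves one selection per block.
# The first block comes through `...`, the others through named arguments.
select_blocks <- function(rec, data, ..., mutations, proteomic) {
  info <- summary(rec)  # variable / type / role table for the recipe
  list(
    expression = recipes_eval_select(rlang::enquos(...), data, info),
    mutations  = recipes_eval_select(rlang::enquos(mutations), data, info),
    proteomic  = recipes_eval_select(rlang::enquos(proteomic), data, info)
  )
}

# Toy data with made-up block prefixes
toy <- data.frame(
  expr_1 = 1:3, expr_2 = 4:6,
  mut_1  = c(0, 1, 0),
  prot_1 = c(0.2, 0.5, 0.1)
)
rec <- recipe(~ ., data = toy)

blocks <- select_blocks(
  rec, toy,
  starts_with("expr_"),
  mutations = starts_with("mut_"),
  proteomic = starts_with("prot_")
)
```

Each element of blocks is a character vector of column names, so every block gets the usual recipes selection errors and role-aware selectors without any custom parsing of a list() call.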

abichat / scimo

multi-block selection in recipes #5