Add function to sweep clustering parameters

AlexsLemonade / OpenScPCA-analysis

An open, collaborative project to analyze data from the Single-cell Pediatric Cancer Atlas (ScPCA) Portal

Other

5 stars 14 forks source link

Add function to sweep clustering parameters #765

Closed sjspielman closed 1 week ago

sjspielman commented 1 week ago

Closes #755

This PR adds a function and tests to perform clustering across a set of parameters. Implementation details:

First, I am hardly enamored by my function and file naming. Do you agree and if so what thoughts do you have for me??
The function ends up creating a data frame with an indicator column cluster_set to indicate which round of clustering the values pertain to, and for easier spliting in the future. Note that I could also leave this as a list and remove the dplyr::bind_rows() and allow users to take this step themselves if they prefer. A list might also be preferable since this function will end up getting used in a template Rmd to perform/evaluate clustering parameters, so for a lot of that we'd have to revert to a list anyways.
- In part because of the need to keep these columns consistent, and also because it felt overkill/beyond the scope of this function's specific purpose, I do not allow users to vary the algorithm. You get to pick 1 algorithm and vary parameters for that algorithm.
I ended up adding a threads argument, since this can get time consuming with real SCEs
- Speaking of real SCEs, how long does it take to extract a matrix from an SCE or Seurat object (a real big one, not simulated!)? It's essentially instantaneous. I still extract the matrix to begin with to save some time, but it's small potatoes either way.
Any other tests we'd like to see?

sjspielman commented 1 week ago

Here's a question for you regarding allowing multiple algorithms: What is the right way to handle varying parameters that are shared across algorithms?

For example, here,

objective_function will apply to only leiden since it's algorithm-specific

but nn would apply to both louvain and leiden.

sweep_clusters(
sce, 
algorithm = c("louvain", "leiden"), 
objective_function = "modularity", 
nn = c(10, 15)
)

Alternatively, we could have some very fine control. Taking a stab at this conceptually but it feels a bit gnarly?

sweep_clusters(
  sce, 
  list(
    "louvain" = list(nn = c(10, 15), 
    "leiden" = list(nn = c(20, 25), objective_function = "modularity")
  )
)

jashapiro commented 1 week ago

Here's a question for you regarding allowing multiple algorithms: What is the right way to handle varying parameters that are shared across algorithms?

For example, here,

objective_function will apply to only leiden since it's algorithm-specific

but nn would apply to both louvain and leiden.
sweep_clusters(
  sce, 
  algorithm = c("louvain", "leiden"), 
  objective_function = "modularity", 
  nn = c(10, 15)
)
Alternatively, we could have some very fine control. Taking a stab at this conceptually but it feels a bit gnarly?
sweep_clusters(
  sce, 
  list(
    "louvain" = list(nn = c(10, 15), 
    "leiden" = list(nn = c(20, 25), objective_function = "modularity")
  )
)

I would do the former, but then handle it in the function withmutate(objective_function = ifelse(algorithm == "leiden", objective_function, NA_character_) after the expand_grid

sjspielman commented 1 week ago

I would do the former, but then handle it in the function with mutate(objective_function = ifelse(algorithm == "leiden", objective_function, NA_character_)

I had basically this exact this code in there yesterday before I locked down the algorithm! But in the end, I didn't think it was actually needed since calculate_clusters will ignore irrelevant parameters. I suspect we can just toss everything into the parameters list and the situation will sort itself out, but I look forward to finding out what I might be missing 😄

jashapiro commented 1 week ago

I had basically this exact this code in there yesterday before I locked down the algorithm! But in the end, I didn't think it was actually needed since calculate_clusters will ignore irrelevant parameters. I suspect we can just toss everything into the parameters list and the situation will sort itself out, but I look forward to finding out what I might be missing 😄

The reason we can't just let calculate_clusters ignore it is the multiple runs problem. The mutate function should be followed by a distinct to remove repeats.

sjspielman commented 1 week ago

This is getting closer, but is definitely not there yet. Some updates:

While NA makes sense for avoiding duplicate runs when irrelevant params are varied, match.args doesn't want it. So, instead of NA, I used the values that represent the defaults for those parameters, which maybe seemed like a decent middle ground (maybe? decent? middle? i'm not hedging, you're hedging!). Did you have a different thought there?
Note that I caught a bug where we can't be adding in cluster_args to the final data frame if it's empty, so I fixed (and tested) here https://github.com/AlexsLemonade/OpenScPCA-analysis/pull/765/commits/0e47c817ef0ecdbf1e5a21ab1586183881365e93
Speaking of cluster_args - this is a problem! We don't check that these are sane for the given algorithm, and I tend to think we indeed should not be in the business of doing so since it depends on igraph. That said, bluster will fail if an irrelevant parameter is passed in, and that's very much a possibility with the sweep function. Here's what I've thought of (noting I prefer the latter) -
- We could cancel cluster_args for the sweep function
- We could have users supply this argument a nested list per algorithm, e.g. cluster_args = list(louvain = ..., walktrap = ....)).

jashapiro commented 1 week ago

Speaking of cluster_args - this is a problem! We don't check that these are sane for the given algorithm, and I tend to think we indeed should not be in the business of doing so since it depends on igraph. That said, bluster will fail if an irrelevant parameter is passed in, and that's very much a possibility with the sweep function. Here's what I've thought of (noting I prefer the latter) -

We could cancel cluster_args for the sweep function

We could have users supply this argument a nested list per algorithm, e.g. cluster_args = list(louvain = ..., walktrap = ....)).

I would vote for no cluster_args in the sweep function. If you really need it, you can write your own sweep. And if there are particular cluster_args worth adding, we can add support as needed.