AlexsLemonade / OpenScPCA-analysis

An open, collaborative project to analyze data from the Single-cell Pediatric Cancer Atlas (ScPCA) Portal
Other
5 stars 14 forks source link

Add function to sweep clustering parameters #765

Closed sjspielman closed 1 week ago

sjspielman commented 1 week ago

Closes #755

This PR adds a function and tests to perform clustering across a set of parameters. Implementation details:

sjspielman commented 1 week ago

Here's a question for you regarding allowing multiple algorithms: What is the right way to handle varying parameters that are shared across algorithms?

For example, here,

Alternatively, we could have some very fine control. Taking a stab at this conceptually but it feels a bit gnarly?

sweep_clusters(
  sce, 
  list(
    "louvain" = list(nn = c(10, 15), 
    "leiden" = list(nn = c(20, 25), objective_function = "modularity")
  )
)
jashapiro commented 1 week ago

Here's a question for you regarding allowing multiple algorithms: What is the right way to handle varying parameters that are shared across algorithms?

For example, here,

  • objective_function will apply to only leiden since it's algorithm-specific
  • but nn would apply to both louvain and leiden.
sweep_clusters(
  sce, 
  algorithm = c("louvain", "leiden"), 
  objective_function = "modularity", 
  nn = c(10, 15)
)

Alternatively, we could have some very fine control. Taking a stab at this conceptually but it feels a bit gnarly?

sweep_clusters(
  sce, 
  list(
    "louvain" = list(nn = c(10, 15), 
    "leiden" = list(nn = c(20, 25), objective_function = "modularity")
  )
)

I would do the former, but then handle it in the function withmutate(objective_function = ifelse(algorithm == "leiden", objective_function, NA_character_) after the expand_grid

sjspielman commented 1 week ago

I would do the former, but then handle it in the function with mutate(objective_function = ifelse(algorithm == "leiden", objective_function, NA_character_)

I had basically this exact this code in there yesterday before I locked down the algorithm! But in the end, I didn't think it was actually needed since calculate_clusters will ignore irrelevant parameters. I suspect we can just toss everything into the parameters list and the situation will sort itself out, but I look forward to finding out what I might be missing 😄

jashapiro commented 1 week ago

I had basically this exact this code in there yesterday before I locked down the algorithm! But in the end, I didn't think it was actually needed since calculate_clusters will ignore irrelevant parameters. I suspect we can just toss everything into the parameters list and the situation will sort itself out, but I look forward to finding out what I might be missing 😄

The reason we can't just let calculate_clusters ignore it is the multiple runs problem. The mutate function should be followed by a distinct to remove repeats.

sjspielman commented 1 week ago

This is getting closer, but is definitely not there yet. Some updates:

jashapiro commented 1 week ago

Speaking of cluster_args - this is a problem! We don't check that these are sane for the given algorithm, and I tend to think we indeed should not be in the business of doing so since it depends on igraph. That said, bluster will fail if an irrelevant parameter is passed in, and that's very much a possibility with the sweep function. Here's what I've thought of (noting I prefer the latter) -

  • We could cancel cluster_args for the sweep function
  • We could have users supply this argument a nested list per algorithm, e.g. cluster_args = list(louvain = ..., walktrap = ....)).

I would vote for no cluster_args in the sweep function. If you really need it, you can write your own sweep. And if there are particular cluster_args worth adding, we can add support as needed.