AlexsLemonade / OpenScPCA-analysis

An open, collaborative project to analyze data from the Single-cell Pediatric Cancer Atlas (ScPCA) Portal
Other
1 stars 13 forks source link

rOpenScpCA: Add function to sweep clustering parameters #755

Open sjspielman opened 1 week ago

sjspielman commented 1 week ago

If you are filing this issue based on a specific GitHub Discussion, please link to the relevant Discussion.

https://github.com/AlexsLemonade/OpenScPCA-analysis/discussions/727 & https://github.com/AlexsLemonade/OpenScPCA-analysis/discussions/731

Describe the goals of the changes to the analysis module.

This issue tracks adding a function that can be used to sweep clustering parameters, e.g. nearest neighbor values. While there exists a function bluster::clusterSweep(), implementing this functionality ourselves will allow for improved reproducibility since we can ensure that the seed is set fresh at each parameter iteration.

Additional details will be filled in about this function once #749 & #754 have been implemented since some details may still be up in the air.

What will your pull request contain?

Function and tests.

Will you require additional software beyond what is already in the analysis module?

N/A

Will you require different computational resources beyond what the analysis module already uses?

N/A

If known, when do you expect to file the pull request?

No response

sjspielman commented 1 day ago

Time for additional implementation thoughts. A big question is how flexible we want this function to be, specifically how many parameters should it be allowed to sweep at once and should we limit which parameters can be swept? bluster::clusterSweep() is entirely flexible about this, but we don't need that level, and if someone does, we should just tell them to use bluster directly.

I am thinking it would be reasonable to only allow sweeping across parameters we specifically have arguments for, but not something in cluster_args(). Implementation-wise, I envision using expand.grid making a data frame with all combinations of provided parameters (one row for one parameter combination), and map over rows of that data frame calling our clustering function as we go. I'm also thinking I'll end up moving some argument checks into their own functions to avoid duplicative code.

Tagging @jashapiro for any thoughts on this plan!

jashapiro commented 1 day ago

I don't think you need to be too flexible: using the defined arguments seems like a good compromise (some of those do go into cluster_args, but I know what you mean).

I don't think you really need to add additional argument checks; I suppose if you really want to you can, but I don't see value in creating separate functions to do those. They should be one line at most, unless there is something I am missing.

Oh, and I assume you were planning to use tidyr::expand_grid() rather than the builtin. Just in case.

sjspielman commented 1 day ago

Oh, and I assume you were planning to use tidyr::expand_grid() rather than the builtin

💯

sjspielman commented 1 day ago

I don't see value in creating separate functions to do those. They should be one line at most, unless there is something I am missing.

Agree, certainly not several functions. Really, I just want to split out preparing the PCA matrix - check the kind of object, and ensure we end up with a matrix with row names.

Edit - eh, I guess I don't need to do this either!

jashapiro commented 1 day ago

Agree, certainly not several functions. Really, I just want to split out preparing the PCA matrix - check the kind of object, and ensure we end up with a matrix with row names.

Edit - eh, I guess I don't need to do this either!

I would probably do some quick profiling here to check how free pulling out the matrix is. If it comes with rownames I expect it is essentially free, but if it doesn't I would go ahead and pull it at the start of the sweep function. I might do that anyway (pull out the matrix) and not worry about the fact that there might be a bit of duplicated code.