The expected input here will be:

- a `SingleCellExperiment` object containing integrated (or unintegrated) results
- `TRUE` or `FALSE` indicating whether or not integration has been performed; the default will be `unintegrated = FALSE` (this will be useful for comparing integrated to unintegrated results)

The expected output of this function should then be a table of silhouette width results with the following columns:
- `rep` - the given downsampling replicate
- `silhouette_width` - the calculated silhouette width for the given rep
- `silhouette_cluster` - the assigned cluster for the cell during silhouette calculation, i.e. the true identity
- `integration_method` - the given integration method
- `batch` - the batch ID
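For concreteness, a couple of hypothetical rows of that output might look like the sketch below; every value here is invented purely for illustration:

```r
library(tibble)

# Hypothetical illustration of the expected output table; all values are made up
tribble(
  ~rep, ~silhouette_width, ~silhouette_cluster, ~integration_method, ~batch,
  1L,   0.12,              "1",                 "fastMNN",           "batch_A",
  1L,   -0.05,             "2",                 "fastMNN",           "batch_B"
)
```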
In looking at the integration metrics in more detail, I think we want to change what you have here slightly. Specifically, I suggest we drop the `unintegrated` flag and the `integration_method` argument that we have in the original functions and instead have a `pc_name` argument. This argument would take in the names of all the PCs in the object that we want to calculate the metric for (e.g., `c("PCA", "fastmnn_PCA", "harmony_PCA")`). See below for my thoughts on how we should set up the wrapper function:
c("PCA", "fastmnn_PCA", "harmony_PCA")
)ASW
or ARI
)PC_name
from integrated and merged object silhouette_width
, silhouette_width
, other_cluster
, PC_name
ari
, PC_name
rep
silhouette_width
, silhouette_width
, other_cluster
, PC_name
, and rep
OR ari
, PC_name
, and rep
Then the function that actually calculates the ASW would run `bluster::approxSilhouette()` on the downsampled `PC_name` and return the `silhouette_width` and `other_cluster` for each cell.
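As a rough sketch of what that call might look like (assuming, hypothetically, that the batch labels live in `sce$batch` and that we handle one PC name at a time; `sil_df` is my naming, not from the thread):

```r
library(bluster)
library(SingleCellExperiment)

# Sketch: ASW for a single PC name; assumes batch labels are stored in sce$batch
pcs <- reducedDim(sce, "fastmnn_PCA") # cells x PCs matrix
sil <- approxSilhouette(pcs, clusters = sce$batch)

# approxSilhouette() returns a DataFrame with `cluster`, `other`, and `width` columns,
# which correspond to silhouette_cluster, other_cluster, and silhouette_width here
sil_df <- data.frame(
  silhouette_width   = sil$width,
  silhouette_cluster = sil$cluster,
  other_cluster      = sil$other,
  pc_name            = "fastmnn_PCA"
)
```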
I'm going to tag folks that are out on Monday for their opinions. In the meantime @cbethell let me know if you have any additional thoughts.
@sjspielman and @jashapiro could you please leave any additional thoughts or comments by EOD on Wednesday 6/28? Thank you!
This comment also applies to #175
In pseudocode terms, here's how I read the proposed approach:

```
calculate_metric(type = "ARI or ASW") {
  pc_and_downsample()
  if type is ARI { call other function to calculate_ari }
  else if type is ASW { call other function to calculate_asw }
}
```
I think I favor an approach more like this, which is more modular and similar to what we did in `sc-data-integration`:
https://github.com/AlexsLemonade/sc-data-integration/blob/30f200c12e7259590cc12d571837e632bc92a8aa/scripts/utils/integration-helpers.R#L528
```
calculate_ari() {
  pc_and_downsample()
  ...calculations...
}

calculate_asw() {
  pc_and_downsample()
  ...calculations...
}
```
This way, the output contents do not depend on the specific input metric, and I think it makes the code more maintainable long term. That said, my approach may be slightly less efficient since it calls the `pc_and_downsample()` function twice, and the wrapper approach @allyhawkins suggested does not! I don't have a clear sense of what kind of performance hit this would cause, but I suspect fairly minor? Either way, one could also do something like this:
```
# calculate this first for less repetitive code
downsampled <- pc_and_downsample()

calculate_ari(downsampled) {
  ...calculations...
}

calculate_asw(downsampled) {
  ...calculations...
}
```
> Either way, one could also do something like this:
> ```
> # calculate this first for less repetitive code
> downsampled <- pc_and_downsample()
> calculate_ari(downsampled) { ...calculations... }
> calculate_asw(downsampled) { ...calculations... }
> ```
Thanks for the input! This is what I was trying to describe in my original comment, just less clearly. The prep for ARI and ASW is exactly the same, so everything up to and including downsampling the PCs would live in the wrapper, which would then run either the ARI or the ASW calculation. I think we should avoid repeating the downsampling.
I think I disagree here. Downsampling is cheap, but storing downsampled matrices is not. So if the functions here are going to do repeated downsampling, then doing that within each of the functions should not be expensive, and leaving things similar to the source function would seem fine to me. (With the downsampling wrapped into a non-exported function, presumably.) That said, I'm not wholly opposed to a wrapper function, but if the output from ARI and ASW will be different data frames, I'm not sure there is much value in the unified function, as the downsampling will have to be repeated anyway.
I agree that these functions should just use the PC names as input, and do not need to know whether the data was integrated or not.
One thing I would modify in the source functions is that we can be a bit more efficient in the looping: rather than explicit loops, you can use `purrr::map(1:nrep, ...)` and then pipe that into a single call to `dplyr::bind_rows()`. This will save you from having to predefine the output data frame (and make it match the internal function).
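For instance, the replicate loop could be written along these lines (a sketch; `calculate_one_rep()` is a hypothetical stand-in for whatever per-replicate calculation is used):

```r
library(purrr)
library(dplyr)

# Map over replicates, then collapse the list of per-rep data frames in one call;
# no need to predefine an empty output data frame
all_reps <- purrr::map(1:nreps, function(rep_num) {
  calculate_one_rep(sce, pc_name) |>
    dplyr::mutate(rep = rep_num)
}) |>
  dplyr::bind_rows()
```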
One other question I have is about the number of PCs to keep when downsampling: does this make a big difference in speed, or can we just use the PCs that are in the input? I'm not sure whether the extra argument is worth the hassle.
I'm also wondering about the choice of k-means clustering for this metric. Is this something we might want to revisit, perhaps just by adding an argument for the clustering method to use (a `bluster` parameter object)?
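If we go that route, `bluster` already supports swapping methods through its parameter objects, so the argument could be passed straight through; a minimal sketch, assuming the PCs are already extracted:

```r
library(bluster)
library(SingleCellExperiment)

pcs <- reducedDim(sce, "PCA")

# current behavior: k-means
kmeans_clusters <- clusterRows(pcs, KmeansParam(centers = 10))

# same interface, graph-based clustering instead
graph_clusters <- clusterRows(pcs, NNGraphParam(k = 20, cluster.fun = "louvain"))
```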
> I think I disagree here. Downsampling is cheap, but storing downsampled matrices is not. So if the functions here are going to do repeated downsampling, then doing that within each of the functions should not be expensive, and leaving things similar to the source function would seem fine to me. (With the downsampling wrapped into a non-exported function, presumably.) That said, I'm not wholly opposed to a wrapper function, but if the output from ARI and ASW will be different data frames, I'm not sure there is much value in the unified function, as the downsampling will have to be repeated anyway.
The main reason I was trying to create a wrapper function is to eliminate repeating code for the "prep" steps like grabbing the PCs and downsampling. But maybe the following would make more sense, where we have three functions: (1) downsample PCs, (2) calculate ASW, and (3) calculate ARI.
```
# separate function used to grab pcs and downsample
# don't export and only run within ASW and ARI functions
downsample_pcs(SCE object, pc_name, fraction of cells)

calculate_asw(SCE object, pc_name(s), nreps, fraction of cells, label column (batch or celltype)) {
  purrr::map(1:nrep) {
    downsampled <- downsample_pcs(SCE object, pc_name, fraction of cells)
    run bluster::approxSilhouette()
  }
}
```
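In more concrete R terms, that might look something like the sketch below; the function and argument names are placeholders, and the downsampling is written as a simple random subset of cells:

```r
library(bluster)
library(purrr)
library(dplyr)
library(SingleCellExperiment)

# Internal helper (not exported): keep a random fraction of cells
# from the named reduced dimension
downsample_pcs <- function(sce, pc_name, frac_cells = 0.8) {
  pcs <- reducedDim(sce, pc_name)
  keep <- sample(nrow(pcs), size = floor(frac_cells * nrow(pcs)))
  list(pcs = pcs[keep, , drop = FALSE], cells = keep)
}

calculate_asw <- function(sce, pc_name, nreps, frac_cells, label_column) {
  purrr::map(1:nreps, function(rep_num) {
    downsampled <- downsample_pcs(sce, pc_name, frac_cells)
    labels <- colData(sce)[[label_column]][downsampled$cells]
    sil <- approxSilhouette(downsampled$pcs, clusters = labels)
    data.frame(
      silhouette_width   = sil$width,
      silhouette_cluster = sil$cluster,
      other_cluster      = sil$other,
      pc_name            = pc_name,
      rep                = rep_num
    )
  }) |>
    dplyr::bind_rows()
}
```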
> One thing I would modify in the source functions is that we can be a bit more efficient in the looping: rather than explicit loops, you can use `purrr::map(1:nrep, ...)` and then pipe that into a single call to `dplyr::bind_rows()`. This will save you from having to predefine the output data frame (and make it match the internal function).
Yes, I definitely was not planning on using an actual `for` loop. I just included that to explain that it would be done on every rep.
> One other question I have is about the number of PCs to keep when downsampling: does this make a big difference in speed, or can we just use the PCs that are in the input? I'm not sure whether the extra argument is worth the hassle.
Honestly I don't think there's a huge difference here. I'm good with leaving this off.
> I'm also wondering about the choice of k-means clustering for this metric. Is this something we might want to revisit, perhaps just by adding an argument for the clustering method to use (a `bluster` parameter object)?
Based on https://github.com/AlexsLemonade/scpcaTools/issues/176#issuecomment-1611893765, we are going to create a separate clustering function. This would then be run after downsampling and prior to calculating ARI.
```
calculate_ari(SCE object, pc_name(s), nreps, fraction of cells, label column (batch or celltype)) {
  purrr::map(1:nrep) {
    downsampled <- downsample_pcs(SCE object, pc_name, fraction of cells)
    clusters <- get_clusters(downsampled, cluster_type, any other clustering params)
    get ARI between clusters and label_column
  }
}
```
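And a matching sketch for the ARI side, reusing the hypothetical `downsample_pcs()` helper above. Here a `bluster` parameter object stands in for the eventual `get_clusters()` step, and `bluster::pairwiseRand(mode = "index")` computes the adjusted Rand index; all names are placeholders:

```r
library(bluster)
library(purrr)
library(dplyr)
library(SingleCellExperiment)

calculate_ari <- function(sce, pc_name, nreps, frac_cells, label_column,
                          bluster_param = KmeansParam(centers = 10)) {
  purrr::map(1:nreps, function(rep_num) {
    downsampled <- downsample_pcs(sce, pc_name, frac_cells)
    labels <- colData(sce)[[label_column]][downsampled$cells]
    clusters <- clusterRows(downsampled$pcs, bluster_param)
    data.frame(
      # pairwiseRand(mode = "index") returns the adjusted Rand index
      ari     = pairwiseRand(labels, clusters, mode = "index"),
      pc_name = pc_name,
      rep     = rep_num
    )
  }) |>
    dplyr::bind_rows()
}
```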
I am definitely okay with revisiting the choice of clustering. Previously we had been using k-means across a range of `K`, which we could still do. We could also do the same thing for graph-based clustering instead.
**Is your feature request related to a problem? Please describe.**
Per https://github.com/AlexsLemonade/scpca-downstream-analyses/issues/352#issuecomment-1593053590: we are finding that we may have more than one use case for the integration performance metrics used in the `sc-data-integration` repo. Given that, we should get these functions in a single place so that they can be utilized across multiple repos post-integration.

**Describe the solution you'd like**
Add a function for calculating batch average silhouette width (ASW) similar to the script in the `sc-data-integration` repo: https://github.com/AlexsLemonade/sc-data-integration/blob/main/scripts/utils/calculate-silhouette-width.R