AlexsLemonade / scpcaTools

BSD 3-Clause "New" or "Revised" License

Add function to calculate batch ASW #177

Closed cbethell closed 1 year ago

cbethell commented 1 year ago

Is your feature request related to a problem? Please describe.
Per https://github.com/AlexsLemonade/scpca-downstream-analyses/issues/352#issuecomment-1593053590: we are finding that we may have more than one use case for the integration performance metrics used in the sc-data-integration repo. Given that, we should collect these functions in a single place so that they can be used across multiple repos post-integration.

Describe the solution you'd like
Add a function for calculating batch average silhouette width (ASW), similar to the script in the sc-data-integration repo: https://github.com/AlexsLemonade/sc-data-integration/blob/main/scripts/utils/calculate-silhouette-width.R
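For orientation, here is a minimal sketch of the underlying calculation, assuming a SingleCellExperiment sce with a "PCA" entry in reducedDims and a batch column in colData (both names are assumptions, not a final interface):

library(SingleCellExperiment)

pcs <- reducedDim(sce, "PCA")                        # cells x PCs matrix
sil <- bluster::approxSilhouette(pcs, clusters = sce$batch)
batch_asw <- mean(sil$width)                         # average silhouette width across all cells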

cbethell commented 1 year ago

The expected input here will be:

  1. A SingleCellExperiment object containing integrated (or unintegrated) results
  2. A string specifying the batch column
  3. The integration method(s) that were used
  4. TRUE or FALSE indicating whether integration has been performed; the default will be unintegrated = FALSE (useful for comparing integrated to unintegrated results)
  5. A seed to set for subsampling
  6. The number of PCs to use during k-means clustering (with 20 as the default as in the sc-data-integration repo)
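Putting those inputs together, a hypothetical signature (all argument names are illustrative, not final) might be:

calculate_batch_asw <- function(
    sce,                     # SingleCellExperiment with integrated (or unintegrated) results
    batch_column,            # string naming the batch column
    integration_methods,     # integration method(s) that were used
    unintegrated = FALSE,    # TRUE if integration has not been performed
    seed = NULL,             # seed to set before subsampling
    num_pcs = 20             # number of PCs to use during k-means clustering
) {
  # returns a table of silhouette width results
}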

The expected output of this function should then be a table of silhouette width results with the following columns:

allyhawkins commented 1 year ago

In looking at the integration metrics in more detail, I think we want to change what you have here slightly.

See below for my thoughts on how we should set up the wrapper function:

Then the function that actually calculates the ASW would look like this.

I'm going to tag folks who are out on Monday for their opinions. In the meantime, @cbethell, let me know if you have any additional thoughts.

allyhawkins commented 1 year ago

@sjspielman and @jashapiro could you please leave any additional thoughts or comments by EOD on Wednesday 6/28? Thank you!

sjspielman commented 1 year ago

This comment also applies to #175

In pseudocode terms, here's how I read the proposed approach -

calculate_metric(type = "ARI or ASW") {
   pc_and_downsample()
   if type is ARI { call other function to calculate_ari }
   else if type is ASW { call other function to calculate_asw }
}

I think I favor an approach more like this, which is more modular and similar to what we did in sc-data-integration https://github.com/AlexsLemonade/sc-data-integration/blob/30f200c12e7259590cc12d571837e632bc92a8aa/scripts/utils/integration-helpers.R#L528

calculate_ari() {
    pc_and_downsample()
   ...calculations...
}
calculate_asw() {
    pc_and_downsample()
   ...calculations...
}

This way, the output contents do not depend on the specific input metric and I think it makes code more maintainable long term. That said, my approach may be slightly less efficient since it calls the pc_and_downsample() function twice, and the wrapper approach @allyhawkins suggested does not! I don't have a clear sense of what kind of performance hit this would cause, but I suspect fairly minor? Either way, one could also do something like this -

# calculate this first for less repetitive code
downsampled <- pc_and_downsample()

calculate_ari(downsampled) {
   ...calculations...
}
calculate_asw(downsampled) {
   ...calculations...
}

allyhawkins commented 1 year ago

> Either way, one could also do something like this -
>
> # calculate this first for less repetitive code
> downsampled <- pc_and_downsample()
>
> calculate_ari(downsampled) {
>    ...calculations...
> }
> calculate_asw(downsampled) {
>    ...calculations...
> }

Thanks for the input! This is what I was trying to describe in my original comment, just less clearly. The prep for ARI and ASW is exactly the same, so everything up to and including downsampling the PCs goes in the wrapper, which then runs either the ARI or the ASW calculation. I think we should avoid repeating the downsampling.

jashapiro commented 1 year ago

I think I disagree here. Downsampling is cheap, but storing downsampled matrices is not. So if the functions here are going to do repeated downsampling, then doing that within each of the functions should not be expensive, and leaving things similar to the source function would seem fine to me. (With the downsampling wrapped into a non-exported function, presumably.) That said, I'm not wholly opposed to a wrapper function, but if the output from ARI and ASW will be different data frames, I'm not sure there is much value in the unified function, as downsampling will have to be repeated anyway.

I agree that these functions should just use the PC names as input, and do not need to know whether the data was integrated or not.

One thing I would modify in the source functions is that we can be a bit more efficient in the looping: rather than explicit loops, you can use purrr::map(1:nrep, ...) and then pipe the result into a single dplyr::bind_rows() call. This will save you from having to predefine the output data frame (and make it match the internal function).
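For instance, a minimal sketch of that pattern, where calc_one_rep() is a hypothetical helper that returns one replicate's results as a data frame:

results <- purrr::map(1:nreps, function(rep) {
  calc_one_rep(sce, rep)  # one data frame of metric results per replicate
}) |>
  dplyr::bind_rows()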

One other question I have is about the number of PCs to keep when downsampling: does this make a big difference in speed, or can we just use the PCs that are in the input? I'm not sure whether the extra argument is worth the hassle.

jashapiro commented 1 year ago

I'm also wondering about the choice of kmeans clustering for this metric. Is this something we might want to revisit, perhaps just by adding an argument for the clustering method to use (a bluster parameter object)?
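For example (a sketch only), the clustering step could accept any bluster parameter object and pass it straight through to bluster::clusterRows(), where pcs is an assumed cells-by-PCs matrix:

# k-means, as currently used
clusters <- bluster::clusterRows(pcs, BLUSPARAM = bluster::KmeansParam(centers = 20))

# a graph-based method, with no other code changes
clusters <- bluster::clusterRows(pcs, BLUSPARAM = bluster::NNGraphParam(k = 20))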

allyhawkins commented 1 year ago

> I think I disagree here. Downsampling is cheap, but storing downsampled matrices is not. So if the functions here are going to do repeated downsampling, then doing that within each of the functions should not be expensive, and leaving things similar to the source function would seem fine to me. (With the downsampling wrapped into a non-exported function, presumably.) That said, I'm not wholly opposed to a wrapper function, but if the output from ARI and ASW will be different data frames, I'm not sure there is much value in the unified function, as downsampling will have to be repeated anyway.

The main reason I was trying to create a wrapper function was to eliminate repeated code for the "prep" steps, like grabbing the PCs and downsampling. But maybe the following would make more sense, where we have three functions: (1) downsample PCs, (2) calculate ASW, and (3) calculate ARI.

# separate function used to grab PCs and downsample;
# don't export, only run within the ASW and ARI functions
downsample_pcs <- function(sce, pc_name, frac_cells) {
  # pull reducedDim(sce, pc_name), then subsample frac_cells of the cells
}

calculate_asw <- function(sce, pc_name, nreps, frac_cells, label_column) {
  # label_column is the batch or cell type column in colData
  purrr::map(1:nreps, function(rep) {
    downsampled <- downsample_pcs(sce, pc_name, frac_cells)
    # run bluster::approxSilhouette() on the downsampled PCs,
    # grouping cells by label_column
  })
}

> One thing I would modify in the source functions is that we can be a bit more efficient in the looping: rather than explicit loops, you can use purrr::map(1:nrep, ...) and then pipe the result into a single dplyr::bind_rows() call. This will save you from having to predefine the output data frame (and make it match the internal function).

Yes, I definitely was not planning on using an actual for loop. I just included that to explain that it would be done on every rep.

> One other question I have is about the number of PCs to keep when downsampling: does this make a big difference in speed, or can we just use the PCs that are in the input? I'm not sure whether the extra argument is worth the hassle.

Honestly I don't think there's a huge difference here. I'm good with leaving this off.

> I'm also wondering about the choice of kmeans clustering for this metric. Is this something we might want to revisit, perhaps just by adding an argument for the clustering method

Based on https://github.com/AlexsLemonade/scpcaTools/issues/176#issuecomment-1611893765, we are going to create a separate clustering function. This would then be run after downsampling and prior to calculating ARI.

calculate_ari <- function(sce, pc_name, nreps, frac_cells, label_column) {
  purrr::map(1:nreps, function(rep) {
    downsampled <- downsample_pcs(sce, pc_name, frac_cells)
    clusters <- get_clusters(downsampled, cluster_type)  # plus any other clustering params

    # get the ARI between clusters and the labels in label_column,
    # e.g. via bluster::pairwiseRand(labels, clusters, mode = "index")
  })
}

I am definitely okay with revisiting the choice of clustering. Previously we had been using k-means across a range of k, which we could still do. We could also do the same thing with graph-based clustering instead.
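As a sketch of what that sweep could look like (the range of k and the pcs matrix are assumptions):

ks <- seq(5, 25, by = 5)
kmeans_clusters <- purrr::map(ks, function(k) {
  bluster::clusterRows(pcs, bluster::KmeansParam(centers = k))
})

# the graph-based version just swaps the parameter object
graph_clusters <- bluster::clusterRows(pcs, bluster::NNGraphParam(k = 20))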