AlexsLemonade / scpcaTools

BSD 3-Clause "New" or "Revised" License

Add function to calculate batch ASW #177

Closed cbethell closed 1 year ago

cbethell commented 1 year ago

Is your feature request related to a problem? Please describe.
Per https://github.com/AlexsLemonade/scpca-downstream-analyses/issues/352#issuecomment-1593053590: we are finding that we may have more than one use case for the integration performance metrics used in the sc-data-integration repo. Given that, we should collect these functions in a single place so that they can be used across multiple repos post-integration.

Describe the solution you'd like
Add a function for calculating batch average silhouette width (ASW), similar to the script in the sc-data-integration repo: https://github.com/AlexsLemonade/sc-data-integration/blob/main/scripts/utils/calculate-silhouette-width.R
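For orientation, here is a minimal sketch of the underlying calculation, assuming a SingleCellExperiment sce with a "PCA" entry in reducedDims and a batch column in colData (both names are assumptions, not a final interface):

library(SingleCellExperiment)

pcs <- reducedDim(sce, "PCA")                        # cells x PCs matrix
sil <- bluster::approxSilhouette(pcs, clusters = sce$batch)
batch_asw <- mean(sil$width)                         # average silhouette width across all cells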

cbethell commented 1 year ago

The expected input here will be:

  1. A SingleCellExperiment object containing integrated (or unintegrated) results
  2. A string specifying the batch column
  3. The integration method(s) that were used
  4. TRUE or FALSE indicating whether integration has been performed; the default will be unintegrated = FALSE (useful for comparing integrated to unintegrated results)
  5. A seed to set for subsampling
  6. The number of PCs to use during k-means clustering (with 20 as the default as in the sc-data-integration repo)
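Putting those inputs together, a hypothetical signature (all argument names are illustrative, not final) might be:

calculate_batch_asw <- function(
    sce,                     # SingleCellExperiment with integrated (or unintegrated) results
    batch_column,            # string naming the batch column
    integration_methods,     # integration method(s) that were used
    unintegrated = FALSE,    # TRUE if integration has not been performed
    seed = NULL,             # seed to set before subsampling
    num_pcs = 20             # number of PCs to use during k-means clustering
) {
  # returns a table of silhouette width results
}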

The expected output of this function should then be a table of silhouette width results with the following columns:

allyhawkins commented 1 year ago

In looking at the integration metrics in more detail, I think we want to change what you have here slightly.

See below for my thoughts on how we should set up the wrapper function:

Then the function that actually calculates the ASW would look like this.

I'm going to tag folks who are out on Monday for their opinions. In the meantime, @cbethell, let me know if you have any additional thoughts.

allyhawkins commented 1 year ago

@sjspielman and @jashapiro could you please leave any additional thoughts or comments by EOD on Wednesday 6/28? Thank you!

sjspielman commented 1 year ago

This comment also applies to #175

In pseudocode terms, here's how I read the proposed approach -

calculate_metric(type = "ARI or ASW") {
   pc_and_downsample()
   if type is ARI { call other function to calculate_ari }
   else if type is ASW { call other function to calculate_asw }
}

I think I favor an approach more like this, which is more modular and similar to what we did in sc-data-integration https://github.com/AlexsLemonade/sc-data-integration/blob/30f200c12e7259590cc12d571837e632bc92a8aa/scripts/utils/integration-helpers.R#L528

calculate_ari() {
    pc_and_downsample()
   ...calculations...
}
calculate_asw() {
    pc_and_downsample()
   ...calculations...
}

This way, the output contents do not depend on the specific input metric and I think it makes code more maintainable long term. That said, my approach may be slightly less efficient since it calls the pc_and_downsample() function twice, and the wrapper approach @allyhawkins suggested does not! I don't have a clear sense of what kind of performance hit this would cause, but I suspect fairly minor? Either way, one could also do something like this -

# calculate this first for less repetitive code
downsampled <- pc_and_downsample()

calculate_ari(downsampled) {
   ...calculations...
}
calculate_asw(downsampled) {
   ...calculations...
}

allyhawkins commented 1 year ago

> Either way, one could also do something like this -
>
> # calculate this first for less repetitive code
> downsampled <- pc_and_downsample()
>
> calculate_ari(downsampled) {
>    ...calculations...
> }
> calculate_asw(downsampled) {
>    ...calculations...
> }

Thanks for the input! This is what I was trying to describe in my original comment, just less clearly. The prep for ARI and ASW is exactly the same, so everything up to and including downsampling the PCs goes in the wrapper, which then runs either the ARI or the ASW calculation. I think we should avoid repeating the downsampling.

jashapiro commented 1 year ago

I think I disagree here. Downsampling is cheap, but storing downsampled matrices is not. So if the functions here are going to do repeated downsampling, then doing that within each of the functions should not be expensive, and leaving things similar to the source function would seem fine to me. (With the downsampling wrapped into a non-exported function, presumably.) That said, I'm not wholly opposed to a wrapper function, but if the output from ARI and ASW will be different data frames, I'm not sure there is much value in the unified function, as downsampling will have to be repeated anyway.

I agree that these functions should just use the PC names as input, and do not need to know whether the data was integrated or not.

One thing I would modify in the source functions is that we can be a bit more efficient in the looping: rather than explicit loops, you can use purrr::map(1:nrep, ...) and then pipe the result into a single dplyr::bind_rows() call. This will save you from having to predefine the output data frame (and make it match the internal function).
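For instance, a minimal sketch of that pattern, where calc_one_rep() is a hypothetical helper that returns one replicate's results as a data frame:

results <- purrr::map(1:nreps, function(rep) {
  calc_one_rep(sce, rep)  # one data frame of metric results per replicate
}) |>
  dplyr::bind_rows()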

One other question I have is about the number of PCs to keep when downsampling: does this make a big difference in speed, or can we just use the PCs that are in the input? I'm not sure whether the extra argument is worth the hassle.

jashapiro commented 1 year ago

I'm also wondering about the choice of kmeans clustering for this metric. Is this something we might want to revisit, perhaps just by adding an argument for the clustering method to use (a bluster parameter object)?
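For example (a sketch only), the clustering step could accept any bluster parameter object and pass it straight through to bluster::clusterRows(), where pcs is an assumed cells-by-PCs matrix:

# k-means, as currently used
clusters <- bluster::clusterRows(pcs, BLUSPARAM = bluster::KmeansParam(centers = 20))

# a graph-based method, with no other code changes
clusters <- bluster::clusterRows(pcs, BLUSPARAM = bluster::NNGraphParam(k = 20))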

allyhawkins commented 1 year ago

> I think I disagree here. Downsampling is cheap, but storing downsampled matrices is not. So if the functions here are going to do repeated downsampling, then doing that within each of the functions should not be expensive, and leaving things similar to the source function would seem fine to me. (With the downsampling wrapped into a non-exported function, presumably.) That said, I'm not wholly opposed to a wrapper function, but if the output from ARI and ASW will be different data frames, I'm not sure there is much value in the unified function, as downsampling will have to be repeated anyway.

The main reason I was trying to create a wrapper function was to eliminate repeated code for the "prep" steps, like grabbing the PCs and downsampling. But maybe the following would make more sense, where we have three functions: (1) downsample PCs, (2) calculate ASW, and (3) calculate ARI.

# separate function used to grab PCs and downsample;
# don't export, only run within the ASW and ARI functions
downsample_pcs <- function(sce, pc_name, frac_cells) {
  # pull reducedDim(sce, pc_name), then subsample frac_cells of the cells
}

calculate_asw <- function(sce, pc_name, nreps, frac_cells, label_column) {
  # label_column is the batch or cell type column in colData
  purrr::map(1:nreps, function(rep) {
    downsampled <- downsample_pcs(sce, pc_name, frac_cells)
    # run bluster::approxSilhouette() on the downsampled PCs,
    # grouping cells by label_column
  })
}

> One thing I would modify in the source functions is that we can be a bit more efficient in the looping: rather than explicit loops, you can use purrr::map(1:nrep, ...) and then pipe the result into a single dplyr::bind_rows() call. This will save you from having to predefine the output data frame (and make it match the internal function).

Yes, I definitely was not planning on using an actual for loop. I just included that to explain that it would be done on every rep.

> One other question I have is about the number of PCs to keep when downsampling: does this make a big difference in speed, or can we just use the PCs that are in the input? I'm not sure whether the extra argument is worth the hassle.

Honestly I don't think there's a huge difference here. I'm good with leaving this off.

> I'm also wondering about the choice of kmeans clustering for this metric. Is this something we might want to revisit, perhaps just by adding an argument for the clustering method

Based on https://github.com/AlexsLemonade/scpcaTools/issues/176#issuecomment-1611893765, we are going to create a separate clustering function. This would then be run after downsampling and prior to calculating ARI.

calculate_ari <- function(sce, pc_name, nreps, frac_cells, label_column) {
  purrr::map(1:nreps, function(rep) {
    downsampled <- downsample_pcs(sce, pc_name, frac_cells)
    clusters <- get_clusters(downsampled, cluster_type)  # plus any other clustering params

    # get the ARI between clusters and the labels in label_column,
    # e.g. via bluster::pairwiseRand(labels, clusters, mode = "index")
  })
}

I am definitely okay with revisiting the choice of clustering. Previously we had been using k-means across a range of k, which we could still do. We could also do the same thing with graph-based clustering instead.
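As a sketch of what that sweep could look like (the range of k and the pcs matrix are assumptions):

ks <- seq(5, 25, by = 5)
kmeans_clusters <- purrr::map(ks, function(k) {
  bluster::clusterRows(pcs, bluster::KmeansParam(centers = k))
})

# the graph-based version just swaps the parameter object
graph_clusters <- bluster::clusterRows(pcs, bluster::NNGraphParam(k = 20))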