Closed allyhawkins closed 1 year ago
I haven't gone into detail looking at your code, but I was confused on my first look: I would expect you to use the bluster::pairwiseRand() function to give a single value for each batch, something like
pairwiseRand(sce_list[[1]]$cluster, merged_sce$cluster[merged_sce$batch == 1], mode = "index")
but applied to each batch, most likely using the batch labels and purrr.
The key point here, I think, is that there will be n_batches ARI values each time, which is different from the previous ARI, where you were comparing a single set of batch labels for all samples.
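As a rough illustration of the per-batch idea, here is a minimal sketch on plain label vectors rather than SCE objects. The toy labels and the names pre_clusters, post_clusters, post_batch, and ari_by_batch are all hypothetical; only bluster::pairwiseRand() and purrr::imap() are real calls, and the sketch assumes the cells are in the same order before and after integration.

```r
library(bluster)
library(purrr)

# Hypothetical toy data: cluster labels for each batch before integration...
pre_clusters <- list(
  batch1 = c(1, 1, 1, 2, 2, 2),
  batch2 = c(1, 1, 2, 2, 3, 3)
)

# ...and cluster labels for the same cells (same order) after integration,
# along with the batch each cell came from.
post_clusters <- c("a", "a", "a", "b", "b", "b", "a", "a", "b", "b", "b", "b")
post_batch <- rep(names(pre_clusters), times = lengths(pre_clusters))

# One ARI per batch: compare the pre-integration labels to the
# post-integration labels of only the cells from that batch.
ari_list <- imap(pre_clusters, function(pre, batch_id) {
  post <- post_clusters[post_batch == batch_id]
  data.frame(
    batch_id = batch_id,
    ari = pairwiseRand(pre, post, mode = "index")
  )
})
ari_by_batch <- do.call(rbind, ari_list)
```

The result has one row per batch, so n_batches ARI values in total. In this toy case the first batch has the same partition before and after, while the second does not.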
Okay, looking at the code more, I think it is doing mostly what I would say, but it is a bit hard for me to track given the level of nesting of purrr. I think it would benefit from separating out some of the internal functions: one would take a single iteration of the clustering and calculate a table of ARIs for each batch. This should also make it easier to test whether the ARI calculation is doing what you expect (there should be no need for the cluster labels to match, but that is easier to test with a separate function).
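A small separate function makes that kind of check straightforward. The sketch below is only an illustration: batch_ari() is a hypothetical helper (not the function in this PR), and the label vectors are made up. It checks that renaming clusters leaves the ARI at 1, while a genuinely different partition scores lower.

```r
library(bluster)

# Hypothetical helper: ARI for one batch, given two sets of cluster
# assignments for the same cells in the same order.
batch_ari <- function(pre_clusters, post_clusters) {
  stopifnot(length(pre_clusters) == length(post_clusters))
  pairwiseRand(pre_clusters, post_clusters, mode = "index")
}

pre <- c(1, 1, 1, 2, 2, 3, 3, 3)

# Same partition of the cells, but with different cluster names.
relabeled <- c("C", "C", "C", "A", "A", "B", "B", "B")

# A different partition of the same cells.
regrouped <- c(1, 2, 3, 1, 2, 3, 1, 2)

ari_relabeled <- batch_ari(pre, relabeled)
ari_regrouped <- batch_ari(pre, regrouped)

# Label names should not matter; the grouping itself should.
stopifnot(isTRUE(all.equal(ari_relabeled, 1)))
stopifnot(ari_regrouped < 1)
```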
One more thought: I think the k-means code here may be inappropriate or may bias against good comparisons. We don't expect the same number of clusters pre- and post-integration, so fixing K is probably not the best move here. With B batches and C cell types, we expect B×C clusters before integration (assuming a big batch effect) and only C clusters after integration. So we might be better off with a single graph-based clustering for this metric.
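For reference, one way graph-based clustering could look with bluster is sketched below. This is an assumption about the approach, not the code in this PR: the simulated matrix pcs is a hypothetical stand-in for a PCA matrix (cells in rows), and k = 10 is an arbitrary choice of nearest neighbors. The point is that the number of clusters comes out of community detection on the graph rather than being fixed ahead of time.

```r
library(bluster)

set.seed(2022)

# Hypothetical stand-in for a cells-by-PCs matrix: two well-separated groups.
pcs <- rbind(
  matrix(rnorm(100 * 5, mean = 0), ncol = 5),
  matrix(rnorm(100 * 5, mean = 10), ncol = 5)
)

# Build a shared nearest-neighbor graph and cluster it; the number of
# clusters is determined by the community detection step, not set by us.
clusters <- clusterRows(pcs, NNGraphParam(k = 10))

n_clusters <- length(unique(clusters))
```

With an SCE object, the input matrix would presumably come from something like reducedDim(sce, "PCA"), with the same call applied to each individual object and to the integrated object.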
Thank you @jashapiro! This was what I needed. I was trying to get ARIs for each batch, but in keeping the downsampling and the testing of different values of k that we had previously used for the other ARI calculation, I think I was missing something. So I went ahead and did some major simplification for now, mostly to help with review and to make sure that the calculation is correct. We can then add downsampling and different values of k later, I think in a separate PR.
I didn't quite do what you suggested by breaking it out into a completely separate function, because once I simplified things I think it got a lot clearer. This now returns a data frame with just the ARI and the batch ID. I also switched to using graph-based clustering. With these updates, the ARI is now 1 for the simulated data. 🎉
I went through all the comments and made most of the suggested changes, with the exception of adding a pc_name argument to the new function to get a PCA matrix from the individual objects. If there are strong feelings I can add that in; I just didn't think it was entirely necessary for the purpose of that function. I also updated the check so that this can now only be used when all batches in a merged object are present, so I added the rest of the batches to the tests, and everything still looks good in terms of calculating ARI.
This is ready for another round of review.
Closes #202
Here I added a new function for calculating the ARI between clusterings pre- and post-integration. The function takes as input a list of individual SCE objects, before they go through merging, and the resulting integrated object containing the corrected PCs. For each individual SCE object and for the integrated object, we grab the PCs and then go through downsampling and clustering using k-means. Then, for each batch, the ARI is computed between the clusters from the individual SCE object and the clusters from the integrated object that pertain only to the batch being compared. The result is a data frame with ARI, batch ID, k, rep, and integration method.
A few notes: