Add function for calculating reverse ARI

allyhawkins commented 1 year ago

Closes #202

Here I added a new function for calculating the ARI between clusterings pre- and post-integration. The function takes as input a list of individual SCE objects, before they go through merging, and the resulting integrated object containing the corrected PCs. For each individual SCE object and for the integrated object, we grab the PCs and then go through downsampling and clustering using k-means. Then, for each batch, the ARI is computed between the clusters from the individual SCE object and the clusters from the integrated object that pertain only to the batch being compared. The result is a data frame with ARI, batch id, k, rep, and integration method.

A few notes:

I wanted to be able to test with either scpca or simulated data. For the simulated data we didn't have pre-merged PCA that had been calculated, so I added a small helper function that grabs HVGs and calculates PCA for each provided SCE object if it doesn't exist already. If it does exist, it just grabs the PC matrix.
For downsampling the integrated PCs, I had to do it slightly differently here, because after downsampling and clustering I need to subset to a vector of the cluster assignments for only the batch id that we are interested in. This means that it needs to downsample the same fraction of cells from each batch in the integrated object, which our existing function didn't do. So I did some of my own data wrangling to downsample before looping through values of k.
I am slightly concerned about the actual ARI values that I'm getting, but I believe what I'm doing is the right principle based on the discussion in OSCA. There they used the clusters obtained from each individual SCE object and compared them to the clustering in the integrated object, looking at clusters for only that batch. However, aren't the cluster assignments themselves arbitrary for each object? That is, the first cell listed will be in cluster 1, and if the next cell is similar it will be in that same cluster, otherwise it will be assigned to cluster 2, so comparing those labels directly seems error-prone. I'm getting a lot of ARIs around 0.01, which would make sense with arbitrary cluster assignments, but not if heterogeneity is preserved as they show in OSCA. I feel like I'm missing some sort of ordering or sorting step first, although I couldn't find that in any of the code in OSCA. Any ideas or thoughts on this would be helpful, to make sure I'm not missing something in the calculation.
I also added a few lines to the test script to test calculating this with the simulated data.
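
On the question of arbitrary cluster labels raised in the notes: the ARI is computed from the contingency table of the two label vectors, so it depends only on which cells are grouped together, not on what the clusters are called. A minimal base-R sketch of that property follows. The `adjusted_rand()` helper and the toy vectors are made up for illustration and are not part of the PR code.

```r
# Illustrative helper (not from the PR): adjusted Rand index from two label
# vectors of equal length, using only the contingency table of co-assignments.
# Note: it does not guard against the degenerate case of a single cluster.
adjusted_rand <- function(x, y) {
  stopifnot(length(x) == length(y))
  tab <- table(x, y)
  sum_cells <- sum(choose(tab, 2))          # pairs grouped together in both
  sum_rows <- sum(choose(rowSums(tab), 2))  # pairs grouped together in x
  sum_cols <- sum(choose(colSums(tab), 2))  # pairs grouped together in y
  expected <- sum_rows * sum_cols / choose(length(x), 2)
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

pre <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)

# Same grouping of cells, completely different label names
post_relabelled <- c("c", "c", "c", "a", "a", "a", "b", "b", "b")
adjusted_rand(pre, post_relabelled)  # 1

# Different grouping: every pre cluster is split across all post clusters
post_scrambled <- c(1, 2, 3, 1, 2, 3, 1, 2, 3)
adjusted_rand(pre, post_scrambled)   # -0.3333333
```

So no sorting or matching of labels should be needed before computing the ARI. Values near zero point to the groupings themselves disagreeing, or to the cells not being in the same order in the two vectors.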

jashapiro commented 1 year ago

I haven't gone into detail looking at your code, but I was confused in my first look: I would expect you to use the bluster::pairwiseRand() function to give a single value for each batch... something like

bluster::pairwiseRand(sce_list[[1]]$cluster, merged_sce[, merged_sce$batch == 1]$cluster, mode = "index")

But applied to each batch, using batch labels and purrr most likely...

The key point here, I think, is that there will be n_batches ARI values each time... which is different from the previous ARI where you were comparing a single set of batch labels for all samples.
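
A minimal sketch of the per-batch calculation described above, using plain vectors in place of the SCE objects. All object names and toy labels here are hypothetical, and the sketch assumes that the cells of each batch appear in the same order before and after integration.

```r
library(bluster)
library(purrr)

# Hypothetical toy inputs standing in for the real objects:
# cluster labels from each individual (pre-integration) object, one per batch
pre_clusters <- list(
  batch1 = c("1", "1", "2", "2", "3", "3"),
  batch2 = c("1", "1", "1", "2", "2", "2")
)

# Cluster labels and batch ids for every cell in the integrated object.
# Assumption: cells within a batch are in the same order as in pre_clusters.
integrated_clusters <- c("a", "a", "b", "b", "c", "c",
                         "b", "b", "b", "a", "a", "c")
integrated_batch <- rep(c("batch1", "batch2"), each = 6)

# One ARI per batch: compare that batch's pre-integration clusters with the
# integrated clusters restricted to the same batch
ari_list <- imap(pre_clusters, function(pre, batch_id) {
  post <- integrated_clusters[integrated_batch == batch_id]
  data.frame(
    batch_id = batch_id,
    ari = pairwiseRand(pre, post, mode = "index")
  )
})
ari_df <- do.call(rbind, ari_list)
ari_df
```

With these toy labels, `batch1` has identical groupings before and after (ARI of 1), while one `batch2` cluster is split after integration, so its ARI is below 1.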

jashapiro commented 1 year ago

Okay, looking at the code more, I think it is doing mostly what I described, but it is a bit hard for me to track given the level of nesting of purrr. I think it would benefit from separating out some of the internal functions: one would take a single iteration of the clustering and calculate a table of ARIs for each batch. This should also make it easier to test whether the ARI calculation is doing what you expect (there should not be a need for the cluster labels to match, but that is easier to test with a separate function).

jashapiro commented 1 year ago

One more thought: I think the k-means code here may be inappropriate/biasing against good comparisons. We don't expect the same number of clusters pre- and post-integration, so fixing k is probably not the best move here. With B batches and C cell types, we expect B × C clusters before integration (assuming a big batch effect), and only C clusters after integration. So we might be better off with a single graph-based clustering for this metric.
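
A rough sketch of what graph-based clustering on a PC matrix could look like with bluster, where the number of clusters is not fixed in advance. The simulated matrix is a stand-in for real PCs, and the parameter values are illustrative assumptions rather than recommendations.

```r
library(bluster)

set.seed(2022)
# Simulated stand-in for a cells x PCs matrix with two well-separated groups
pcs <- rbind(
  matrix(rnorm(100 * 10, mean = 0), ncol = 10),
  matrix(rnorm(100 * 10, mean = 10), ncol = 10)
)

# Build a nearest-neighbor graph and run community detection on it.
# Here k is the number of neighbors used to build the graph, not the number
# of clusters; the number of clusters is determined by the data.
clusters <- clusterRows(pcs, NNGraphParam(k = 10))
table(clusters)
```

The same call could be applied to the PCs of each individual object and to the corrected PCs of the integrated object, letting each clustering settle on its own number of clusters.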

allyhawkins commented 1 year ago

Thank you @jashapiro! This was what I needed. I was trying to get ARIs for each batch, but by keeping the downsampling and the testing of different values of k that we had previously been using for the other ARI calculation, I think I was missing something. So I went ahead and did some major simplification for now, mostly to help with review and to make sure that the calculation is correct. Then I think we can add downsampling and different values of k later in a separate PR.

I didn't quite do what you suggested by breaking it out into a completely separate function, because once I simplified things I think it got a lot clearer. This now returns a data frame with just the ARI and the batch ID. I also switched to using graph-based clustering. With these updates, the ARI is now 1 when looking at the simulated data. 🎉

allyhawkins commented 1 year ago

I went through all the comments and made most of the suggested changes, with the exception of adding a pc_name argument to the new function that gets a PCA matrix from the individual objects. If there are strong feelings I can add that in; I just didn't think it was entirely necessary for the purpose of that function. I also updated the check so that this can now only be used when all batches in a merged object are present. I then added the rest of the batches into testing, and everything still looks good in terms of calculating ARI.
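
For illustration, a check of this kind might look something like the base-R sketch below. The function name, arguments, and error message are hypothetical and are not taken from the PR.

```r
# Hypothetical helper: stop unless every batch found in the merged object
# has a matching individual object
check_batches <- function(individual_batches, merged_batches) {
  missing_batches <- setdiff(unique(merged_batches), individual_batches)
  if (length(missing_batches) > 0) {
    stop(
      "Batches in the merged object without an individual object: ",
      paste(missing_batches, collapse = ", ")
    )
  }
  invisible(TRUE)
}

# Passes: both batches in the merged object have an individual object
check_batches(c("batch1", "batch2"), c("batch1", "batch1", "batch2"))

# Fails: batch3 is in the merged object but was not provided individually
result <- tryCatch(
  check_batches(c("batch1", "batch2"), c("batch1", "batch2", "batch3")),
  error = function(e) conditionMessage(e)
)
result
```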

This is ready for another round of review.

AlexsLemonade / sc-data-integration

Add function for calculating reverse ARI #206