Function for comparing pre correction clustering to post correction clustering

In research focus, we discussed the idea of testing a new integration metric that measures the agreement between clustering assignments before and after integration for a given set of samples. This metric is intended to identify whether over-correction is happening, that is, whether all cells are being pushed together. To measure this, we would want to calculate the adjusted Rand index (ARI) between the pre-correction and post-correction cluster assignments. However, instead of calculating it across the entire merged SCE object, I believe we would want to obtain a separate ARI for each library or batch included in the integrated object.

We will want to create a new function that takes an integrated SCE object and a batch column and calculates the ARI between the pre-integration clustering and the integrated clustering results. For each batch, it would calculate an ARI using only the cluster assignments for the cells in that batch. I think we will also want to downsample and perform k-means clustering across a range of k for both the pre-integrated and integrated data, compare the results at each corresponding value of k, and do that for each batch.
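To make the per-batch comparison concrete, here is a minimal sketch in plain Python. It is only an illustration of the calculation, not the function proposed for this repo: the names `adjusted_rand_index` and `per_batch_ari` are hypothetical, the cluster assignments are assumed to have been computed already, and the downsampling and k-means steps are left out.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same cells."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: treat as trivially identical
    # Pairs of cells grouped together in both, in A only, and in B only
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / total_pairs
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case, e.g. both labelings place every cell in one cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def per_batch_ari(pre_clusters, post_clusters, batches):
    """ARI between pre- and post-integration clusters, within each batch."""
    results = {}
    for batch in sorted(set(batches)):
        idx = [i for i, b in enumerate(batches) if b == batch]
        results[batch] = adjusted_rand_index(
            [pre_clusters[i] for i in idx],
            [post_clusters[i] for i in idx],
        )
    return results


# Toy data (made up): batch A keeps its two clusters after integration,
# while batch B has its two clusters merged into one.
batches = ["A"] * 4 + ["B"] * 4
pre = [1, 1, 2, 2, 1, 1, 2, 2]
post = ["x", "x", "y", "y", "x", "x", "x", "x"]
result = per_batch_ari(pre, post, batches)
print(result)
```

In this toy example, batch A's clustering is preserved and batch B's is collapsed, so the per-batch values separate the two cases where a single ARI over all cells would blend them. Running the same comparison once per value of k would give the range of values described above.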

This is fairly similar to the current calculate_ari function that we have, but we need to consider adding in the pre-integration clustering. If we can modify that function, that would be great, but I don't want to risk compromising its accuracy, so it might be best to keep things separate for now while we test this out. The ultimate goal would be to produce a plot similar to the batch/celltype ARI plots that are currently in the summary notebook.

We should consult this section of OSCA for guidance when tackling this.

AlexsLemonade / sc-data-integration

Function for comparing pre correction clustering to post correction clustering #202