biosurf / cyCombine

Robust Integration of Single-Cell Cytometry Datasets

Clarification regarding `correct_data` and `batch_correct` in README and tutorial pages #58

Open denvercal1234GitHub opened 9 hours ago

denvercal1234GitHub commented 9 hours ago

Hi there,

Thanks again for all the help thus far.

Would you mind helping me understand the difference between the clustering approach in the README (create_som plus correct_data()) versus kohonen::som() plus ConsensusClusterPlus::ConsensusClusterPlus() plus batch_correct()?

Thank you for your help!

In the README, it appears that right after running prepare_data, we go directly to batch_correct(), which does the clustering under the hood.

Alternatively, we can run the clustering (pre-correction) separately to get the labels, followed by correct_data():

# Cluster to get labels
labels <- uncorrected %>%
  normalize(markers = markers,
            norm_method = "scale") %>%
  create_som(markers = markers,
             rlen = 10)

# Run batch correction
corrected <- uncorrected %>%
  correct_data(label = labels,
               covar = "condition")
saveRDS(corrected, file = "_data/cycombine_raw_corrected.RDS")

But in https://biosurf.org/cyCombine_Spectralflow_CyTOF.html, the clustering does not use create_som; instead, kohonen::som() is used directly, followed by meta-clustering with ConsensusClusterPlus and then batch_correct():

# Clustering with kohonen 10x10
set.seed(seed)
som_ <- spectral %>%
  dplyr::select(dplyr::all_of(overlap_markers)) %>%
  as.matrix() %>%
  kohonen::som(grid = kohonen::somgrid(xdim = 10, ydim = 10),
               rlen = 10,
               dist.fcts = "euclidean")

cell_clustering_som <- som_$unit.classif
codes <- som_$codes[[1]]

# Meta-clustering with ConsensusClusterPlus
mc <- ConsensusClusterPlus::ConsensusClusterPlus(t(codes), maxK = 35, reps = 100,
                                                 pItem = 0.9, pFeature = 1, plot = F,
                                                 clusterAlg = "hc", innerLinkage = "average", 
                                                 finalLinkage = "average",
                                                 distance = "euclidean", seed = seed)

# Run batch correction
corrected <- uncorrected %>%
  batch_correct(seed = seed,
                xdim = 3,
                ydim = 3,
                norm_method = 'rank',
                ties.method = 'average')
shdam commented 9 hours ago

Hey,

Thanks for your question :)

What you refer to was undoubtedly spam. It appears GitHub has removed the comment - could you remove your reference to it so the link is no longer exposed?

The vignette uses ConsensusClusterPlus because it was @cbligaard's preferred workflow for cell annotation. The annotation was used to substantiate the correction performance biologically but was not used for the correction. Since no labels were provided in batch_correct, a new clustering using create_som is done.

For additional context, create_som uses kohonen::som, and batch_correct is merely a wrapper of the normalize + create_som + correct_data workflow that you reference.
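For illustration, a rough sketch of that equivalence (a sketch only; grid sizes and other defaults may differ between the two paths):

# The three-step workflow from the README...
labels <- uncorrected %>%
  cyCombine::normalize(markers = markers, norm_method = "scale") %>%
  cyCombine::create_som(markers = markers, rlen = 10)

corrected <- uncorrected %>%
  cyCombine::correct_data(label = labels, markers = markers, covar = "condition")

# ...is roughly what the wrapper does internally
corrected <- uncorrected %>%
  cyCombine::batch_correct(markers = markers, norm_method = "scale",
                           rlen = 10, covar = "condition")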

I hope this clarifies things. And thanks for your continued use of cyCombine!

Best regards, Søren

denvercal1234GitHub commented 8 hours ago

Hi Søren,

Thank you so much for your response! I always appreciate your input and patience!

After thinking about it a bit more, I wonder whether you could help me assess whether my workflow below is reasonable.

To summarise, I have 3 different tissues (Tissue column) that I ran over 3 days (Batch_by_RunDay) for different donors. For each run day (my expected batch variable), I included the same donor PBMC sample, labelled "anchor" in the anchor column.

My analysis aims to identify differences between tissues and between donors within each tissue.

[Screenshot: sample metadata table with Tissue, Batch_by_RunDay, and anchor columns]

WORKFLOW:

### STEP 1: Detect batch effects
### The batch column is the same as "Batch_by_RunDay" in the metadata above

uncorrected |>
  cyCombine::detect_batch_effect(
    batch_col = 'batch',
    out_dir = ".....Singlets_Live/Detection_of_Batch/x20y20rlen25Euclidean/cyCombine_detect_batch_effect",
    xdim = 20, ydim = 20,
    markers = markers_cleaned,
    seed = 434,
    name = 'spectral_uncorrected',
    downsample = NULL,
    norm_method = "scale")
##### STEP 2: Clustering before correction to get labels
set.seed(434)
som_ <- uncorrected %>%
  cyCombine::normalize(markers = markers, norm_method = 'scale') %>%
  dplyr::select(dplyr::all_of(markers)) %>%
  as.matrix() %>%
  kohonen::som(grid = kohonen::somgrid(xdim = 15, ydim = 15),
               rlen = 20, dist.fcts = "euclidean")

codes <- som_$codes[[1]]

mc <- ConsensusClusterPlus::ConsensusClusterPlus(t(codes), maxK = 90, reps = 100,
                           pItem = 0.9, pFeature = 1, plot = 'png',
                           clusterAlg = "hc", innerLinkage = "average", finalLinkage = "average",
                           distance = "euclidean", seed = 434)

# Map the node-level meta-clusters (at a chosen k, e.g. 30) back to per-cell labels
labels <- mc[[30]]$consensusClass[som_$unit.classif]
###### STEP 3: Batch correction specifying both condition and anchor as a first pass
###### Using the per-cell labels derived above

corrected_STEP3 <- uncorrected %>%
  cyCombine::correct_data(label = labels, markers = markers, method = "ComBat",
                          covar = "Tissue", anchor = "anchor", ref.batch = NULL)

###### STEP 4: If STEP 3 produces confounding between the anchor and the batch in any of the clusters,
###### discard the STEP 3 results and use this step as the selected batch-corrected data

corrected_STEP4 <- uncorrected %>%
  cyCombine::correct_data(label = labels, markers = markers, method = "ComBat",
                          covar = "Tissue", anchor = NULL, ref.batch = NULL)
shdam commented 7 hours ago

Hey,

Your workflow is arguably overkill. xdim = 20, ydim = 20 in detect_batch_effect is a bit high - that produces 400 clusters. This makes it likely that the clustering becomes batch-driven, and batch effects may not be reasonably computable. I am curious what the output tells you there, but I suggest reducing it to somewhere between xdim = 4, ydim = 4 and xdim = 8, ydim = 8 (16 - 64 clusters).
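For instance, a smaller grid might look like this (a sketch; the other arguments as in your call above, and out_dir standing in for your output path):

uncorrected |>
  cyCombine::detect_batch_effect(batch_col = 'batch',
                                 xdim = 6, ydim = 6,   # 36 clusters
                                 markers = markers_cleaned,
                                 seed = 434,
                                 norm_method = "scale",
                                 out_dir = out_dir)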

There is nothing wrong with using ConsensusClusterPlus if you prefer. But generally, it suffices to run batch_correct directly, consolidating steps 2 and 4 into a single function. Step 3 will always confound in your setup; you will need to choose between using either tissue or anchor, where anchor is rarely the recommended choice. Feel free to try both ways.

In batch_correct, I would not recommend using much more than xdim = 8, ydim = 8, i.e., 64 clusters. The number of clusters you choose is fairly important. Fewer clusters are better at correcting heavy batch effects, while more clusters are fine if you don't expect too many batch effects. However, you should be mindful of how many cells end up in each cluster. Your entire workflow could reasonably become:

# Tissue-aware correction
corrected_tissue <- cyCombine::batch_correct(uncorrected, covar = "Tissue", markers = markers, seed = 434)
# Anchor-specific correction
corrected_anchor <- cyCombine::batch_correct(uncorrected, anchor = "anchor", markers = markers, seed = 434)

You can add norm_method, xdim/ydim, etc. to make it more explicit if you'd like.
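For example (a sketch; the values are illustrative placeholders, not recommendations):

# Tissue-aware correction with explicit settings
corrected_tissue <- cyCombine::batch_correct(uncorrected,
                                             covar = "Tissue",
                                             markers = markers,
                                             norm_method = "scale",
                                             xdim = 8, ydim = 8,
                                             rlen = 10,
                                             seed = 434)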

Hope this was helpful, and good luck with your analysis :)

Best regards, Søren

denvercal1234GitHub commented 7 hours ago

Thank you so much, Søren!

When you get a moment, could you help clarify the follow-up points below?

Q1. Is it accurate that if we decide to do clustering in order to use "label" in detect_batch_effect(), then we should use this same "label" in batch_correct()? And if we decide to skip the clustering (i.e., correct the data as a whole), then we just don't do any separate clustering step and simply specify label = NULL in both functions?

So, the entire workflow will just be (without any separate clustering steps, besides what is done within the function):

non_markers <- tolower(c(cyCombine::non_markers, "id", "Time", "FileName", "FileNo", "sample",
                         "batch", "condition", "anchor", "FSCA", "FSCH", "SSCA", "SSCBA",
                         "SSCBH", "SSCH", "CompAFA", "CompLIVEDEADBlueA"))

uncorrected <- cyCombine::prepare_data(data_dir = data_dir,
                                       metadata = paste0("", "/Sample_Info_noNA.xlsx"),
                                       filename_col = "channel",
                                       batch_ids = "Batch_by_RunDay",
                                       condition = "Tissue",
                                       anchor = "anchor",
                                       sample_ids = "File_Name_exportedSingletLive",
                                       cofactor = 6000,
                                       transform = FALSE,
                                       derand = FALSE,
                                       markers = markers,
                                       down_sample = FALSE,
                                       clean_colnames = TRUE,
                                       panel = panel,
                                       panel_antigen = "antigen",
                                       panel_channel = "antigen")

## Don't specify the label_col argument unless a pre-clustering step was done separately
uncorrected |>
  cyCombine::detect_batch_effect(batch_col = 'batch',
                                 out_dir = "..../x8y8rlen30Euclidean/cyCombine_detect_batch_effect",
                                 xdim = 5, ydim = 5,
                                 norm_method = "scale",
                                 markers = F64LiveSinglet_sfc_markers_cleaned,
                                 seed = 6157,
                                 name = 'F64LiveSinglet_spectral_uncorrected',
                                 downsample = NULL)

# Tissue-aware correction
corrected_tissue <- cyCombine::batch_correct(uncorrected, covar = "Tissue", markers = markers,
                                             seed = 434, label = NULL, rlen = 30,
                                             xdim = 5, ydim = 5, anchor = NULL)

# Anchor-specific correction
corrected_anchor <- cyCombine::batch_correct(uncorrected, anchor = "anchor", markers = markers,
                                             seed = 434, label = NULL, rlen = 30,
                                             xdim = 5, ydim = 5, covar = NULL)

Q2. If we expect some tissues to contain cells that are not present in other tissues, does clustering before batch detection (and correction) help the algorithm at all?

Q3. Is it possible to modify the UMAP generated by detect_batch_effect()? For example, I would like to "split.by" the UMAP by batch instead of "group.by" batch; with overlapping dots it is hard to see whether there is a batch effect.
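One workaround I am considering (a sketch, not cyCombine's own API) is to compute the UMAP manually and facet by batch:

library(uwot)
library(ggplot2)
library(dplyr)

set.seed(434)
sub <- dplyr::slice_sample(uncorrected, n = 20000)   # downsample for speed
um <- uwot::umap(as.matrix(sub[, markers]))          # same markers as above
sub$UMAP1 <- um[, 1]
sub$UMAP2 <- um[, 2]

# One panel per batch ("split.by"), instead of colouring by batch
ggplot(sub, aes(UMAP1, UMAP2)) +
  geom_point(size = 0.2, alpha = 0.4) +
  facet_wrap(~ batch)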

Q4. Do you think we can make any generalisation regarding values for xdim and ydim based on the number of cells? My whole dataset is 4M cells.
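For context, a quick back-of-envelope of average cells per SOM node (assuming cells spread evenly):

n_cells <- 4e6
grid <- c(5, 8, 15)                      # candidate xdim = ydim values
data.frame(xdim = grid,
           clusters = grid^2,
           avg_cells_per_cluster = round(n_cells / grid^2))
#   xdim clusters avg_cells_per_cluster
# 1    5       25                160000
# 2    8       64                 62500
# 3   15      225                 17778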

denvercal1234GitHub commented 4 hours ago

Update:

As reported in #55, the warning after running detect_batch_effect persisted. Note, however, that #55 followed the workflow with pre-computed labels, whereas the workflow above did not use any pre-clustered results as labels.

In #55, it was suggested to increase the binSize, but I am unsure how?

From the output below, it says there is no batch effect. But examining the gMFI of every marker (x-axis) for the anchor PBMC sample on each run day, I think there are differences. The MDS plots also suggest a batch effect.

[Screenshot: gMFI per marker for the anchor PBMC sample across run days]
> uncorrected |> cyCombine::detect_batch_effect(batch_col = 'batch', norm_method = "scale",  xdim = 5, ydim = 5,  out_dir = "....", markers= markers, seed =6157, name = 'F64LiveSinglet_spectral_uncorrected', downsample = NULL)

Determining new cell type labels using SOM:
Creating SOM grid..
Scaling expression data..
Warning: emd: Maximum number of iterations has been reached (500)Warning: emd: Maximum number of iterations has been reached (500)...
There are 0 markers that appear to be outliers in a single batch:

There are 0 clusters, in which a single cluster is strongly over- or underrepresented.
Making UMAP plots for up to 50,000 cells.

The output from detect_batch_effect()

[Screenshot: UMAP plots from detect_batch_effect()]

The output from `detect_batch_effect_express()`. Note the function no longer produces the log?

cyCombine::detect_batch_effect_express(uncorrected, batch_col = "batch", downsample = NULL, out_dir = '...')
[Screenshot: EMD summary from detect_batch_effect_express()]

The EMDs for CD8A, CD8B, and CD4 are the highest, but their histograms did not seem to suggest the same?

[Screenshots: distribution histograms for CD8A, CD8B, and CD4]