Open denvercal1234GitHub opened 9 hours ago
Hey,
Thanks for your question :)
What you refer to was undoubtedly spam. It appears GitHub has removed the comment - could you remove your reference to it so the link is no longer exposed?
The vignette uses ConsensusClusterPlus because it was @cbligaard's preferred workflow for cell annotation. The annotation was used to substantiate the correction performance biologically but was not used for the correction.
Since no labels were provided in batch_correct
, a new clustering using create_som
is done.
For additional context, create_som
uses kohonen::som
, and batch_correct
is merely a wrapper of the normalize
+ create_som
+ correct_data
workflow that you reference.
I hope this clarifies things.
And thanks for your continued use of cyCombine
!
Best regards, Søren
Hi Søren,
Thank you so much for your response! I always appreciate your inputs and patience!
After thinking a bit more about it, I wonder whether you could help me assess whether my Workflow below is reasonable?
To summarise, I had 3 different tissues (Tissue column) that I ran over 3 days (Batch_by_RunDay) for different donors. For each day of run (which is my expected batch variable), I included the same donor PBMC sample which is labelled as "anchor" in the anchor column.
My analysis aims to identify difference between tissues and between donor in each tissue.
WORKFLOW:
### STEP 1: Detect batch
### batch column is the same as "Batch_by_RunDay" in the metadata above
uncorrected |>
cyCombine::normalize(norm_method= "scale") |> cyCombine::detect_batch_effect(uncorrected, batch_col = 'batch', out_dir = ".....Singlets_Live/Detection_of_Batch/x20y20rlen25Euclidean/cyCombine_detect_batch_effect"), xdim = 20, ydim = 20, markers= markers_cleaned, seed = 434, name = 'spectral_uncorrected', downsample = NULL, norm_method = "scale")
##### STEP 2: Clustering before correction to get labels
som_ <- uncorrected %>%
cyCombine::normalize(markers = markers, norm_method = 'scale') %>%
dplyr::select(all_of(markers)) %>% as.matrix() %>%
kohonen::som(grid = kohonen::somgrid(xdim = 15, ydim = 15),
rlen = 20, dist.fcts = "euclidean")
codes <- som_$codes[[1]]
mc <- ConsensusClusterPlus(t(codes), maxK = 90, reps = 100,
pItem = 0.9, pFeature = 1, plot = 'png',
clusterAlg = "hc", innerLinkage = "average", finalLinkage = "average",
distance = "euclidean", seed = seed)
###### STEP 3: Batch correction by specifying both condition and anchor as first pass
###### Using the clustering result above as label
corrected_STEP3 <- uncorrected %>%
cyCombine::correct_data(label = mc, markers = markers, method = "ComBat", covar = "Tissue", anchor = "anchor", ref.batch = NULL)
###### STEP 4: If STEP 3 produces any confounding between the anchor and the batch in any of the cluster, then discard results from STEP 3 and perform this step as the selected batch corrected data
###### Using the clustering result above as label
corrected_STEP4 <- uncorrected %>%
cyCombine::correct_data(label = mc, markers = markers, method = "ComBat", covar = "Tissue", anchor = NULL, ref.batch = NULL)
Hey,
Your workflow is arguably overkill.
xdim = 20, ydim = 20
in detect_batch_effect
is a bit high - that produces 400 clusters. This makes it likely that the clustering becomes batch-driven, and batch effects may not be reasonably computable. I am curious what the output tells you there, but I suggest reducing it to somewhere between xdim = 4, ydim = 4
and xdim = 8, ydim = 8
(16 - 64 clusters).
There is nothing wrong with using ConsensusClusterPlus
if you prefer. But generally, it suffices to run batch_correct
directly, consolidating steps 2 and 4 into a single function. Step 3 will always confound in your setup; you will need to choose between using either tissue or anchor, where anchor is rarely the recommended choice. Feel free to try both ways.
In batch_correct
, I would not recommend using much more than xdim = 8, ydim = 8
, i.e., 64 clusters. The number of clusters you choose is fairly important. Fewer clusters are better at correcting heavy batch effects, while more clusters are fine if you don't expect too many batch effects. However, you should be mindful of how many cells end up in each cluster.
Your entire workflow could reasonably become:
# Tissue-aware correction
corrected_tissue <- cyCombine::batch_correct(uncorrected, covar = "Tissue", markers = markers, seed = 434)
# Anchor-specific correction
corrected_anchor <- cyCombine::batch_correct(uncorrected, anchor = "anchor", markers = markers, seed = 434)
You can add norm_method
, xdim
/ydim
, etc. to make it more explicit if you'd like.
Hope this was helpful, and good luck with your analysis :)
Best regards, Søren
Thank you so much, Søren!
When you get a moment, could you help clarify the followup points?
Q1 Is it accurate that if we decide to do clustering in order to use "label" in the detect_batch_effect()
, then we should use this same "label" in batch_correct()
? And, if we decide to just skip the clustering (i.e., correct the data as a whole), then we just don't do any separate clustering step and simply specify label=NULL
in both functions?
So, the entire workflow will just be (without any separate clustering steps, besides what is done within the function):
non_markers <- tolower(c(cyCombine::non_markers, "id", "Time", "FileName", "FileNo", "sample", "batch", "condition", "anchor", "FSCA", "FSCH", "SSCA", "SSCBA", "SSCBH", "SSCH", "CompAFA", "CompLIVEDEADBlueA"))
uncorrected <- cyCombine::prepare_data(batch_ids = "Batch_by_RunDay", condition = "Tissue", anchor = "anchor", sample_ids = "File_Name_exportedSingletLive", cofactor = 6000, transform = FALSE, derand = FALSE, markers = markers, down_sample=FALSE, data_dir = data_dir, metadata = paste0("", "/Sample_Info_noNA.xlsx"), filename_col = "channel", clean_colnames = TRUE, panel = panel, panel_antigen = "antigen", panel_channel = "antigen")
## Don't specify label_col argument unless did a pre-clustering step separately
uncorrected |> cyCombine::detect_batch_effect(uncorrected, batch_col = 'batch', out_dir = "..../x8y8rlen30Euclidean/cyCombine_detect_batch_effect", xdim = 5, ydim = 5, norm_method = "scale", markers= F64LiveSinglet_sfc_markers_cleaned, seed =6157, name = 'F64LiveSinglet_spectral_uncorrected', downsample = NULL)
# Tissue-aware correction
corrected_tissue <- cyCombine::batch_correct(uncorrected, covar = "Tissue", markers = markers, seed = 434, label = NULL, rlen=30, xim=5, ydim=5, anchor=NULL)
# Anchor-specific correction
corrected_anchor <- cyCombine::batch_correct(uncorrected, anchor = "anchor", markers = markers, seed = 434, label = NULL, rlen=30, xim=5, ydim=5, covar=NULL)
Q2. If we expect some tissues might contain cells that are not present in another tissue, does clustering before batch detection (and correction) help the algorithm at all?
Q3. Is it possible to modify the UMAP generated by detect_batch_effect()
? For example, I hope to "split.by" the UMAP by Batch instead of "group.by" Batch. The overlapping dots are hard to see whether there is a batch effect.
Q4. Do you think we can make any generalisation regarding the values for xdim, ydim based on number of cells? My whole dataset is 4M cells.
Update:
As reported in #55, the Warning after running up to detect_batch_effect
persisted. Note however that in #55, it followed the workflow that had pre-computed labels. Here the workflow (above) did not use any pre-clustered results as labels.
In #55, it was suggested to increase the binSize, but I am unsure how?
From the output below, it said there is no batch effect. But, by examining the gMFI per every marker (x-axis) for the PBMC batch sample for every day, I think there are differences. Also, MSD plots suggest there are batch effect.
> uncorrected |> cyCombine::detect_batch_effect(batch_col = 'batch', norm_method = "scale", xdim = 5, ydim = 5, out_dir = "....", markers= markers, seed =6157, name = 'F64LiveSinglet_spectral_uncorrected', downsample = NULL)
Determining new cell type labels using SOM:
Creating SOM grid..
Scaling expression data..
Warning: emd: Maximum number of iterations has been reached (500)Warning: emd: Maximum number of iterations has been reached (500)...
There are 0 markers that appear to be outliers in a single batch:
There are 0 clusters, in which a single cluster is strongly over- or underrepresented.
Making UMAP plots for up to 50,000 cells.
The output from detect_batch_effect()
The output from `detect_batch_effect_express(). Note the function no longer produces the log?
cyCombine::detect_batch_effect_express(uncorrected, batch_col = "batch", downsample = NULL, out_dir = '...')
EMD for CD8A, CD8B and CD4 are highest, but their histogram plotting did not seem to suggest the same?
Hi there,
Thanks again for all the help thus far.
Would you mind helping me understand the difference between the clustering method in README with
create_som
pluscorrect_data()
versuskohohen::som()
plusConsensusClusterPlus::ConsensusClusterPlus()
plusbatch_correct
?Thank you for your help!
In README, it appears that right after running
prepare_data
, we go directly tobatch_correct()
which under the scene do the clustering.Alternatively, we can run the clustering (pre-correction) separately to get the
labels
followed by runningcorrect_data()
But, in https://biosurf.org/cyCombine_Spectralflow_CyTOF.html, it appears to suggest that after running the clustering with kohonen which does not use
create_som
but rather thekohohen::som()
was used.