biosurf / cyCombine

Robust Integration of Single-Cell Cytometry Datasets

Clarification regarding `correct_data` and `batch_correct` in README and tutorial pages #58

Open denvercal1234GitHub opened 9 hours ago

denvercal1234GitHub commented 9 hours ago

Hi there,

Thanks again for all the help thus far.

Would you mind helping me understand the difference between the clustering approach in the README (create_som plus correct_data()) versus kohonen::som() plus ConsensusClusterPlus::ConsensusClusterPlus() plus batch_correct()?

Thank you for your help!

In the README, it appears that right after running prepare_data, we go directly to batch_correct(), which does the clustering under the hood.

Alternatively, we can run the clustering (pre-correction) separately to get the labels, followed by correct_data():

# Cluster to get labels
labels <- uncorrected %>%
  normalize(markers = markers,
            norm_method = "scale") %>%
  create_som(markers = markers,
             rlen = 10)

# Run batch correction
corrected <- uncorrected %>%
  correct_data(label = labels,
               covar = "condition")
saveRDS(corrected, file = "_data/cycombine_raw_corrected.RDS")

But in https://biosurf.org/cyCombine_Spectralflow_CyTOF.html, the clustering does not use create_som; instead, kohonen::som() is used directly, followed by meta-clustering with ConsensusClusterPlus and then batch_correct():

# Clustering with kohonen 10x10
set.seed(seed)
som_ <- spectral %>%
  dplyr::select(dplyr::all_of(overlap_markers)) %>%
  as.matrix() %>%
  kohonen::som(grid = kohonen::somgrid(xdim = 10, ydim = 10),
               rlen = 10,
               dist.fcts = "euclidean")

cell_clustering_som <- som_$unit.classif
codes <- som_$codes[[1]]

# Meta-clustering with ConsensusClusterPlus
mc <- ConsensusClusterPlus::ConsensusClusterPlus(t(codes), maxK = 35, reps = 100,
                                                 pItem = 0.9, pFeature = 1, plot = F,
                                                 clusterAlg = "hc", innerLinkage = "average", 
                                                 finalLinkage = "average",
                                                 distance = "euclidean", seed = seed)

# Run batch correction
corrected <- uncorrected %>%
  batch_correct(seed = seed,
                xdim = 3,
                ydim = 3,
                norm_method = 'rank',
                ties.method = 'average')
shdam commented 9 hours ago

Hey,

Thanks for your question :)

What you refer to was undoubtedly spam. It appears GitHub has removed the comment - could you remove your reference to it so the link is no longer exposed?

The vignette uses ConsensusClusterPlus because it was @cbligaard's preferred workflow for cell annotation. The annotation was used to substantiate the correction performance biologically but was not used for the correction. Since no labels were provided in batch_correct, a new clustering using create_som is done.

For additional context, create_som uses kohonen::som, and batch_correct is merely a wrapper of the normalize + create_som + correct_data workflow that you reference.
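For illustration, a rough sketch of that equivalence (a sketch only; grid sizes and other defaults may differ between the two paths):

# The three-step workflow from the README...
labels <- uncorrected %>%
  cyCombine::normalize(markers = markers, norm_method = "scale") %>%
  cyCombine::create_som(markers = markers, rlen = 10)

corrected <- uncorrected %>%
  cyCombine::correct_data(label = labels, markers = markers, covar = "condition")

# ...is roughly what the wrapper does internally
corrected <- uncorrected %>%
  cyCombine::batch_correct(markers = markers, norm_method = "scale",
                           rlen = 10, covar = "condition")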

I hope this clarifies things. And thanks for your continued use of cyCombine!

Best regards, Søren

denvercal1234GitHub commented 8 hours ago

Hi Søren,

Thank you so much for your response! I always appreciate your input and patience!

After thinking about it a bit more, I wonder whether you could help me assess whether my workflow below is reasonable.

To summarise, I have 3 different tissues (Tissue column) that I ran over 3 days (Batch_by_RunDay) for different donors. For each run day (my expected batch variable), I included the same donor PBMC sample, labelled "anchor" in the anchor column.

My analysis aims to identify differences between tissues and between donors within each tissue.

[Screenshot: sample metadata table with Tissue, Batch_by_RunDay, and anchor columns]

WORKFLOW:

### STEP 1: Detect batch effects
### The batch column is the same as "Batch_by_RunDay" in the metadata above

uncorrected |>
  cyCombine::detect_batch_effect(
    batch_col = 'batch',
    out_dir = ".....Singlets_Live/Detection_of_Batch/x20y20rlen25Euclidean/cyCombine_detect_batch_effect",
    xdim = 20, ydim = 20,
    markers = markers_cleaned,
    seed = 434,
    name = 'spectral_uncorrected',
    downsample = NULL,
    norm_method = "scale")
##### STEP 2: Clustering before correction to get labels
set.seed(434)
som_ <- uncorrected %>%
  cyCombine::normalize(markers = markers, norm_method = 'scale') %>%
  dplyr::select(dplyr::all_of(markers)) %>%
  as.matrix() %>%
  kohonen::som(grid = kohonen::somgrid(xdim = 15, ydim = 15),
               rlen = 20, dist.fcts = "euclidean")

codes <- som_$codes[[1]]

mc <- ConsensusClusterPlus::ConsensusClusterPlus(t(codes), maxK = 90, reps = 100,
                           pItem = 0.9, pFeature = 1, plot = 'png',
                           clusterAlg = "hc", innerLinkage = "average", finalLinkage = "average",
                           distance = "euclidean", seed = 434)

# Map the node-level meta-clusters (at a chosen k, e.g. 30) back to per-cell labels
labels <- mc[[30]]$consensusClass[som_$unit.classif]
###### STEP 3: Batch correction specifying both condition and anchor as a first pass
###### Using the per-cell labels derived above

corrected_STEP3 <- uncorrected %>%
  cyCombine::correct_data(label = labels, markers = markers, method = "ComBat",
                          covar = "Tissue", anchor = "anchor", ref.batch = NULL)

###### STEP 4: If STEP 3 produces confounding between the anchor and the batch in any of the clusters,
###### discard the STEP 3 results and use this step as the selected batch-corrected data

corrected_STEP4 <- uncorrected %>%
  cyCombine::correct_data(label = labels, markers = markers, method = "ComBat",
                          covar = "Tissue", anchor = NULL, ref.batch = NULL)
shdam commented 7 hours ago

Hey,

Your workflow is arguably overkill. xdim = 20, ydim = 20 in detect_batch_effect is a bit high - that produces 400 clusters. This makes it likely that the clustering becomes batch-driven, and batch effects may not be reasonably computable. I am curious what the output tells you there, but I suggest reducing it to somewhere between xdim = 4, ydim = 4 and xdim = 8, ydim = 8 (16 - 64 clusters).
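For instance, a smaller grid might look like this (a sketch; the other arguments as in your call above, and out_dir standing in for your output path):

uncorrected |>
  cyCombine::detect_batch_effect(batch_col = 'batch',
                                 xdim = 6, ydim = 6,   # 36 clusters
                                 markers = markers_cleaned,
                                 seed = 434,
                                 norm_method = "scale",
                                 out_dir = out_dir)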

There is nothing wrong with using ConsensusClusterPlus if you prefer. But generally, it suffices to run batch_correct directly, consolidating steps 2 and 4 into a single function. Step 3 will always confound in your setup; you will need to choose between using either tissue or anchor, where anchor is rarely the recommended choice. Feel free to try both ways.

In batch_correct, I would not recommend using much more than xdim = 8, ydim = 8, i.e., 64 clusters. The number of clusters you choose is fairly important. Fewer clusters are better at correcting heavy batch effects, while more clusters are fine if you don't expect too many batch effects. However, you should be mindful of how many cells end up in each cluster. Your entire workflow could reasonably become:

# Tissue-aware correction
corrected_tissue <- cyCombine::batch_correct(uncorrected, covar = "Tissue", markers = markers, seed = 434)
# Anchor-specific correction
corrected_anchor <- cyCombine::batch_correct(uncorrected, anchor = "anchor", markers = markers, seed = 434)

You can add norm_method, xdim/ydim, etc. to make it more explicit if you'd like.
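For example (a sketch; the values are illustrative placeholders, not recommendations):

# Tissue-aware correction with explicit settings
corrected_tissue <- cyCombine::batch_correct(uncorrected,
                                             covar = "Tissue",
                                             markers = markers,
                                             norm_method = "scale",
                                             xdim = 8, ydim = 8,
                                             rlen = 10,
                                             seed = 434)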

Hope this was helpful, and good luck with your analysis :)

Best regards, Søren

denvercal1234GitHub commented 7 hours ago

Thank you so much, Søren!

When you get a moment, could you help clarify the follow-up points below?

Q1. Is it accurate that if we decide to do clustering in order to use "label" in detect_batch_effect(), then we should use this same "label" in batch_correct()? And if we decide to skip the clustering (i.e., correct the data as a whole), then we just don't do any separate clustering step and simply specify label = NULL in both functions?

So, the entire workflow will just be (without any separate clustering steps, besides what is done within the function):

non_markers <- tolower(c(cyCombine::non_markers, "id", "Time", "FileName", "FileNo", "sample",
                         "batch", "condition", "anchor", "FSCA", "FSCH", "SSCA", "SSCBA",
                         "SSCBH", "SSCH", "CompAFA", "CompLIVEDEADBlueA"))

uncorrected <- cyCombine::prepare_data(data_dir = data_dir,
                                       metadata = paste0("", "/Sample_Info_noNA.xlsx"),
                                       filename_col = "channel",
                                       batch_ids = "Batch_by_RunDay",
                                       condition = "Tissue",
                                       anchor = "anchor",
                                       sample_ids = "File_Name_exportedSingletLive",
                                       cofactor = 6000,
                                       transform = FALSE,
                                       derand = FALSE,
                                       markers = markers,
                                       down_sample = FALSE,
                                       clean_colnames = TRUE,
                                       panel = panel,
                                       panel_antigen = "antigen",
                                       panel_channel = "antigen")

## Don't specify the label_col argument unless a pre-clustering step was done separately
uncorrected |>
  cyCombine::detect_batch_effect(batch_col = 'batch',
                                 out_dir = "..../x8y8rlen30Euclidean/cyCombine_detect_batch_effect",
                                 xdim = 5, ydim = 5,
                                 norm_method = "scale",
                                 markers = F64LiveSinglet_sfc_markers_cleaned,
                                 seed = 6157,
                                 name = 'F64LiveSinglet_spectral_uncorrected',
                                 downsample = NULL)

# Tissue-aware correction
corrected_tissue <- cyCombine::batch_correct(uncorrected, covar = "Tissue", markers = markers,
                                             seed = 434, label = NULL, rlen = 30,
                                             xdim = 5, ydim = 5, anchor = NULL)

# Anchor-specific correction
corrected_anchor <- cyCombine::batch_correct(uncorrected, anchor = "anchor", markers = markers,
                                             seed = 434, label = NULL, rlen = 30,
                                             xdim = 5, ydim = 5, covar = NULL)

Q2. If we expect some tissues to contain cells that are not present in other tissues, does clustering before batch detection (and correction) help the algorithm at all?

Q3. Is it possible to modify the UMAP generated by detect_batch_effect()? For example, I would like to "split.by" the UMAP by batch instead of "group.by" batch; with overlapping dots it is hard to see whether there is a batch effect.
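One workaround I am considering (a sketch, not cyCombine's own API) is to compute the UMAP manually and facet by batch:

library(uwot)
library(ggplot2)
library(dplyr)

set.seed(434)
sub <- dplyr::slice_sample(uncorrected, n = 20000)   # downsample for speed
um <- uwot::umap(as.matrix(sub[, markers]))          # same markers as above
sub$UMAP1 <- um[, 1]
sub$UMAP2 <- um[, 2]

# One panel per batch ("split.by"), instead of colouring by batch
ggplot(sub, aes(UMAP1, UMAP2)) +
  geom_point(size = 0.2, alpha = 0.4) +
  facet_wrap(~ batch)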

Q4. Do you think we can make any generalisation regarding values for xdim and ydim based on the number of cells? My whole dataset is 4M cells.
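For context, a quick back-of-envelope of average cells per SOM node (assuming cells spread evenly):

n_cells <- 4e6
grid <- c(5, 8, 15)                      # candidate xdim = ydim values
data.frame(xdim = grid,
           clusters = grid^2,
           avg_cells_per_cluster = round(n_cells / grid^2))
#   xdim clusters avg_cells_per_cluster
# 1    5       25                160000
# 2    8       64                 62500
# 3   15      225                 17778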

denvercal1234GitHub commented 4 hours ago

Update:

As reported in #55, the warning after running detect_batch_effect persisted. Note, however, that #55 followed the workflow with pre-computed labels, whereas the workflow above did not use any pre-clustered results as labels.

In #55, it was suggested to increase the binSize, but I am unsure how?

From the output below, it says there is no batch effect. But examining the gMFI of every marker (x-axis) for the anchor PBMC sample on each run day, I think there are differences. The MDS plots also suggest a batch effect.

[Screenshot: gMFI per marker for the anchor PBMC sample across run days]
> uncorrected |> cyCombine::detect_batch_effect(batch_col = 'batch', norm_method = "scale",  xdim = 5, ydim = 5,  out_dir = "....", markers= markers, seed =6157, name = 'F64LiveSinglet_spectral_uncorrected', downsample = NULL)

Determining new cell type labels using SOM:
Creating SOM grid..
Scaling expression data..
Warning: emd: Maximum number of iterations has been reached (500)Warning: emd: Maximum number of iterations has been reached (500)...
There are 0 markers that appear to be outliers in a single batch:

There are 0 clusters, in which a single cluster is strongly over- or underrepresented.
Making UMAP plots for up to 50,000 cells.

The output from detect_batch_effect()

[Screenshot: UMAP plots from detect_batch_effect()]

The output from `detect_batch_effect_express()`. Note the function no longer produces the log?

cyCombine::detect_batch_effect_express(uncorrected, batch_col = "batch", downsample = NULL, out_dir = '...')
[Screenshot: EMD summary from detect_batch_effect_express()]

The EMDs for CD8A, CD8B, and CD4 are the highest, but their histograms did not seem to suggest the same?

[Screenshots: distribution histograms for CD8A, CD8B, and CD4]