biosurf / cyCombine

Robust Integration of Single-Cell Cytometry Datasets
Other
23 stars 6 forks

detect_batch_effect is not working when I do not use your transformation option #47

Open KCanziani opened 4 months ago

KCanziani commented 4 months ago

Hi,

I hope you can help me with this. I am trying to analyze flow cytometry data acquired on a Cytek Aurora. I have 230 files, but I am using only 12 files from 4 different batches to see if I can optimize the code. When I run prepare_data with transform = TRUE, detect_batch_effect works perfectly (it takes 2 min to run),

uncorrected <- prepare_data(data_dir = file.path(directory, data_dir),
                            metadata = metadata_file,
                            filename_col = "memory_cells_file",
                            batch_ids = "Run",
                            down_sample = TRUE,
                            markers = markers,
                            seed = seed,
                            transform = TRUE,
                            cofactor = 1000,
                            derand = FALSE)

but if I previously transform the data and use prepare_data with transform = FALSE, it never finishes running:

uncorrected <- prepare_data(data_dir = file.path(directory, data_dir),
                            metadata = metadata_file,
                            filename_col = "memory_cells_file",
                            batch_ids = "Run",
                            down_sample = TRUE,
                            markers = markers,
                            seed = seed,
                            transform = FALSE,
                            derand = FALSE)

detect_batch_effect(uncorrected,
                    batch_col = 'batch',
                    out_dir = paste0(data_dir, '/batch_effect_check'),
                    seed = seed,
                    name = 'GMAP')

Output:

    Determining new cell type labels using SOM:
    Creating SOM grid..
    Scaling expression data..
    Error: vector memory exhausted (limit reached?)

I compared the input files for detect_batch_effect; both are tibbles of the same size and variables. The transformation I did is the same as yours, but I need to use a different cofactor for each channel, which is why I can't use your function. Thank you in advance.
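For illustration, a per-channel asinh transformation along these lines (the marker names and cofactor values below are placeholders, not the actual ones used) might look like:

```r
library(dplyr)

# Hypothetical per-marker cofactors; real values depend on the panel
cofactors <- c("CD3" = 5000, "CD8" = 3000, "CD4" = 1000)

# Apply asinh with a marker-specific cofactor to each channel
transformed <- raw_data %>%
  mutate(across(all_of(names(cofactors)),
                ~ asinh(.x / cofactors[[cur_column()]])))
```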

shdam commented 4 months ago

Hi Karina,

Thank you for using cyCombine!

Does the problem persist if you give the markers to detect_batch_effect?

I ask because the transformation step will subset to the markers you give. If you don't give markers to detect_batch_effect, it will try to guess which they are, which is fine when you subset after transformation but can be a problem otherwise.
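Concretely, that would mean passing the same marker vector used in prepare_data, e.g. (mirroring the call earlier in this thread):

```r
detect_batch_effect(uncorrected,
                    batch_col = "batch",
                    markers = markers,  # explicit subset instead of guessing
                    out_dir = paste0(data_dir, "/batch_effect_check"),
                    seed = seed,
                    name = "GMAP")
```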

Let me know if it is still running slow :)

I will add marker-specific cofactors to the list to implement.

Other suggestions for optimization: You can provide labels when batch correcting, which means you can cluster using FlowSOM on a flowSet, which might be more memory efficient. You can also batch-correct each label group separately, so you don't need all data in memory at once. These are optimizations I would like to implement when I get the time.

Best of luck with your analysis! :)

Best regards, Søren

KCanziani commented 4 months ago

Hi Søren,

Thank you for getting back to me so quickly. I gave the markers or label_col to detect_batch_effect and the problem persists. I even reduced the number of samples to 3 and it is the same (each file has around 50 thousand cells). If I transform during prepare_data, it runs in 1 or 2 min, but if I don't, it never finishes. I do not know what else I can try. If you can implement marker-specific cofactors, it would be really useful.

Edit: I was able to run it on a more powerful computer with only 3 samples, and I realized that the markers parameter in detect_batch_effect is not doing anything; it analyzes all the channels. I don't know if label_col is working either. So, I subsetted the channels before running detect_batch_effect to avoid the problem with the markers parameter, but it still runs much slower than when I transformed with your function.

detect_batch_effect(uncorrected_2,
                    batch_col = 'batch',
                    out_dir = paste0(data_dir, '/batch_effect_check'),
                    seed = seed,
                    name = 'GMAP',
                    markers = selected_markers,
                    label_col = "label")

Thank you so much.

Best, Karina

shdam commented 4 months ago

Hi Karina,

I am sorry to hear it is still causing issues - but thank you for the added details!

I find it odd the markers parameter doesn't seem to work. Which version of cyCombine are you using?

I think the problem is the range of values between your markers. detect_batch_effect uses the Earth Mover's Distance to estimate batch effects, which bins your expression values from min to max with a binsize of 0.1. This is fine when values are not above 10, but your more fine-tuned cofactor transformations might return larger values. What is the distribution of min-max values of your marker expression with your transformation compared to using our function?
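One quick way to check this (assuming `markers` is the marker vector from prepare_data) is to compare the per-marker value ranges of the two tibbles; values far above ~10 mean many more 0.1-wide bins for the EMD computation:

```r
library(dplyr)

# Per-marker min and max of the expression values; large maxima suggest
# the binning, not the data size, is what blows up memory and runtime
uncorrected %>%
  summarise(across(all_of(markers), list(min = min, max = max)))
```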

If that turns out to be the issue, I recommend either re-evaluating your cofactors or skipping the detect_batch_effect step - it has no implications for the batch correction but is simply a quantitative measure of the level of batch effects. I will look into making the binsize an input parameter, so you can adjust that in the future.

I hope this helps you find the issue :)

Best regards, Søren

shdam commented 2 months ago

Hi Karina,

Did you have any luck running detect_batch_effect?

I added binSize to compute_emd long ago, but forgot to include it in detect_batch_effect. But I did that now on the dev branch. If you'd like, you can now play around with the binSize setting there, in case that is helpful.
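Trying that out would look something like the following (the binSize value is only illustrative; this assumes the `remotes` package for installing the dev branch):

```r
# Install the dev branch of cyCombine to get binSize in detect_batch_effect
remotes::install_github("biosurf/cyCombine", ref = "dev")

library(cyCombine)

# A coarser bin size keeps the number of EMD bins manageable
# when transformed values span a wide range
detect_batch_effect(uncorrected,
                    batch_col = "batch",
                    seed = seed,
                    binSize = 1)  # default is 0.1
```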

Let me know if you are still facing issues.

Best regards, Søren

denvercal1234GitHub commented 2 months ago

Hi @shdam - I wonder if you might have a tutorial script somewhere for people (a lot, especially those transitioning from CytoNorm, for example) to adapt cyCombine for flow cytometry data?

shdam commented 2 months ago

Hey @denvercal1234GitHub,

Thanks for your question!

I suggest giving our vignettes a look - especially the ones using Spectral Flow data. Regular flow would be similar, but a cofactor of around 150 would be more appropriate. Also note that the vignettes do a clustering step after prepare_data. You can just go straight to the batch correction after prepare_data. That said, what you would need to change from the code in the github README is the cofactor = 150 and derand = FALSE in prepare_data. The vignettes add some additional descriptions, but with those two changes, it should work.
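Following that advice, a minimal prepare_data call for regular flow might look like this (paths and column names are placeholders; only the cofactor and derand settings come from the advice above):

```r
library(cyCombine)

# cofactor = 150 is the suggested starting point for regular flow;
# derand = FALSE because derandomization applies to CyTOF data
uncorrected <- prepare_data(data_dir = "data/flow_fcs",
                            metadata = "metadata.xlsx",
                            filename_col = "filename",
                            batch_ids = "batch",
                            cofactor = 150,
                            derand = FALSE)
```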

Thanks for checking out cyCombine! And let me know if you have any questions :)

Best regards, Søren

denvercal1234GitHub commented 2 months ago

Thank you @shdam!

Might I ask for clarification regarding the arguments for spectral flow in prepare_data() and clustering steps before batch correction?

My data is spectral flow data of 3 different tissue types from different donors. I ran them on 3 different days; each day has 2 samples from a tissue type and 2 different donors. On each day, I also included the same PBMC sample as a reference for batch correction.

My aim is to analyze differences between tissues and assess donor-to-donor variation. So, I want to batch align the technical difference = day_of_run (i.e., D1, D2, D3).

I am following this tutorial, specifically the spectral flow part: https://biosurf.org/cyCombine_Spectralflow_CyTOF.html

Q1. I specified batch_ids = "Batch_by_RunDay", but what should I specify for condition? Or can I specify anything for condition in case I want to include it as a covariate for batch alignment?

Q2. Similarly, it looks like sample_ids is donor ID. Specifying it here just gives info and is not automatically used for alignment, right? In my case, donor variability is biological variation which I do not want to align.

Q3. In my case, as described, I have a PBMC as reference batch control, should I specify ref.batch argument in cyCombine::batch_correct in addition to batch_ids = "Batch_by_RunDay" in prepare_data()? If so, can you give an example of the input for this argument?

Q4. Each sample has a different number of cells. Do you think I should downsample to the smallest number of cells for batch alignment and later clustering? If I do, I was thinking of setting down_sample = TRUE, sampling_type = "sample_ids", and sample_ids = "File_Name", because my File_Name is unique for each sample (whereas Donor_ID would not be, as I had the same PBMC sample as a ref sample). I suppose when sampling_type = "sample_ids", it will just ignore sample_size = 5e+05?

Q5. What is the advantage of doing the clustering before batch correction?

Q6. Is there any guideline to select values for kohonen::som (i.e., xdim, ydim, rlen, and dist.fcts) for spectral flow data clustering?

F64CD8_spectral <- cyCombine::prepare_data(data_dir = data_dir,
                         metadata = paste0("", "/F64_UnmixedV3_Sample_Info_noNA.xlsx"),
                         filename_col = "File_Name_exportedCD8_channel",
                         batch_ids = "Batch_by_RunDay",
                         condition = "Tissue",
                         sample_ids = "File_Name",
                         cofactor = 6000, transform = FALSE, 
                         derand = FALSE,
                         markers = F64CD8_sfc_markers,
                         down_sample = TRUE, sample_size = 5e+05, sampling_type = "sample_ids")

Thank you so much!

denvercal1234GitHub commented 2 months ago

Updated regarding Q4. If I set down_sample = TRUE, sample_size = 5e+05, sampling_type = "sample_ids" as above, the running log shows it actually takes 5e+05 cells per sample -- which is not what I want. So, what I needed to do was manually set sample_size = min(F64_UnmixedV3_Sample_Info_noNA$eventCounts).
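Spelled out, that workaround (reusing the names from the prepare_data call above) would be:

```r
# Downsample every sample to the size of the smallest file
min_cells <- min(F64_UnmixedV3_Sample_Info_noNA$eventCounts)

F64CD8_spectral <- cyCombine::prepare_data(
  data_dir = data_dir,
  metadata = paste0("", "/F64_UnmixedV3_Sample_Info_noNA.xlsx"),
  filename_col = "File_Name_exportedCD8_channel",
  batch_ids = "Batch_by_RunDay",
  sample_ids = "File_Name",
  markers = F64CD8_sfc_markers,
  down_sample = TRUE,
  sample_size = min_cells,       # equal cell count per sample
  sampling_type = "sample_ids"
)
```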

shdam commented 1 month ago

Hey @denvercal1234GitHub, thanks for your questions!

Q1. The condition is indeed meant for the covariate you want to account for in the correction. In your case, since not all batches have all tissue types, you might run into cases where they confound with the batch. Try to include it as condition/covariate and see if it works out - you should receive messages if they confound, and if they do, the covariate will be ignored in the correction. Alternatively, you can use the anchor argument, which is meant for your situation, although it isn't a fully evaluated implementation. As described at the bottom of the GitHub README, it works by setting a variable that is equal for all your PBMC samples and different for all other samples. This can conveniently be done in a metadata file; you then provide the column name with prepare_data(anchor = "anchor_colname"). Anchor does work with other covariates, but they often confound, so it isn't a perfect solution - nothing beats having all covariates in all batches.
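For instance, such an anchor column could be built in the metadata like this (the Sample_Type column and its values are hypothetical; only the "equal for all PBMC samples, unique otherwise" pattern comes from the description above):

```r
library(dplyr)

# Same value for every PBMC reference sample, a unique value otherwise
metadata <- metadata %>%
  mutate(anchor = if_else(Sample_Type == "PBMC_reference",
                          "pbmc_ref",
                          File_Name))
```

The column name ("anchor" here) is then what gets passed to prepare_data via the anchor argument.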

Q2. sample_id has no implications for correction. It is merely there to know which cells are from which donor/file/whatever.

Q3. ref.batch is a ComBat parameter that informs which batch to remain unchanged. So it is unrelated to your reference samples. In my experience, corrections perform better without using ref.batch, but it is there for those who want it.

Q4. Yes, it will take that number of cells per sample (or all cells if there are fewer), so what you suggest would be a way to sample an equal number of cells from each sample. That said, I would not expect the low number of cells to play a huge role in the correction, as long as it is not in a unique condition of that batch.

Q5. It ensures that correction is done within similar-looking cells, thus greatly reducing the chances of removing biological differences. The downside is that similar cells might look very different between batches (if they are distinct technologies), but as we show in the vignettes, those are still possible to integrate; the xdim/ydim (i.e., number of clusters) play a huge role in the success there.

Q6. I suggest trying the default settings to see how it looks. If you aren't convinced, lowering xdim and ydim to 5 could improve performance. You cannot change dist.fcts as of now - let me know if you'd like to be able to. rlen usually works fine around 10, but kohonen::som defaults to 100, so anything in between is sensible - it improves clustering robustness but increases runtime a fair bit. I almost always use norm_method = "scale"; rarely is it outperformed by "rank".

I hope this helps answer your questions in a meaningful way. Let me know if you have further questions. And thanks for trying out cyCombine!

Best regards, Søren

denvercal1234GitHub commented 1 month ago

Hi, Søren,

Thanks so much for your responses!

Q1. Do you think it is best for my case to instead use the following arguments for batch_correct? I set covar = "anchor" and anchor = NULL. I did not want to include condition (Tissue type) as covar and use anchor = "anchor", because tissue differences are part of my biological question, as is donor variation. I suppose covar = "anchor" and anchor = NULL is the same as covar = NULL and anchor = "anchor"?

Note: My anchor again is a PBMC sample in each batch (batch being day of run), and as you correctly pointed out, I do not have other tissue types in each batch, which would have been ideal (albeit practically challenging lol) to have.

corrected <- uncorrected %>% cyCombine::batch_correct(seed = seed, covar="anchor", markers=NULL, 
                xdim = 8,
                ydim = 8, rlen=50,
                norm_method = 'rank', 
                ties.method = 'average', anchor=NULL, label=NULL, ref.batch = NULL, method="ComBat")
shdam commented 1 month ago

Happy to help :)

As you say, covar and anchor are functionally identical - both are there mainly to highlight the utility. I generally prefer norm_method = 'scale', but rank should do you fine. If it can run, I would prefer covar = 'tissue' over covar = 'anchor' because that gives the model more power if they don't confound with batch. Setting tissue type as covariate ensures the biological differences are not mixed. Using the anchor instead will restrict the correction to solely be based on differences between your PBMC samples, which might be affected differently by the technical effects than other tissue types. Still, it's better than not specifying a covar, although the clustering should also account for this problem and excluding covariates could turn out fine - ComBat is quite good at retaining biological differences.

Bear in mind, with markers = NULL, all columns not in non_markers will be assumed to be a marker. You can add metadata column names with non_markers <- c(cyCombine::non_markers, "col_name") if need be.

Let me know how it goes :)

Best regards, Søren

denvercal1234GitHub commented 1 month ago

Thanks Søren.

Just to confirm I understood your thoughts -- you suggested trying covar = "Tissue" and anchor = "anchor" as below? The reason I am unclear is that I would have thought that if I set covar = "Tissue", the function will also try to adjust by "Tissue" -- which would not be what I want, since one of my research aims is to assess differences between Tissues?

Could you also explain where I specify the non_markers in cyCombine::batch_correct? Because you are right; when I examined corrected, my columns "FileNo" and "FileName", which are characters, also got "corrected" lol ...

Thank you again for your help!

corrected <- uncorrected %>% cyCombine::batch_correct(seed = seed, covar="Tissue", markers=NULL, 
                xdim = 5,
                ydim = 5, rlen=50,
                norm_method = 'scale', 
                ties.method = 'average', anchor="anchor", label=NULL, ref.batch = NULL, method="ComBat")
shdam commented 1 month ago

covar works the opposite way from what you are assuming. It informs which groups to retain biological differences between. It will likely not work alongside anchor, but you can try - read the messages it gives carefully to make sure it does what you expect. covar should be whatever column name contains tissue information.

You simply define non_markers before running the correction - there is nothing to specify in the call itself. It could look like this:

non_markers <- c(cyCombine::non_markers, "FileNo", "FileName", "Tissue")
corrected <- uncorrected %>% cyCombine::batch_correct(seed = seed, covar="Tissue", markers=NULL, 
                xdim = 5,
                ydim = 5, rlen=50,
                norm_method = 'scale', 
                ties.method = 'average', anchor="anchor", label=NULL, ref.batch = NULL, method="ComBat")

If you want to make sure it uses the correct columns, you can get them with: cyCombine::get_markers(uncorrected)

denvercal1234GitHub commented 1 month ago

Thank you, Søren!

Q1. I had pre-gated the cells on singlets (using geometric gates, e.g., FSC), live, CD3+ CD8+. Do you think I should specify the geometric channels, CD3, CD4, CD8 and Live-Dead, in addition to "id", "Time", "FileName", "sample", "batch", "condition", "anchor", etc. (of the uncorrected), as non-markers so that these will not get corrected in cyCombine::batch_correct (and also in cyCombine::detect_batch_effect)?

And, what about "Comp-AF-A"? I used spectral flow unmixing algorithm with autofluorescence subtraction.

So like this:

non_markers <- c(cyCombine::non_markers, "id","FSC-A","FSC-H", "SSC-A", "SSC-B-A", "SSC-B-H", "SSC-H","Comp-AF-A","CD8", "Time","FileName", "FileNo", "sample", "batch", "condition", "anchor")

Or, is the recommendation that we should examine each marker (including markers used to pre-gate the populations before inputting into R, except geometric markers) and decide which markers to align, such as through cyCombine::detect_batch_effect? But I wondered whether it makes sense to only correct certain markers that differ between batches?

Q2. When examining and correcting for batch, should we just input in pre-gated singlets instead of pre-gated live CD3+ CD8+ cells as in my case?

Note that I am mostly interested in CD8 cells, and I will not be using CD3, CD8, Live-dead, and AF markers for clustering later as I would pre-gate CD8+ population as input into downstream clustering.

shdam commented 1 month ago

Thanks for your questions :)

Q1. I would say it is best to include all markers in the correction (except geometric ones) because they can help inform corrections for other markers. "sample", "batch", "condition", "anchor" are already in cyCombine::non_markers, so adding them is redundant.

Q2. Just input the cells you are interested in. If you expect low diversity in those cells, reduce xdim/ydim to cover a slightly bigger range than you expect. On the dev branch, you can also use FlowSOM metaclustering to achieve the number of clusters you expect and correct within these (using the cluster_mode and nClus parameters). If you wish, you can also set label = "1", and it will correct all cells without clustering. ComBat will try to retain biological differences (easier with covar = "Tissue"/condition) - this is mostly relevant if you expect very little heterogeneity. There is no need to exclude CD3 and CD8, as they help inform the correction.
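As a sketch, the no-clustering option mentioned in Q2 could look like this (other arguments follow the earlier calls in this thread):

```r
# label = "1" puts all cells in one group, so the correction runs
# without a clustering step - suitable for very homogeneous input
corrected <- uncorrected %>%
  cyCombine::batch_correct(seed = seed,
                           covar = "Tissue",
                           label = "1",
                           norm_method = "scale",
                           method = "ComBat")
```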

I hope this was helpful :)

Best regards, Søren

denvercal1234GitHub commented 1 month ago

Thank you Søren.

Just a follow-up on Q2. Because CD8 T cells are my cells of interest, I pregated on them as input for correction (instead of taking the whole set of singlets as input). But what happens if markers CD3, CD8, and Live-Dead were also different between batches - would pre-gating using these markers prevent the algorithm from identifying them as different between batches and therefore not adjusting them? Or would pregating already take care of these markers if there were any misalignment between batches?

shdam commented 1 month ago

cyCombine doesn't know what you did. It will correct all markers you give it and all cells you give it, retaining biological differences to the best of its ability.
Pregating the cells only impacts the heterogeneity of your cells, making the clustering step less essential.

I hope this answers your question :)

Best regards, Søren

denvercal1234GitHub commented 1 month ago

Thanks Søren.

Follow up on Q1. After running

non_markers <- c(cyCombine::non_markers, "id", "Time", "FileName", "FileNo", "FSC-A", "FSC-H", "SSC-A", "SSC-B-A", "SSC-B-H", "SSC-H", "Comp-AF-A")

cyCombine::get_markers(F64Singlet_spectral) (run before batch_correct()) still returned the geometric channels and Time, FileName, and FileNo. Is this expected behaviour?

F64Singlet_spectral is my uncorrected

Thank you again!!

shdam commented 1 month ago

Ah, that is a case problem I hadn't realized before. Thanks for pointing me to it! If you define non_markers in lowercase, it should work as expected, i.e.: non_markers <- tolower(c(cyCombine::non_markers, "id", "Time", "FileName", "FileNo", "FSC-A", "FSC-H", "SSC-A", "SSC-B-A", "SSC-B-H", "SSC-H", "Comp-AF-A"))

I will fix this on the dev branch soon.

Best regards, Søren