normCytof failed for large number of samples

jeffsun905 commented 2 months ago

Hi, we are trying to process over 150 samples, and it failed at normCytof (a critical step to make the samples comparable, as the data came from multiple batches) with a segmentation error. Our guess is that it ran out of memory or something related. Is there a limit on how many samples CATALYST can handle, and do you have any suggestions for running a large project like this?

SamGG commented 2 months ago

You are probably trying to normalize the whole set at once. If you choose a reference FCS, the process just has to normalize against that reference, meaning that only 2 FCS files are loaded in memory at the same time. The vignette only demonstrates how to use the functions. Instead of starting from scratch (or from the vignette), you'll save a lot of energy by examining the (perhaps too many) published workflows, such as https://github.com/prybakowska/CyTOF_analysis_Pipeline1/blob/master/pipeline.R at line 16 (and associated functions) or https://github.com/prybakowska/CytoQP/blob/master/CytoQP_script.R. Alternatively, you may use the nearly original bead normalization given at https://biosurf.org/cytof_data_scientist.html#313_Performing_bead_normalization. HTH
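The one-file-at-a-time idea above could be sketched roughly as follows. This is only an illustration, not code from the linked pipelines: `normalize_each`, `files`, `ref`, and `out_dir` are hypothetical names, and it assumes that `normCytof()` accepts an FCS path via `norm_to` and that `sce2fcs()` plus `flowCore::write.FCS()` are an acceptable way to write the result back to disk.

```r
# Sketch only: normalize FCS files one at a time against a shared reference,
# so that only the current file and the reference are held in memory.
# `normalize_each`, `files`, `ref` and `out_dir` are hypothetical names.
normalize_each <- function(files, ref, out_dir = "normalized", k = 50) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  out <- file.path(out_dir, basename(files))
  for (i in seq_along(files)) {
    # Read a single FCS file into a SingleCellExperiment.
    sce <- CATALYST::prepData(files[i])
    # Normalize this file to the bead baseline of the reference file.
    res <- CATALYST::normCytof(sce, beads = "dvs", k = k,
                               norm_to = ref, plot = FALSE)
    # Write the normalized counts back to disk as FCS.
    ff <- CATALYST::sce2fcs(res$data, assay = "counts")
    flowCore::write.FCS(ff, out[i])
    # Drop the large objects before moving on to the next file.
    rm(sce, res, ff)
    gc()
  }
  invisible(out)
}
```

A call might look like `normalize_each(list.files("fcs", full.names = TRUE), ref = "fcs/myfcs1.fcs")`. It is worth inspecting the before/after bead plot for a few files (`plot = TRUE`) before running the whole set.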

jeffsun905 commented 2 months ago

Thank you very much for the hints. Yes, I was trying to do that. The function worked fine for ~50 samples, so I thought it should be fine since our system has ~1 TB of memory. While working on the suggested pipeline, which does one-vs-one normalization, I also tested splitting the big data set into small batches and providing one batch at a time to the normCytof function with a common reference sample specified. To my surprise, the line plot is very different from the one produced without specifying a reference (i.e., all provided samples together; here I tested just 5 samples): the "after" lines are all close to 0, whereas without a reference they have values similar to the "before" lines. I would greatly appreciate your insights about the differences.

[Figure: line plot with a reference specified]

[Figure: line plot without a reference]

HelenaLC commented 2 months ago

Something looks off in the first plot. Could you provide the relevant code showing how you did the normalization using a fixed reference for the split batches?

jeffsun905 commented 2 months ago

Here is the code:

refsam <- "fcs/myfcs1.fcs"
# channelfile has three columns: "antigen", "fcs_colname", "marker_class"
# md has the columns: "file_name", "sample_id", "condition", "patient_id", "batch", "age", "sex"
sce.test5 <- prepData(c("fcs/myfcs1.fcs", "fcs/myfcs2.fcs", "fcs/myfcs3.fcs", "fcs/myfcs4.fcs", "fcs/myfcs5.fcs"), channelfile, md, transform = TRUE, truncate_max_range = FALSE)
# normalization with a reference FCS specified
mynorm.ref <- normCytof(sce.test5, beads = "dvs", k = 50, norm_to = refsam, assays = c("counts", "exprs"), overwrite = TRUE, plot = TRUE)
# normalization without a reference (all provided samples together)
mynorm.noref <- normCytof(sce.test5, beads = "dvs", k = 50, assays = c("counts", "exprs"), overwrite = TRUE, plot = TRUE)

The first plot is from the mynorm.ref call, with a reference FCS specified. The second is from the mynorm.noref call (we have done this before and it looks pretty normal).

Session info (only the relevant parts): R version 4.3.2 (2023-10-31), CATALYST_1.26.1

jeffsun905 commented 1 month ago

Are you able to replicate the issue? I suspect that when the reference is specified, the first figure may somehow have used transformed data. I looked a bit more at the function, and there are two lines of code that compute "bl", depending on whether a reference is provided. The scale of the two is very different (roughly 100-fold: about 2k vs. 20-something), which might explain the odd line plot. Does this affect the one-vs-one normalization suggested in the first response (https://github.com/prybakowska/CyTOF_analysis_Pipeline1/blob/master/pipeline.R), which seems to use the function as well? Thank you for looking into this!
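The scale argument above can be illustrated with toy numbers. This is a simplified, single-channel sketch of baseline-over-bead scaling and not CATALYST's actual code. All values and variable names are made up; only the rough 2k vs. 20 ratio is taken from the observation above.

```r
# Toy numbers only: none of these values come from real bead data.
# Smoothed bead intensities in three time windows, on the raw scale.
bead_raw <- c(1900, 2000, 2100)

# Baseline on the same scale as the beads (2000) ...
baseline_same <- mean(bead_raw)
# ... versus a hypothetical baseline on a ~100x smaller scale.
baseline_small <- 20

# Simplified correction factor per window: baseline / bead intensity.
slope_same  <- baseline_same / bead_raw
slope_small <- baseline_small / bead_raw

# Bead intensities after applying the correction.
after_same  <- bead_raw * slope_same    # pulled to 2000, same range as "before"
after_small <- bead_raw * slope_small   # pulled to 20, near 0 on a 0-2000 axis

print(rbind(before = bead_raw, after_same = after_same, after_small = after_small))
```

Under these assumptions, a baseline on a 100-fold smaller scale would pull every "after" line down to that scale, which would look like lines near 0 when plotted next to the "before" lines.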

HelenaLC / CATALYST

normCytof failed for large number of samples #403