HelenaLC / CATALYST

Cytometry dATa anALYsis Tools

normCytof failed for large number of samples #403

Open jeffsun905 opened 3 weeks ago

jeffsun905 commented 3 weeks ago

Hi, we are trying to process over 150 samples, and it failed at normCytof (a critical step for making the samples comparable, as the data came from multiple batches) with a segmentation error. Our guess is that it ran out of memory or something related. Is there a limit on how many samples CATALYST can handle, and do you have any suggestions for running a large project like this?

SamGG commented 3 weeks ago

You are probably trying to normalize the whole set at once. If you choose a reference FCS, the process only has to normalize each file against that reference, meaning that only 2 FCS files are loaded in memory at the same time. The vignette only demonstrates how to use the functions. Instead of starting from scratch (or from the vignette), you'll save a lot of energy by examining the (perhaps too many) published workflows, such as https://github.com/prybakowska/CyTOF_analysis_Pipeline1/blob/master/pipeline.R at line 16 (and associated functions) or https://github.com/prybakowska/CytoQP/blob/master/CytoQP_script.R. Alternatively, you may use the nearly original bead normalization given at https://biosurf.org/cytof_data_scientist.html#313_Performing_bead_normalization. HTH

jeffsun905 commented 3 weeks ago

Thank you very much for the hints. Yes, I was trying to do that. The function worked fine for ~50 samples, so I thought it would be fine, as our system has ~1 TB of memory. While working on the suggested pipeline with one-vs-one normalization, I also tested splitting the big data set into small batches and providing one batch at a time to the normCytof function with a common reference sample specified. To my surprise, the line plot is very different from the one produced without specifying a reference (i.e., all provided samples together; here I tested just 5 samples): with a reference, the "after" lines are all close to 0, while without one they have values similar to the "before" lines. I would greatly appreciate your insights about the differences.

Line plot with a reference specified: image

Line plot without reference: image
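For readers following along, the batch-wise approach described above could be sketched roughly as below. This is only an illustration under stated assumptions: `split_into_batches` and `normalize_batch` are hypothetical helper names, the batch size is arbitrary, and the metadata table is assumed to have a `file_name` column holding FCS base names. The `prepData` and `normCytof` arguments simply mirror the ones used in this thread.

```r
# Hypothetical helper: split a vector of FCS paths into chunks of `size` files.
split_into_batches <- function(files, size) {
  stopifnot(size >= 1)
  unname(split(files, ceiling(seq_along(files) / size)))
}

# Hypothetical wrapper: normalize one batch against a shared reference FCS.
# `panel` and `md` are the channel and metadata tables given to prepData();
# `md` is assumed to contain a "file_name" column with FCS base names.
normalize_batch <- function(batch_files, ref_fcs, panel, md) {
  # Keep only the metadata rows that belong to this batch
  md_batch <- md[md$file_name %in% basename(batch_files), , drop = FALSE]
  sce <- CATALYST::prepData(
    batch_files, panel, md_batch,
    transform = TRUE, truncate_max_range = FALSE
  )
  # Same settings as in this thread; see ?normCytof for the returned value
  CATALYST::normCytof(
    sce, beads = "dvs", k = 50, norm_to = ref_fcs,
    assays = c("counts", "exprs"), overwrite = TRUE, plot = TRUE
  )
}

# Example usage (paths are placeholders, not run here):
# files   <- list.files("fcs", pattern = "\\.fcs$", full.names = TRUE)
# batches <- split_into_batches(files, size = 10)
# results <- lapply(batches, normalize_batch,
#                   ref_fcs = "fcs/myfcs1.fcs", panel = channelfile, md = md)
```

Only one batch is held in memory per call, which is the point of the split. Given the difference between the two plots reported here, the "after" lines from each batch are worth inspecting before trusting the output.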

HelenaLC commented 3 weeks ago

Something looks off in the first plot. Could you provide the relevant code showing how you did the normalization using a fixed reference for split batches?

jeffsun905 commented 3 weeks ago

Here is the code:

  1. refsam <- "fcs/myfcs1.fcs"
  2. sce.test5 <- prepData(c("fcs/myfcs1.fcs", "fcs/myfcs2.fcs", "fcs/myfcs3.fcs", "fcs/myfcs4.fcs", "fcs/myfcs5.fcs"), channelfile, md, transform = TRUE, truncate_max_range = FALSE) # channelfile has three columns: "antigen", "fcs_colname", "marker_class"; md has "file_name", "sample_id", "condition", "patient_id", "batch", "age", "sex"
  3. mynorm.ref <- normCytof(sce.test5, beads = "dvs", k = 50, norm_to = refsam, assays = c("counts", "exprs"), overwrite = TRUE, plot = TRUE)
  4. mynorm.noref <- normCytof(sce.test5, beads = "dvs", k = 50, assays = c("counts", "exprs"), overwrite = TRUE, plot = TRUE)

The first plot is from line 3, with a reference FCS specified. The second is from line 4 (we have done this before and it looks pretty normal).

Session info (only the relevant parts): R version 4.3.2 (2023-10-31), CATALYST_1.26.1

jeffsun905 commented 2 weeks ago

Are you able to replicate the issue? I suspect that when a reference is specified, the first figure may somehow have used transformed data. I looked a bit more at the function, and there are two lines of code that compute "bl", depending on whether a reference is provided. The scale of the two is very different (roughly 100-fold: about 2k vs. 20-something), which might explain the odd line plot. Does this affect the one-vs-one normalization suggested in the first response (https://github.com/prybakowska/CyTOF_analysis_Pipeline1/blob/master/pipeline.R), which seems to use the function as well? Thank you for looking into this!