ocastells opened 4 days ago
Hi, My personal view on processing a large dataset is that we don't have enough time to process all of it. tSNE or UMAP take quite long to run, and even FlowSOM gets slow with millions of cells (and be careful with the "totalIter bug"). So, in the end, you will downsample your dataset to a reasonable size, let's say a few million cells. My view is therefore to downsample first in order to get a result using the CATALYST pipeline. Once we have clusters, if we really want to exploit the whole dataset, we can map each cell of every FCS file onto the cluster centers determined by FlowSOM (there is a function for that in FlowSOM), then extract the features (abundances, marker MFIs) and finally perform the statistical analysis. Let's see @HelenaLC's point of view... Best.
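For illustration, a minimal sketch of that downsample-then-cluster-then-map idea (I'm assuming the FlowSOM mapping function is NewData(); paths, channel indices and cluster numbers below are placeholders):

```r
library(flowCore)
library(FlowSOM)

files   <- list.files("fcs", pattern = "\\.fcs$", full.names = TRUE)
markers <- 9:20   # hypothetical channel indices to cluster on

## 1) train the SOM on a manageable (downsampled) subset
sub  <- read.FCS(files[1])
fsom <- FlowSOM(sub, colsToUse = markers, nClus = 20, seed = 1234)

## 2) map every cell of every FCS onto the trained SOM and extract features
abundances <- lapply(files, function(f) {
  mapped <- NewData(fsom, read.FCS(f))   # assign each cell to an existing node
  table(GetMetaclusters(mapped))         # per-file metacluster counts
})
```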
Hi Sam, Thanks for your reply! I agree that clustering too many cells doesn't add much extra value for cases where you have a lot of cells per file. In my case, because we tested several in vitro conditions per patient sample, I have 50-100 million cells per file, so unfortunately I cannot downsample. Best, Oriol
Hi there! To begin with, I am a bit "over-questioned", as we say in German. So just my thoughts bullet-style...
...that said, there were a couple of related issues in the past, specifically #272 comes to mind, where https://github.com/RGLab/ncdfFlow was suggested, which provides HDF5-based storage for cytometry data. Perhaps worth checking out, because in the long run memory + parallelization will run their course eventually as there's more and more data.
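A minimal sketch of the HDF5-backed reading it offers (file paths are placeholders, and I haven't run this on CyTOF data myself):

```r
library(ncdfFlow)

files <- list.files("fcs", pattern = "\\.fcs$", full.names = TRUE)
nfs   <- read.ncdfFlowSet(files)   # event data live on disk (HDF5), not in RAM
nfs[[1]]                           # individual frames are loaded on demand
```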
...#403 also mentions how data could be processed in batches. As far as I know, previous colleagues of mine have analyzed data over months, across 100s of patients, using standardized workflows where new samples are added and analyzed periodically.
...also, if you check out flowCore's docs, even if any one of your files contains millions of cells, you can still downsample there via which.lines: "Numeric vector to specify the indices of the lines to be read. If NULL all the records are read, if of length 1, a random sample of the size indicated by which.lines is read in."
...one last thought: I have never tried this with CyTOF data, but it should certainly be possible to loop through files/subsets thereof and write them out as a delayed SingleCellExperiment, e.g., using HDF5Array. That is what I do when things get too big (read the raw stuff, write it out to .h5, and work from there using DelayedMatrix). In any case, FlowSOM and other methods might struggle, so much of the analysis (beyond reading) for this many cells needs consideration.
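To make the .h5 idea a bit more concrete, a rough sketch of the loop-and-write-out pattern (this is not CATALYST's prepData(); file names and the markers-by-cells orientation are my own choices here):

```r
library(flowCore)
library(HDF5Array)
library(SingleCellExperiment)

files <- list.files("fcs", pattern = "\\.fcs$", full.names = TRUE)

## write each file's expression matrix to its own .h5 file,
## keeping only a DelayedMatrix handle in memory
mats <- lapply(seq_along(files), function(i) {
  m <- t(exprs(read.FCS(files[i])))   # markers x cells
  writeHDF5Array(m, filepath = sprintf("batch%03d.h5", i), name = "exprs")
})

## delayed cbind -- nothing is realized in RAM
sce <- SingleCellExperiment(assays = list(exprs = do.call(cbind, mats)))
```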
Hope any of this is useful, and sorry there's no easy answer here... My feeling is that this is largely because tools developed back in the day didn't have current data sizes in mind. Similarly, the scRNA-seq field has changed a lot, and file formats and analysis tools are continuously catching up with data demands. Your data sounds like it's pretty high up there.
Hi Helena, Thank you so much for your feedback here! There is no easy answer when the number of files escalates so much, but you brought up very interesting threads to look into.
I think the DelayedMatrix is a nice way to move forward; I will investigate HDF5Array (ncdfFlow may be the most straightforward way to do so). Efficient use of memory was the main purpose of my issue here, so thanks again!
Hi,
which.lines is not the way to go, as the docs warn: "Be aware the potential slow read (especially for the large size of random sampling) due to the frequent disk seek operations." It's nice when the FCS file does not fit in memory, which has never been the case IMHO. I prefer to read the whole FCS into a flowFrame, downsample, and then remove the flowFrame.
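That is, something like (sample size arbitrary):

```r
library(flowCore)

ff  <- read.FCS("sample.fcs")          # whole file in RAM once
sub <- ff[sample(nrow(ff), 1e5), ]     # keep a random 100,000 events
rm(ff); gc()                           # free the full flowFrame
```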
ncdf/HDF5 is nice for handling/accessing objects bigger than physical memory (RAM), but I don't see how to achieve an efficient SOM or UMAP without the actual dataset in RAM.
IMHO, you'd better think about the next step first: how many cells will you put into the SOM and UMAP steps?
Hi,
I am dealing with a large CyTOF dataset (>100 million cells in well over 1500 files) that requires a lot of computational power to process. To cope with such a dataset, I am trying to parallelise the prepData() function. Has this been done before? Maybe it could be of help for others, as datasets are just growing bigger these days.
I am working in a Linux terminal with 252 GB RAM and 32 CPUs, and I managed to improve the reading of my .fcs files by doing it in batches, building a flowset_list as output (code below). I am not a bioinformatician, so please feel free to pitch in with useful feedback and good coding practices I may be missing.
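Roughly, the batched reading follows this pattern (a sketch rather than the exact code; batch size, worker count and paths are placeholders):

```r
library(flowCore)
library(BiocParallel)

files   <- list.files("fcs", pattern = "\\.fcs$", full.names = TRUE)
batches <- split(files, ceiling(seq_along(files) / 50))   # e.g., 50 files per batch

flowset_list <- bplapply(batches, function(b) {
  read.flowSet(files = b, transformation = FALSE, truncate_max_range = FALSE)
}, BPPARAM = MulticoreParam(workers = 16))
```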
The flowset_list then goes into the modified NewPrepData(flowset_list, meta, md_col...), which keeps the same structure as the original prepData() but tries to speed up the generation of the exprs matrix:
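In sketch form (not the exact code; worker count is arbitrary), the exprs matrix is assembled from the list like this:

```r
library(flowCore)
library(BiocParallel)

## collapse each flowSet to one cells x channels matrix in parallel,
## then row-bind everything into a single exprs matrix
exprs_list <- bplapply(flowset_list, function(fs) {
  fsApply(fs, exprs)
}, BPPARAM = MulticoreParam(workers = 16))

big_exprs <- do.call(rbind, exprs_list)   # for >100M cells, an HDF5-backed
                                          # matrix may be a safer target
```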
This code has worked nicely for 40 million cells (a downsampled test to debug the code), but for well over 100 million cells and more files I am not sure this is the most efficient way of using memory and CPUs (the process gets slower and starts using swap memory on the Linux machine). Do any of you have ideas to improve the efficiency of the current code, or in general to optimise prepData() for bigger datasets?