ocastells opened 4 days ago
Hi, My personal view on processing a large dataset is that we don't have enough time to process all of it. tSNE or UMAP take quite long to run, and even FlowSOM gets slow with millions of cells (and be careful with the "totalIter bug"). So, in the end, you will downsample your dataset to a reasonable size, let's say a few million cells. My view is therefore to downsample first in order to get a result using the CATALYST pipeline. Once we have clusters, if we really want to exploit the whole dataset, we can map each cell of every FCS file onto the cluster centers determined by FlowSOM (there is a function for that in FlowSOM), then extract the features (abundances, marker MFIs) and finally perform the statistical analysis. Let's see @HelenaLC's point of view... Best.
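For illustration, a minimal sketch of that downsample-then-cluster-then-map idea (I'm assuming the FlowSOM mapping function is NewData(); paths, channel indices and cluster numbers below are placeholders):

```r
library(flowCore)
library(FlowSOM)

files   <- list.files("fcs", pattern = "\\.fcs$", full.names = TRUE)
markers <- 9:20   # hypothetical channel indices to cluster on

## 1) train the SOM on a manageable (downsampled) subset
sub  <- read.FCS(files[1])
fsom <- FlowSOM(sub, colsToUse = markers, nClus = 20, seed = 1234)

## 2) map every cell of every FCS onto the trained SOM and extract features
abundances <- lapply(files, function(f) {
  mapped <- NewData(fsom, read.FCS(f))   # assign each cell to an existing node
  table(GetMetaclusters(mapped))         # per-file metacluster counts
})
```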
Hi Sam, Thanks for your reply! I agree that clustering too many cells doesn't add much extra value for cases where you have a lot of cells per file. In my case, because we tested several in vitro conditions per patient sample, I have 50-100 million cells per file, so unfortunately I cannot downsample. Best, Oriol
Hi there! To begin with, I am a bit "over-questioned", as we say in German. So just my thoughts bullet-style...
...that said, there were a couple of related issues in the past, specifically #272 comes to mind, where https://github.com/RGLab/ncdfFlow was suggested, which provides HDF5-based storage for cytometry data. Perhaps worth checking out, because in the long run memory + parallelization will run their course eventually as there's more and more data.
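A minimal sketch of the HDF5-backed reading it offers (file paths are placeholders, and I haven't run this on CyTOF data myself):

```r
library(ncdfFlow)

files <- list.files("fcs", pattern = "\\.fcs$", full.names = TRUE)
nfs   <- read.ncdfFlowSet(files)   # event data live on disk (HDF5), not in RAM
nfs[[1]]                           # individual frames are loaded on demand
```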
...#403 also mentions how data could be processed in batches. As far as I know, previous colleagues of mine have analyzed data over months, across 100s of patients, using standardized workflows where new samples are added and analyzed periodically.
...also, if you check out flowCore's docs, even if any one of your files contains millions of cells, you can still downsample there via which.lines: "Numeric vector to specify the indices of the lines to be read. If NULL all the records are read, if of length 1, a random sample of the size indicated by which.lines is read in."
...one last thought: I have never tried this with CyTOF data, but it should certainly be possible to loop through files/subsets thereof and write them out as a delayed SingleCellExperiment, e.g., using HDF5Array. That is what I do when things get too big (read the raw stuff, write it out to .h5, and work from there using DelayedMatrix). In any case, FlowSOM and other methods might struggle, so much of the analysis (beyond reading) for this many cells needs consideration.
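To make the .h5 idea a bit more concrete, a rough sketch of the loop-and-write-out pattern (this is not CATALYST's prepData(); file names and the markers-by-cells orientation are my own choices here):

```r
library(flowCore)
library(HDF5Array)
library(SingleCellExperiment)

files <- list.files("fcs", pattern = "\\.fcs$", full.names = TRUE)

## write each file's expression matrix to its own .h5 file,
## keeping only a DelayedMatrix handle in memory
mats <- lapply(seq_along(files), function(i) {
  m <- t(exprs(read.FCS(files[i])))   # markers x cells
  writeHDF5Array(m, filepath = sprintf("batch%03d.h5", i), name = "exprs")
})

## delayed cbind -- nothing is realized in RAM
sce <- SingleCellExperiment(assays = list(exprs = do.call(cbind, mats)))
```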
Hope any of this is useful, and sorry there's no easy answer here... My feeling is that this is largely because tools developed back in the day didn't have current data sizes in mind. Similarly, the scRNA-seq field has changed a lot, and file formats and analysis tools are continuously catching up with data demands. Your data sounds like it's pretty high up there.
Hi Helena, Thank you so much for your feedback here! There is no easy answer when the number of files escalates so much, but you brought up very interesting threads to look into.
I think the DelayedMatrix is a nice way to move forward; I will investigate HDF5Array (ncdfFlow may be the most straightforward way to do so). Efficient use of memory was the main purpose of my issue here, so thanks again!
Hi,
which.lines is not the way to go, as the docs warn: "Be aware the potential slow read (especially for the large size of random sampling) due to the frequent disk seek operations." It's nice when the FCS file does not fit in memory, which has never been the case IMHO. I prefer to read the whole FCS into a flowFrame, downsample, and then remove the flowFrame.
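That is, something like (sample size arbitrary):

```r
library(flowCore)

ff  <- read.FCS("sample.fcs")          # whole file in RAM once
sub <- ff[sample(nrow(ff), 1e5), ]     # keep a random 100,000 events
rm(ff); gc()                           # free the full flowFrame
```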
ncdf/HDF5 is nice for handling/accessing objects bigger than physical memory (RAM), but I don't see how to achieve an efficient SOM or UMAP without the actual dataset in RAM.
IMHO, you'd better think about the next step first: how many cells will you put into the SOM and UMAP steps?
Hi,
I am dealing with a large CyTOF dataset (>100 million cells in well over 1500 files) that requires a lot of computational power to process. To cope with such a dataset, I am trying to parallelise the prepData() function. Has this been done before? Maybe it could be of help for others, as datasets are just growing bigger these days.
I am working in a Linux terminal with 252 GB RAM and 32 CPUs, and I managed to improve the reading of my .fcs files by doing it in batches, building a flowset_list as output (code below). I am not a bioinformatician, so please feel free to pitch in with useful feedback and good coding practices I may be missing.
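Roughly, the batched reading follows this pattern (a sketch rather than the exact code; batch size, worker count and paths are placeholders):

```r
library(flowCore)
library(BiocParallel)

files   <- list.files("fcs", pattern = "\\.fcs$", full.names = TRUE)
batches <- split(files, ceiling(seq_along(files) / 50))   # e.g., 50 files per batch

flowset_list <- bplapply(batches, function(b) {
  read.flowSet(files = b, transformation = FALSE, truncate_max_range = FALSE)
}, BPPARAM = MulticoreParam(workers = 16))
```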
The flowset_list then goes into the modified NewPrepData(flowset_list, meta, md_col...), which keeps the same structure as the original prepData() but tries to speed up the generation of the exprs matrix:
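In sketch form (not the exact code; worker count is arbitrary), the exprs matrix is assembled from the list like this:

```r
library(flowCore)
library(BiocParallel)

## collapse each flowSet to one cells x channels matrix in parallel,
## then row-bind everything into a single exprs matrix
exprs_list <- bplapply(flowset_list, function(fs) {
  fsApply(fs, exprs)
}, BPPARAM = MulticoreParam(workers = 16))

big_exprs <- do.call(rbind, exprs_list)   # for >100M cells, an HDF5-backed
                                          # matrix may be a safer target
```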
This code has worked nicely for 40 million cells (a downsampled test to debug the code), but for well over 100 million cells and more files I am not sure this is the most efficient way of using memory and CPUs (the process gets slower and starts using swap memory on the Linux machine). Do any of you have ideas to improve the efficiency of the current code, or in general to optimise prepData() for bigger datasets?