GreenleafLab / chromVAR

chromatin Variability Across Regions (of the genome!)
https://greenleaflab.github.io/chromVAR/

Memory usage with large datasets #53

Open NicolasRichard44 opened 5 years ago

NicolasRichard44 commented 5 years ago

Hi,

I am trying to use the chromVAR package with a public 10x dataset. I want to use the fragments.tsv file, but it is very large and the import always runs out of memory. I am using R on a cluster with more than 28 cores and 100 GB of RAM, so I don't understand how I can import these kinds of large datasets. I have tried everything in bpparam. I have only succeeded in importing the matrix with the first 200000 rows, but the matrix has approximately 189 million rows in total, so I'm not even close to getting a look at my data.

I have also tried with the BAM file by changing the "CB" tag into "RG", but that doesn't work either; the BAM file is over 40 GB. The error message is either "cannot allocate vector of size ... Gb" or the R session simply aborts and leaves a core dump file. So is this package suitable for really large datasets? What kind of data was it designed to use?

Nicolas
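
For context, here is a minimal sketch of the import path being described (not from the thread; the file names are placeholders, and it assumes a BAM in which the 10x cell barcode has already been copied into the RG tag so that chromVAR can split reads by cell):

```r
library(chromVAR)
library(BiocParallel)

# Fewer workers can reduce peak memory use, since each worker
# holds its own copy of large objects.
register(MulticoreParam(4))

# Peaks called on the dataset (placeholder path)
peaks <- getPeaks("peaks.bed", sort_peaks = TRUE)

# BAM re-tagged so the cell barcode (CB) is stored in RG;
# by_rg = TRUE then produces one column per cell barcode
counts <- getCounts("retagged.bam",
                    peaks,
                    paired = TRUE,
                    by_rg  = TRUE,
                    format = "bam")
```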

AliciaSchep commented 5 years ago

Hi @NicolasRichard44, the package was written before the 10x method was released or 10x-scale data was available. I think some people have used it to analyze 10x data (see the #44 thread). How many peaks are you using? What import method are you using? There are likely some strategies that could help (e.g., pre-filtering reads to those in peaks before counting, using only high-confidence peaks, etc.). I'm also open to contributions to the package that help it scale better (e.g., using something like DelayedArray)!
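
One way to act on the high-confidence-peaks suggestion is to subset the peak set before counting. A hedged sketch, not part of chromVAR itself; the file name, column layout, and q-value cutoff are assumptions for a MACS2-style narrowPeak file:

```r
library(GenomicRanges)

# Read a narrowPeak file (ENCODE column layout)
np <- read.table("peaks.narrowPeak", sep = "\t",
                 col.names = c("chrom", "start", "end", "name", "score",
                               "strand", "signalValue", "pValue",
                               "qValue", "peak"))

# Keep only high-confidence peaks; qValue is -log10(q),
# so this keeps peaks with q < 0.01 (threshold is illustrative)
np <- np[np$qValue > 2, ]

# narrowPeak coordinates are 0-based, half-open
peaks <- GRanges(np$chrom, IRanges(np$start + 1, np$end))
```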

NicolasRichard44 commented 5 years ago

Hi @AliciaSchep,

Thanks for your reply. I just wanted to know whether I could make my data resemble the data the package was designed for. I understand that very large datasets can be a problem, especially in R.

I have actually found what was going wrong with the public 10k PBMC 10x dataset. The counting worked with the first 200000 rows of the fragments.tsv file, but it was already creating something like 24000 columns (one for each barcode). I wondered how that was possible, because there were supposedly only 8728 relevant barcodes after filtering. It turns out that fragments.tsv keeps all barcodes, even those with fewer than 1000 fragments (the threshold to call a barcode a single cell is around 3700 fragments). That amounts to more than 527000 unique barcodes, which is why a matrix with 527000 columns could not be built without running out of memory.

The solution is simple, and maybe everyone already knew it: I just merged the filtered barcodes data frame with the fragments.tsv data frame to keep only the fragments associated with relevant barcodes. The counting step still takes a lot of time, around two and a half hours, but it works, and all the downstream steps also work.

I would like to contribute to help it scale better, but unfortunately I don't have any experience programming in R at the moment. Thank you for sharing this very useful package; it will help me a lot in analyzing my 10x scATAC-seq dataset.

Nicolas
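
For anyone hitting the same issue, here is a minimal sketch of the barcode filtering described above. The file names are placeholders for the Cell Ranger ATAC outputs, and it assumes data.table is available (reading the gzipped fragments file with fread additionally requires the R.utils package):

```r
library(data.table)

# 10x fragments file: chrom, start, end, barcode, read support
fragments <- fread("fragments.tsv.gz",
                   col.names = c("chrom", "start", "end", "barcode", "count"))

# Barcodes that passed Cell Ranger's cell calling
cells <- fread("filtered_peak_bc_matrix/barcodes.tsv", header = FALSE)$V1

# Keep only fragments from called cells, shrinking ~527k barcodes
# down to the ~8.7k relevant ones before building the count matrix
fragments <- fragments[barcode %in% cells]

fwrite(fragments, "fragments_filtered.tsv",
       sep = "\t", col.names = FALSE)
```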