al2na / methylKit

R package for DNA methylation analysis
https://bioconductor.org/packages/release/bioc/html/methylKit.html
210 stars 96 forks source link

add 'chunk' option for getData #292

Open RJDan opened 1 year ago

RJDan commented 1 year ago

Hello I have had many issues trying to do custom analyses with the output of getData when the data sets are large.

I get a segmentation fault that does not necessarily go away with more RAM. For instance my LSF job scheduler reports that I only use 20 GB of ram when I get the Seg Fault error but I requested a total of 40 GB, so there is another 20 GB available and unused.

I want to request that getData function gets an option to read data in chunks, as the *DB functions currently have.

I know that I could change my code to read in the file in chunks (i.e. don't use getData) or split my files into chunks outside of R, but I think this would be super useful for many people who don't know how to do the more manual manipulation of files.

alexg9010 commented 1 year ago

Hi @RJDan,

Thank you for your bringing this back to our attention.

We already had a discussion about reading data in chunks, but back then we switched to reading the whole file because it was faster ( see #141 ). But now, with even more samples and larger tabix files, this seems less stable and possibly less reasonable.

Maybe, chunkwise reading will lower the memory pressure and prevent segmentation faults, but the whole purpose of getData is to read the whole dataset into memory, so it is questionable whether reading data in chunks will solve memory issues in the long run. Probably the best we can do here is to approximate possible memory consumption and warn/stop if this exceeds available memory.

I guess what you are looking for is some kind of streaming function to read a new chunk of data on every invocation, doing some calculations and repeating until the whole file is processed. We actually have this kind of functionality available internally to process tabix files in chunks, per chromosome or only for specific ranges. These functions are not exported and only available via the methylKit::: notations, but maybe they should be exposed and/or promoted more to allow for more customized analysis.

We are already using this functionality for memory efficient extraction of the percent methylation matrix or to perform coverage calculations prior to bigwig export.

RJDan commented 1 year ago

Thanks for the reply.

I would suggest making a more prominent statement on the website and manual about what does and does not work with 'chunks' for those who don't read things properly (such as myself).

Perhaps you could also make mention that if we encounter SegFault issues the first thing to check is that we are not using some function that reads everything into memory. I had a line in my script that created a non-DB object as an intermediate step and it took me a while to figure out what was causing the segfault. Maybe you could print to screen whether a function is reading by chunks or the whole file when the function is called (to make it easier to spot the issue).

Thanks for the pointers about the non-exposed functions.