BIMSBbioinfo / pigx_chipseq

Pipeline for Analysis of ChIP-Seq data
http://bioinformatics.mdc-berlin.de/pigx/
GNU General Public License v3.0

Do we need to keep the scorematrices? #67

Closed alexg9010 closed 6 years ago

alexg9010 commented 6 years ago

The knit report step requires far too much memory, because the RDS created at Extract_Signal_Annotation (https://github.com/BIMSBbioinfo/pigx_chipseq/blob/master/scripts/Extract_Signal_Annotation.R#L63) keeps the 11 scorematrixlist objects for every genomic annotation, with a scorematrix for every sample, in the lsml list, even though the profile signal is already summarized in the profiles tibble. The same lsml list is passed to Summarize_Data_For_Report, which leads to the large memory footprint.

In my example analysis I have 16 samples, and lstats$Extract_Signal_Annotation$lsml alone is 28G without actually being used.
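
A minimal sketch of how such a size check could be done in base R. The toy `lstats` list below only mimics the element names mentioned above; its contents and sizes are made up for illustration and are not the real pipeline output. The helper `element_sizes` is hypothetical.

```r
# Toy stand-in for the list stored in the RDS file (hypothetical contents;
# the real lsml holds ScoreMatrixList objects, here plain matrices are used).
lstats <- list(
  Extract_Signal_Annotation = list(
    lsml     = replicate(11, matrix(runif(1e4), nrow = 100), simplify = FALSE),
    profiles = data.frame(position = 1:100, signal = runif(100))
  )
)

# Hypothetical helper: in-memory size in bytes of each element, largest first.
element_sizes <- function(x) {
  sizes <- vapply(x, function(el) as.numeric(object.size(el)), numeric(1))
  sort(sizes, decreasing = TRUE)
}

sizes <- element_sizes(lstats$Extract_Signal_Annotation)
print(sizes)
```

On the real object, the same call on `lstats$Extract_Signal_Annotation` would show which element dominates the memory footprint.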

al2na commented 6 years ago

Can we store this in HDF or some other on-disk format? Maybe that's a to-do for genomation as well.

(Sent by email in reply to Alexander Gosdschan's message of Sat, Apr 7, 2018, 9:45 PM.)

frenkiboy commented 6 years ago

The score matrices are not needed - I included them because they are always nice to have for downstream analysis, and there was a plan to have a multi heat matrix in the report - but this does not scale with the number of experiments. It seems it's faster to calculate them when needed than to have them prepared, especially for a large number of samples. I think you can comfortably set the sml object to NULL, and everything should go much faster.
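
A rough sketch of what dropping the score matrices before saving could look like in base R. The `result` list is a toy stand-in with assumed element names (`lsml`, `profiles`) and a temporary output file; it is not the actual code in Extract_Signal_Annotation.R.

```r
# Toy stand-in for the object written to the RDS file (hypothetical contents).
result <- list(
  lsml     = replicate(11, matrix(runif(1e4), nrow = 100), simplify = FALSE),
  profiles = data.frame(position = 1:100, signal = runif(100))
)

size_before <- as.numeric(object.size(result))

# Assigning NULL to a list element removes that element from the list.
result$lsml <- NULL

size_after <- as.numeric(object.size(result))

# Save the slimmed-down object; a temporary file is used here for illustration.
outfile <- tempfile(fileext = ".rds")
saveRDS(result, outfile)

# Downstream steps would then load an object without the score matrices.
reloaded <- readRDS(outfile)
print(names(reloaded))
```

If downstream code expects the element to exist, `result["lsml"] <- list(NULL)` keeps the name with a NULL value instead of removing it.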

al2na commented 6 years ago

test