cmayer / MitoGeneExtractor

The MitoGeneExtractor can be used to extract protein coding mitochondrial genes, such as COI and others from short and long read sequencing libraries.
GNU Affero General Public License v3.0

MGE memory-related errors - 'Aborted (core dumped)' & 'Segmentation fault (core dumped)' #14

Open · SchistoDan opened this issue 1 month ago

SchistoDan commented 1 month ago

Hi,

Firstly, thanks for developing a brilliant and easy to use tool! Apologies in advance for the information dump.

I've been developing a Snakemake pipeline (initially based on your example) to process genome skims from thousands of museum specimens. Currently we're running the pipeline on a benchmarking dataset of 570 samples. I've set it up so that MGE uses a sample-specific protein reference for each sample and runs with multiple combinations of the 'r' and 's' parameters, which can result in MGE producing 3,420 consensus (and associated) files.
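
For context, here is a minimal sketch of how that sweep multiplies out. The sample IDs, paths and the particular r/s values are placeholders (only the product of 6 combinations is implied by the numbers above), and the shell action is a stand-in rather than the real MitoGeneExtractor call:

```python
# 570 samples x 6 hypothetical r/s combinations = 3,420 consensus files,
# each produced by one MGE job.
SAMPLES  = [f"sample{i:03d}" for i in range(570)]   # placeholder sample IDs
R_VALUES = [1.0, 1.3, 1.5]                          # hypothetical 'r' settings
S_VALUES = [50, 100]                                # hypothetical 's' settings

rule all:
    input:
        expand("results/{sample}/{sample}_r{r}_s{s}_consensus.fasta",
               sample=SAMPLES, r=R_VALUES, s=S_VALUES)

rule mge:
    input:
        reads="trimmed/{sample}_concat.fastq",       # concatenated, trimmed PE reads
        ref="references/{sample}_protein.fasta"      # sample-specific protein reference
    output:
        "results/{sample}/{sample}_r{r}_s{s}_consensus.fasta"
    shell:
        # Stand-in action; the real rule calls MitoGeneExtractor with the
        # per-sample reference and the r/s values taken from the wildcards.
        "touch {output}"
```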

As these are museum specimens, we were also interested in screening for possible 'contaminant' reads. Since MGE can run on a multi-fasta and reads will map to the 'closest' reference, we've been trying to provide MGE with a multi-fasta for each sample that contains the sample-specific reference plus 14 common contaminant references (fungi, human, etc.). When running MGE on the 570 samples with even a single parameter combination, this can result in 8,550 MGE runs/jobs (570 × 15)!
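
A sketch of how that reference set-up is assembled per sample (file names are hypothetical, not the actual pipeline paths):

```python
# Each sample's reference multi-fasta = its own protein reference plus a fixed
# panel of 14 contaminant references (fungi, human, etc.), i.e. 15 references
# per sample and 570 x 15 = 8,550 reference/sample combinations for MGE.
rule build_reference_multifasta:
    input:
        sample_ref="references/{sample}_protein.fasta",
        contaminant_panel="references/contaminant_panel.fasta"
    output:
        "references/{sample}_plus_contaminants.fasta"
    shell:
        "cat {input.sample_ref} {input.contaminant_panel} > {output}"
```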

When running this version of the pipeline, the MGE step consistently crashes with 'Aborted' and 'Segmentation fault' reported in the Snakemake log (e.g. line 4,014 for a Segmentation fault in MGE-standard_r1s50_contam.txt). The crash happens at different points in the run (it isn't always the first MGE job). I can 'resume' the Snakemake run post-crash using '--rerun-incomplete', but the same sample will consistently cause the crash within each run, although in an identical run (in a different directory) a different sample will cause the crash, which implies it's likely not a sample-specific issue. The input files into MGE (concatenated and trimmed PE fastq files) vary between ~50 MB and 15 GB (most are 1-3 GB), but it's not always the MGE jobs using the larger files that crash.

I'm running the pipeline on an HPC node with 192 CPUs and 2 TB RAM available. I've tried requesting more or fewer CPUs and over 1 TB of RAM for the run, but it doesn't seem to affect whether the run crashes. Our system administrator seems to think it's not an out-of-memory issue on our end.

When looking at the MGE logs, some of the jobs that report 'Aborted' and 'Segmentation fault' also output 'munmap_chunk(): invalid pointer', 'corrupted size vs. prev_size while consolidating' or 'realloc(): invalid old size', which I believe are C/C++ memory-management (heap corruption) errors, but I'm not familiar with C/C++.

Interestingly, when the MGE step crashes on a particular sample, all of the alignment and consensus files for the 'contaminant' references are produced for that sample (or at least aren't deleted), but the files for the target reference are not.

Any advice on how to overcome these issues would be greatly appreciated as we intend to scale up our analysis further (>10,000 samples). Just let me know if you'd like any further information or files from me!

Many thanks, Dan

cmayer commented 1 month ago

Dear Dan,

Many thanks for this detailed description. Did you find out whether it crashed in MGE or in exonerate? Normally, MGE catches exonerate crashes. In our tests, crashes occurred only in exonerate and, as far as I can recall, never when there was sufficient memory. There is one exception: I recently found out that either exonerate or MGE crashes on long read data from PacBio. I have to admit I did not have the time to investigate this further. Are you using any kind of long read data?

Would it be possible for you to provide me with some of the data, so that I can test my program on it here? I guarantee to keep the data confidential.

Best wishes Christoph

SchistoDan commented 1 month ago

Dear Christoph,

It's not always easy to tell exactly where it crashed, because if an MGE job fails for a particular sample, all of its files (other than the .out and .err log files) are deleted by Snakemake.

Those memory allocation/C-related errors are sometimes written to the MGE .err log file (e.g. one containing "NOTE: Exonerate hit skipped due to low relative alignment score: ..."), but in many cases nothing is output other than the MGE rule error in the Snakemake log (such as in the log I shared previously).
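
For what it's worth, here is a minimal sketch (paths and the command are placeholders, not our actual rule) of why only the .out/.err files survive: Snakemake deletes the output files of a failed job but keeps anything declared under `log:`. The `ulimit` line is only an idea for narrowing down whether MGE itself or exonerate is the process that crashes:

```python
# Output files of a failed job are removed by Snakemake, but files declared
# under 'log:' are kept, which is why only the .out/.err files survive a crash.
rule mge:
    output:
        "results/{sample}_consensus.fasta"
    log:
        out="logs/mge/{sample}.out",
        err="logs/mge/{sample}.err"
    shell:
        # Stand-in for the real MGE call. 'ulimit -c unlimited' enables core
        # dumps (if the cluster permits), so a backtrace could show whether the
        # MGE binary or exonerate crashed; stdout/stderr are captured so the
        # Exonerate notes and any glibc abort message are preserved.
        "ulimit -c unlimited; touch {output} > {log.out} 2> {log.err}"

# Running snakemake with '--keep-incomplete' would additionally keep the
# partial output files of a crashed job for inspection.
```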

As I'm already specifying >1 TB of memory for a run, maybe the only solution is to reduce the number of samples per run in order to limit the amount of memory Exonerate takes?
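
For illustration, a minimal sketch of how Snakemake's resource accounting could bound how many Exonerate-spawning jobs run at once without reducing the sample count; all values are placeholders, not what our pipeline sets:

```python
# A per-job memory declaration plus a run-level budget limits how many MGE
# (and therefore Exonerate) processes run at the same time.
rule mge:
    output:
        "results/{sample}_consensus.fasta"
    resources:
        mem_mb=32000            # hypothetical per-job request
    shell:
        "touch {output}"        # stand-in for the real MGE call

# Run-level budget (shell command, not part of the Snakefile):
#   snakemake --cores 192 --resources mem_mb=1000000
# With these numbers, at most 31 MGE jobs would be scheduled concurrently.
```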

All our data is PE Illumina data. As all our data is/will be public, I'm happy to share it with you. You will be able to download 190 samples of our benchmarking data here: https://drive.google.com/drive/folders/15Nt54mAxbZObBTps452oqUEDTW7J5OYr

I've also provided the Snakefile and all the other files needed to run the pipeline (mge_snakemake_pipeline.tar.gz), in case they're useful for trying to recreate the errors; you'll need to edit the paths in the config, samples.csv, and protein_references.csv, however.

Interestingly, it seems I can run multiple parameter combinations using one reference per sample, which runs MGE several thousand times, but when I run default parameters with multiple references per sample, MGE/Exonerate can crash. Is there a difference in the way MGE deals with multiple parameter combinations vs. a multi-fasta of several references per sample (e.g. are multiple parameter combinations run sequentially, whereas multiple references are processed in parallel, thereby overwhelming MGE/Exonerate/memory)?

Many thanks for the help. I look forward to hearing whether the error(s) are reproduced on your side.

Best, Dan