HRGV / phyloFlash

phyloFlash - A pipeline to rapidly reconstruct the SSU rRNAs and explore the phylogenetic composition of an Illumina (meta)genomic dataset.
GNU General Public License v3.0

killed while loading SAM into Memory #158

Open Leo-Sprengel opened 2 years ago

Leo-Sprengel commented 2 years ago

Hi, I'm trying to run the phyloFlash.pl script on single-end data with a read length of 75 bp.

The provided test case for checking the installation runs without issues, and -env test confirms the setup.

System specs: 16 GiB system memory, AMD Ryzen 5 5500U with Radeon

phyloFlash.pl -lib run01 -read1 MP1_S30_R1_001.fastq.gz -readlength 75
This is phyloFlash v3.4

[21:44:58] Using dbhome '/home/leo/138.1'
[21:44:58] working on library run01
[21:44:58] Forward reads MP1_S30_R1_001.fastq.gz
[21:44:58] Running in single ended mode
[21:44:58] Current operating system linux
[21:44:58] Checking for required tools.
[21:44:58] Using nhmmer found at
       "/home/leo/anaconda3/envs/pf/lib/phyloFlash/barrnap-HGV/binaries/linux/nhmmer".
[21:44:58] Using grep found at "/usr/bin/grep".
[21:44:58] Using mafft found at "/home/leo/anaconda3/envs/pf/bin/mafft".
[21:44:58] Using barrnap found at
       "/home/leo/anaconda3/envs/pf/lib/phyloFlash/barrnap-HGV/bin/barrnap_HGV".
[21:44:58] Using fastaFromBed found at
       "/home/leo/anaconda3/envs/pf/bin/fastaFromBed".
[21:44:58] Using plotscript_SVG found at
       "/home/leo/anaconda3/envs/pf/lib/phyloFlash/phyloFlash_plotscript_svg.pl".
[21:44:58] Using spades found at
       "/home/leo/anaconda3/envs/pf/bin/spades.py".
[21:44:58] Using cat found at "/usr/bin/cat".
[21:44:58] Using sed found at "/home/leo/anaconda3/envs/pf/bin/sed".
[21:44:58] Using bbmap found at "/home/leo/anaconda3/envs/pf/bin/bbmap.sh".
[21:44:58] Using vsearch found at
       "/home/leo/anaconda3/envs/pf/bin/vsearch".
[21:44:58] Using awk found at "/usr/bin/awk".
[21:44:58] Using reformat found at
       "/home/leo/anaconda3/envs/pf/bin/reformat.sh".
[21:44:58] All required tools found.
[21:44:58] filtering reads with SSU db using minimum identity of 70%
[21:44:58] running subcommand:
       /home/leo/anaconda3/envs/pf/bin/bbmap.sh fast=t minidentity=0.7
       -Xmx10g reads=-1 threads=12 po=f outputunmapped=f
       path=/home/leo/138.1 out=run01.bbmap.sam
       outm=run01.MP1_S30_R1_001.fastq.gz.SSU.1.fq noheader=t
       ambiguous=all build=1 in=MP1_S30_R1_001.fastq.gz
       bhist=run01.basecompositionhistogram ihist=run01.inserthistogram
       idhist=run01.idhistogram scafstats=run01.hitstats overwrite=t
       2>run01.bbmap.out
[22:14:10] done...
[22:14:10] Reading SAM file run01.bbmap.sam into memory
Killed

dmesg shows that the OOM killer terminated it:

[Sa Jan 15 22:14:40 2022] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service,task=perl,pid=57464,uid=1000
[Sa Jan 15 22:14:40 2022] Out of memory: Killed process 57464 (perl) total-vm:10282140kB, anon-rss:10259012kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:20148kB oom_score_adj:0
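
To put the kernel's numbers in perspective, here is a minimal sketch that converts the kB fields of the OOM line above into GiB. It assumes that the kernel's "kB" means 1024 bytes; the function name `oom_fields_gib` is made up for this illustration and is not part of phyloFlash.

```python
import re

# The OOM-killer line from the dmesg output above, copied verbatim.
OOM_LINE = (
    "Out of memory: Killed process 57464 (perl) total-vm:10282140kB, "
    "anon-rss:10259012kB, file-rss:0kB, shmem-rss:0kB, UID:1000 "
    "pgtables:20148kB oom_score_adj:0"
)


def oom_fields_gib(line):
    """Return every '<name>:<number>kB' field of an OOM line, converted to GiB.

    Assumption: the kernel reports kB in units of 1024 bytes.
    """
    fields = re.findall(r"([\w-]+):(\d+)kB", line)
    return {name: int(kb) / 1024**2 for name, kb in fields}


usage = oom_fields_gib(OOM_LINE)
for name, gib in usage.items():
    print(f"{name}: {gib:.2f} GiB")
```

By this conversion the perl process alone held a bit under 10 GiB of anonymous memory when it was killed, on a machine with 16 GiB in total.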

My read file is 242.4 MB. The SAM file that should be opened is 2.7 GB. Is the size of the SAM file the issue here? In phyloFlash.pl, at line 1133, I found a function called "open_or_die". Does this cause the kill? (I'm not familiar with Perl.)

Does anyone have recommendations on what to do?
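
One way to check whether the SAM file really contains an unusually large number of hits is to count its records without loading the file into memory. The following is only a sketch under stated assumptions: it is not how phyloFlash reads the file, the function name `summarize_sam` is hypothetical, and it assumes a tab-separated SAM file with the usual 11 mandatory columns (the bbmap call above uses noheader=t, so header lines are skipped only as a precaution).

```python
from collections import Counter


def summarize_sam(path):
    """Stream a SAM file line by line and tally alignment records per reference.

    Memory use grows with the number of distinct reference names,
    not with the number of records in the file.
    """
    records = 0
    per_reference = Counter()
    with open(path, "r") as handle:
        for line in handle:
            if line.startswith("@"):  # header line, if any
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 11:  # not a complete SAM record
                continue
            records += 1
            rname = fields[2]  # column 3: reference name, "*" if unmapped
            if rname != "*":
                per_reference[rname] += 1
    return records, per_reference


# Example call (file name taken from the log above):
# records, per_reference = summarize_sam("run01.bbmap.sam")
# print(records, per_reference.most_common(10))
```

Because the command uses ambiguous=all, a single read may appear in several records, so the record count can be much larger than the number of reads.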

Leo-Sprengel commented 2 years ago

I was able to fix the issue by setting the -id parameter to 90%. My guess is that 70% identity was too unspecific and therefore produced too many hits. Does this make sense?

HRGV commented 2 years ago

I think you are correct. With your high number of input reads, the number of reads recruited at 70% ID is too high for EMIRGE to handle. With 90% you still cover much of the sequence space that EMIRGE works with, so that is a very good solution.