alekseyzimin / masurca


memory usage #41

Open kavonrtep opened 6 years ago

kavonrtep commented 6 years ago

Hi, I am using the MaSuRCA assembler (v3.2.6) to assemble a plant with a 1 Gbp genome, and I observe quite high memory usage in assembly.sh at this line:

create_k_unitigs_large_k -c $(($KMER-1)) -t 32 -m $KMER -n $(($ESTIMATED_GENOME_SIZE*2)) -l $KMER -f `perl -e 'print 1/'$KMER'/1e5'` pe.cor.fa | grep --text -v '^>' | perl -ane '{$seq=$F[0]; $F[0]=~tr/ACTGactg/TGACtgac/;$revseq=reverse($F[0]); $h{($seq ge $revseq)?$seq:$revseq}=1;}END{$n=0;foreach $k(keys %h){print ">",$n++," length:",length($k),"\n$k\n"}}' > guillaumeKUnitigsAtLeast32bases_all.fasta.tmp && mv guillaumeKUnitigsAtLeast32bases_all.fasta.tmp guillaumeKUnitigsAtLeast32bases_all.fasta

Specifically, the second perl command:

perl -ane '{$seq=$F[0]; $F[0]=~tr/ACTGactg/TGACtgac/;$revseq=reverse($F[0....

is using over 750 GB of RAM. Is this memory usage normal for this step? Do I need a server with more RAM, or is there a way to decrease memory usage? Best regards, Petr
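
For readers following along, the second Perl stage keeps, for every input line, the lexicographically larger of the sequence and its reverse complement, and stores it as a key in the hash %h, so its memory grows with the number and length of distinct sequences. Below is a minimal sketch of the same computation using awk and sort. The function name canonical_unique is hypothetical and not part of MaSuRCA, and the sketch swaps the in-memory hash for `sort -u` (GNU sort can spill to temporary files), so output order differs from the original and the FASTA headers are not generated.

```shell
# Hypothetical helper, not part of MaSuRCA.
# Reads one sequence per line on stdin (FASTA headers already removed),
# prints the lexicographically larger of each sequence and its reverse
# complement, then removes duplicates with sort instead of a Perl hash.
canonical_unique() {
  LC_ALL=C awk '
    BEGIN {
      # Same mapping as tr/ACTGactg/TGACtgac/ in the Perl one-liner.
      c["A"] = "T"; c["C"] = "G"; c["G"] = "C"; c["T"] = "A"
      c["a"] = "t"; c["c"] = "g"; c["g"] = "c"; c["t"] = "a"
    }
    {
      seq = $1
      rc = ""
      # Build the reverse complement by walking the sequence backwards.
      for (i = length(seq); i >= 1; i--) {
        ch = substr(seq, i, 1)
        rc = rc ((ch in c) ? c[ch] : ch)
      }
      # Appending "" forces a string comparison, like the Perl "ge" operator.
      print ((seq "") >= (rc "")) ? seq : rc
    }' | LC_ALL=C sort -u
}
```

Usage would look like `grep --text -v '^>' input.fa | canonical_unique`, where input.fa is a placeholder file name. Whether this is practical at the scale of a 1 Gbp assembly is an assumption that has not been tested here.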

alekseyzimin commented 6 years ago

Hi, this step should not use that much RAM. What is your coverage of the genome? I do not recommend using more than 100x of Illumina data.
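
Coverage here means total sequenced bases divided by genome size, so for a 1 Gbp genome, 100x corresponds to about 100 Gbp of Illumina reads. A minimal sketch for estimating it follows, assuming uncompressed FASTQ files with four lines per record. The function name estimate_coverage and the file names in the usage line are hypothetical.

```shell
# Hypothetical helper: estimate coverage as total read bases / genome size.
# Usage: estimate_coverage GENOME_SIZE_BP reads_1.fastq [reads_2.fastq ...]
# Assumes plain 4-line FASTQ records (sequence on line 2 of each record).
estimate_coverage() {
  genome_size=$1
  shift
  awk -v g="$genome_size" '
    FNR % 4 == 2 { bases += length($0) }   # FNR restarts for each input file
    END { printf "%.1f\n", bases / g }
  ' "$@"
}
```

For example, `estimate_coverage 1000000000 reads_1.fastq reads_2.fastq` prints the estimated depth for a 1 Gbp genome. If the result is well above 100, subsampling the reads before assembly would bring the input in line with the recommendation above.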