adigenova / wengan

An accurate and ultra-fast hybrid genome assembler
GNU Affero General Public License v3.0

make: *** [asm1.SPolished.asm.wengan.fasta] Error 134 #47

Closed by mictadlo 3 years ago

mictadlo commented 3 years ago

Hi, I tried to assemble a 3GB allotetraploid plant genome. However, it crashed on 5.5 GB of RAM.

export MALLOC_PER_THREAD=1
/lustre/work-lustre/waterhouse_team/apps/wengan-v0.2-bin-Linux/bin/DiscovarExp READS="1740D-43-06_S0_L001_R1.fastq.gz,1740D-43-06_S0_L001_R2.fastq.gz" OUT_DIR=/tmp/asm1D NUM_THREADS=14 2> asm1.Disco_denovo.err > asm1.Disco_denovo.log
cp -a /tmp/asm1D asm1D
ln -s asm1D/a.final/a.lines.fasta asm1.contigs-disco.fa
/lustre/work-lustre/waterhouse_team/apps/wengan-v0.2-bin-Linux/bin/seqtk cutN -n 1   asm1.contigs-disco.fa | /lustre/work-lustre/waterhouse_team/apps/wengan-v0.2-bin-Linux/bin/seqtk seq -L 200 -  | /lustre/work-lustre/waterhouse_team/apps/wengan-v0.2-bin-Linux/bin/seqtk iupac2bases -  | /lustre/work-lustre/waterhouse_team/apps/wengan-v0.2-bin-Linux/bin/seqtk rename - D | /lustre/work-lustre/waterhouse_team/apps/wengan-v0.2-bin-Linux/bin/seqtk seq -l 60 - > asm1.contigs.disco.fa
[L::iupac2bases] A total of 0 bases were changed.
/lustre/work-lustre/waterhouse_team/apps/wengan-v0.2-bin-Linux/bin/fastmin-sg shortr -c 50 -k 21 -w 10 -q 20 -r 50000 -t 14 asm1.contigs.disco.fa asm1.fms.txt 2>asm1.fms.err >asm1.fms.log
/lustre/work-lustre/waterhouse_team/apps/wengan-v0.2-bin-Linux/bin/fastmin-sg pacraw -k 20 -w 5 -q 40 -m 150 -r 300 -I 500,1000,2000  -t 14 asm1.contigs.disco.fa asm1.fml.im.txt 2>asm1.fml.im.err >asm1.fml.im.log
rm -f longreads.asm1.im.1.fa
/lustre/work-lustre/waterhouse_team/apps/wengan-v0.2-bin-Linux/bin/intervalmiss -d 1 --clib 1  -t 14 -s asm1.fms.sams.txt -c  asm1.contigs.disco.fa -p  asm1 2>asm1.im.err >asm1.im.log
grep ">" asm1.MBC1.msplit.fa | sed 's/>//' | awk '{print $1" "$2}' | sed 's/>//g' > asm1.MBC1.msplit.cov.txt
/lustre/work-lustre/waterhouse_team/apps/wengan-v0.2-bin-Linux/bin/fastmin-sg pacraw -k 20 -w 5 -q 40 -m 150 -r 300 -t  14 -p asm1 -I  500,1000,2000,3000,4000,5000,6000,7000,8000,10000,15000,20000 asm1.MBC1.msplit.fa asm1.fml.txt 2>asm1.fml.err >asm1.fml.log
/lustre/work-lustre/waterhouse_team/apps/wengan-v0.2-bin-Linux/bin/liger  --mlp 10000  --mit 20000000  -t 14 -c  asm1.MBC1.msplit.fa -l  longreads.asm11.fa -d asm1.MBC1.msplit.cov.txt -p asm1 -s asm1.sams.txt 2>asm1.liger.err >asm1.liger.log
/bin/sh: line 1: 271749 Aborted                 (core dumped) /lustre/work-lustre/waterhouse_team/apps/wengan-v0.2-bin-Linux/bin/liger --mlp 10000 --mit 20000000 -t 14 -c asm1.MBC1.msplit.fa -l longreads.asm11.fa -d asm1.MBC1.msplit.cov.txt -p asm1 -s asm1.sams.txt 2> asm1.liger.err > asm1.liger.log
asm1.mk:50: recipe for target 'asm1.SPolished.asm.wengan.fasta' failed
make: *** [asm1.SPolished.asm.wengan.fasta] Error 134

How is it possible to reduce the memory requirements?

Thank you in advance,

Michal

adigenova commented 3 years ago

Hi Michal,

Can you please upload the following log files, to check what's going on:

  1. asm1.fml.err
  2. asm1.fml.log
  3. asm1.liger.err
  4. asm1.liger.log

Best, Alex

mictadlo commented 3 years ago

Hi Alex, please find the requested files attached: asm1.gz

Thank you in advance

Michal

adigenova commented 3 years ago

Hi Michal,

I checked the logs: Wengan stopped because the long-read sequences contain N or another invalid character. Wengan gives the following message:

terminate called after throwing an instance of 'std::invalid_argument'
  what():  invalid DNA base found in DnaBitset class
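This message fits the "Error 134" reported by make: by the usual shell convention, an exit status above 128 means the process was ended by signal number (status - 128). A minimal sketch of that decoding follows; `decode_exit` is a hypothetical helper name, not part of Wengan.

```shell
# Decode a shell/make exit status into a signal name.
# Convention: status > 128 means "terminated by signal (status - 128)".
decode_exit() {
    # $1 = exit status as reported by make or the shell
    if [ "$1" -gt 128 ]; then
        kill -l "$(( $1 - 128 ))"
    else
        echo "exit $1 (not a signal)"
    fi
}

decode_exit 134   # status of the liger step in this thread
decode_exit 137   # status of the later DiscovarExp step in this thread
```

Status 134 corresponds to signal 6 (SIGABRT), which is what an uncaught C++ exception raises through `terminate`, so the first failure does not look like a memory problem. Status 137 corresponds to signal 9 (SIGKILL), which is typical of a job killed from outside, for example by a scheduler memory limit.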

You can replace the invalid characters in the long-read sequences with my seqtk fork, using the command:

seqtk iupac2basesA long-reads.fastq.gz > long-reads.clean.fq.gz

Then you can resume the assembly using the cleaned long-read file.
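Before resuming, it can help to confirm that the cleaned file no longer contains invalid characters. The sketch below counts them in FASTA input using only awk. The function name `count_invalid_bases` is hypothetical, and treating only A, C, G and T (in either case) as valid is an assumption about what Wengan accepts.

```shell
# Count characters other than A, C, G, T (upper or lower case) in the
# sequence lines of a FASTA stream. Header lines (starting with ">") are skipped.
# Reads the named files, or standard input when no file is given.
count_invalid_bases() {
    awk '!/^>/ { n += gsub(/[^ACGTacgt]/, "") } END { print n + 0 }' "$@"
}

# Small demonstration: N, R and Y are counted as invalid.
printf '>r1\nACGTN\n>r2\nacgtRY\n' | count_invalid_bases   # prints 3
```

For a compressed FASTA such as the long-read file named in this thread, the function can be fed from a pipe: `gzip -dc allPacBio-ONT.clean.fasta.gz | count_invalid_bases`. A result of 0 suggests the file is clean. FASTQ input would need converting to FASTA first, because quality lines would otherwise be counted.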

Best, Alex

adigenova commented 3 years ago

Feel free to reopen if you have further questions! Best, Alex

mictadlo commented 3 years ago

Hi Alex, I used 3 TB of memory and it was still not enough. Is there a way to reduce the amount of memory?

export MALLOC_PER_THREAD=1
/lustre/work-lustre/waterhouse_team/apps/wengan-v0.2-bin-Linux/bin/DiscovarExp READS="1740D-43-06_S0_L001_R1.fastq.gz,1740D-43-06_S0_L001_R2.fastq.gz" OUT_DIR=/tmp/asm1D NUM_THREADS=8 2> asm1.Disco_denovo.err > asm1.Disco_denovo.log
/bin/sh: line 1: 153941 Killed                  /lustre/work-lustre/waterhouse_team/apps/wengan-v0.2-bin-Linux/bin/DiscovarExp READS="1740D-43-06_S0_L001_R1.fastq.gz,1740D-43-06_S0_L001_R2.fastq.gz" OUT_DIR=/tmp/asm1D NUM_THREADS=8 2> asm1.Disco_denovo.err > asm1.Disco_denovo.log
asm1.mk:4: recipe for target 'asm1.contigs-disco.fa' failed
make: *** [asm1.contigs-disco.fa] Error 137
PBS Job 9607252.pbs
CPU time  : 18:48:50
Wall time : 03:38:03
Mem usage : 3145728000kb
> cat asm1.mk
.DELETE_ON_ERROR:
#Wengan automatic generated makefile
asm1.contigs-disco.fa : 
    export MALLOC_PER_THREAD=1
    /lustre/work-lustre/waterhouse_team/apps/wengan-v0.2-bin-Linux/bin/DiscovarExp READS="1740D-43-06_S0_L001_R1.fastq.gz,1740D-43-06_S0_L001_R2.fastq.gz" OUT_DIR=/tmp/asm1D NUM_THREADS=8 2> asm1.Disco_denovo.err > asm1.Disco_denovo.log
    cp -a /tmp/asm1D asm1D
    ln -s asm1D/a.final/a.lines.fasta asm1.contigs-disco.fa

asm1.contigs.disco.fa : asm1.contigs-disco.fa
    /lustre/work-lustre/waterhouse_team/apps/wengan-v0.2-bin-Linux/bin/seqtk cutN -n 1   asm1.contigs-disco.fa | /lustre/work-lustre/waterhouse_team/apps/wengan-v0.2-bin-Linux/bin/seqtk seq -L 200 -  | /lustre/work-lustre/waterhouse_team/apps/wengan-v0.2-bin-Linux/bin/seqtk iupac2bases -  | /lustre/work-lustre/waterhouse_team/apps/wengan-v0.2-bin-Linux/bin/seqtk rename - D | /lustre/work-lustre/waterhouse_team/apps/wengan-v0.2-bin-Linux/bin/seqtk seq -l 60 - > asm1.contigs.disco.fa

asm11.fm.sam : asm1.contigs.disco.fa
    @echo asm11 1740D-43-06_S0_L001_R1.fastq.gz 1740D-43-06_S0_L001_R2.fastq.gz  >  asm1.fms.txt
    /lustre/work-lustre/waterhouse_team/apps/wengan-v0.2-bin-Linux/bin/fastmin-sg shortr -c 50 -k 21 -w 10 -q 20 -r 50000 -t 8 asm1.contigs.disco.fa asm1.fms.txt 2>asm1.fms.err >asm1.fms.log

asm1.im.1.I1000.fm.sam : asm1.im.1.I500.fm.sam
asm1.im.1.I2000.fm.sam : asm1.im.1.I1000.fm.sam
asm1.im.1.I500.fm.sam : asm11.fm.sam
    @echo asm1.im.1 allPacBio-ONT.clean.fasta.gz  >  asm1.fml.im.txt
    /lustre/work-lustre/waterhouse_team/apps/wengan-v0.2-bin-Linux/bin/fastmin-sg pacraw -k 20 -w 5 -q 40 -m 150 -r 300 -I 500,1000,2000  -t 8 asm1.contigs.disco.fa asm1.fml.im.txt 2>asm1.fml.im.err >asm1.fml.im.log
    -rm -f longreads.asm1.im.1.fa

asm1.MBC1.msplit.fa : asm1.contigs.disco.fa asm11.fm.sam asm1.im.1.I500.fm.sam asm1.im.1.I1000.fm.sam asm1.im.1.I2000.fm.sam
    @echo "asm11.fm.sam 0"  >  asm1.fms.sams.txt
    @echo "asm1.im.1.I500.fm.sam    1"  >>  asm1.fms.sams.txt
    @echo "asm1.im.1.I1000.fm.sam   1"  >>  asm1.fms.sams.txt
    @echo "asm1.im.1.I2000.fm.sam   1"  >>  asm1.fms.sams.txt
    /lustre/work-lustre/waterhouse_team/apps/wengan-v0.2-bin-Linux/bin/intervalmiss -d 1 --clib 1  -t 8 -s asm1.fms.sams.txt -c  asm1.contigs.disco.fa -p  asm1 2>asm1.im.err >asm1.im.log

asm1.MBC1.msplit.cov.txt : asm1.MBC1.msplit.fa
    grep ">" asm1.MBC1.msplit.fa | sed 's/>//' | awk '{print $$1" "$$2}' | sed 's/>//g' > asm1.MBC1.msplit.cov.txt

asm11.I500.fm.sam : longreads.asm11.fa
asm11.I1000.fm.sam : asm11.I500.fm.sam
asm11.I2000.fm.sam : asm11.I1000.fm.sam
asm11.I3000.fm.sam : asm11.I2000.fm.sam
asm11.I4000.fm.sam : asm11.I3000.fm.sam
asm11.I5000.fm.sam : asm11.I4000.fm.sam
asm11.I6000.fm.sam : asm11.I5000.fm.sam
asm11.I7000.fm.sam : asm11.I6000.fm.sam
asm11.I8000.fm.sam : asm11.I7000.fm.sam
asm11.I10000.fm.sam : asm11.I8000.fm.sam
asm11.I15000.fm.sam : asm11.I10000.fm.sam
asm11.I20000.fm.sam : asm11.I15000.fm.sam
longreads.asm11.fa : asm1.MBC1.msplit.fa asm1.MBC1.msplit.cov.txt
    @echo asm11 allPacBio-ONT.clean.fasta.gz  >  asm1.fml.txt
    /lustre/work-lustre/waterhouse_team/apps/wengan-v0.2-bin-Linux/bin/fastmin-sg pacraw -k 20 -w 5 -q 40 -m 150 -r 300 -t  8 -p asm1 -I  500,1000,2000,3000,4000,5000,6000,7000,8000,10000,15000,20000 asm1.MBC1.msplit.fa asm1.fml.txt 2>asm1.fml.err >asm1.fml.log

asm1.SPolished.asm.wengan.fasta : asm1.MBC1.msplit.fa longreads.asm11.fa asm11.I500.fm.sam asm11.I1000.fm.sam asm11.I2000.fm.sam asm11.I3000.fm.sam asm11.I4000.fm.sam asm11.I5000.fm.sam asm11.I6000.fm.sam asm11.I7000.fm.sam asm11.I8000.fm.sam asm11.I10000.fm.sam asm11.I15000.fm.sam asm11.I20000.fm.sam
    @echo asm11.I500.fm.sam  >  asm1.sams.txt
    @echo asm11.I1000.fm.sam  >>  asm1.sams.txt
    @echo asm11.I2000.fm.sam  >>  asm1.sams.txt
    @echo asm11.I3000.fm.sam  >>  asm1.sams.txt
    @echo asm11.I4000.fm.sam  >>  asm1.sams.txt
    @echo asm11.I5000.fm.sam  >>  asm1.sams.txt
    @echo asm11.I6000.fm.sam  >>  asm1.sams.txt
    @echo asm11.I7000.fm.sam  >>  asm1.sams.txt
    @echo "asm11.I8000.fm.sam   8000"  >>  asm1.sams.txt
    @echo "asm11.I10000.fm.sam  10000"  >>  asm1.sams.txt
    @echo "asm11.I15000.fm.sam  15000"  >>  asm1.sams.txt
    @echo "asm11.I20000.fm.sam  20000"  >>  asm1.sams.txt
    /lustre/work-lustre/waterhouse_team/apps/wengan-v0.2-bin-Linux/bin/liger  --mlp 10000  --mit 20000000  -t 8 -c  asm1.MBC1.msplit.fa -l  longreads.asm11.fa -d asm1.MBC1.msplit.cov.txt -p asm1 -s asm1.sams.txt 2>asm1.liger.err >asm1.liger.log

all : asm1.SPolished.asm.wengan.fasta
clean :
    -rm -f asm1.contigs-disco.fa asm1.contigs.disco.fa asm11.fm.sam asm1.im.1.I500.fm.sam asm1.im.1.I1000.fm.sam asm1.im.1.I2000.fm.sam asm1.MBC1.msplit.fa asm1.MBC1.msplit.cov.txt longreads.asm11.fa asm11.I500.fm.sam asm11.I1000.fm.sam asm11.I2000.fm.sam asm11.I3000.fm.sam asm11.I4000.fm.sam asm11.I5000.fm.sam asm11.I6000.fm.sam asm11.I7000.fm.sam asm11.I8000.fm.sam asm11.I10000.fm.sam asm11.I15000.fm.sam asm11.I20000.fm.sam asm1.SPolished.asm.wengan.fasta
> cat asm1.Disco_denovo.log
Performing re-exec to adjust stack size.

--------------------------------------------------------------------------------
Sat Jun 12 03:22:39 2021 run on cl4n001, pid=153941 [Nov  5 2019 06:43:49 R51885 ]
DiscovarExp                                                                    \
            READS="1740D-43-06_S0_L001_R1.fastq.gz,1740D-43-06_S0_L001_R2.fast \
            q.gz" OUT_DIR=/tmp/asm1D NUM_THREADS=8
--------------------------------------------------------------------------------

Sat Jun 12 03:22:39 2021: Warning: recommend doing 'setenv MALLOC_PER_THREAD 1'
Sat Jun 12 03:22:39 2021: before Discovar, to improve computational performance.

INPUT FILES:
[1a,type=frag,sample=C,lib=1,frac=1] 1740D-43-06_S0_L001_R1.fastq.gz
[1b,type=frag,sample=C,lib=1,frac=1] 1740D-43-06_S0_L001_R2.fastq.gz

Sat Jun 12 06:27:13 2021: using 1,094,669,784 reads
Sat Jun 12 06:27:13 2021: data extraction complete
3.08 hours used extracting reads
Sat Jun 12 06:27:13 2021: see total physical memory of 6,222,716,518,400 bytes
Sat Jun 12 06:27:13 2021: 37.65 bytes per read base, assuming max memory available
We need 1 passes.
Expect 2582287126 keys per batch.
Provide 3204227598 keys per batch.

Have I done anything wrong?
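The short-read coverage can be estimated roughly from the numbers in the DiscovarExp log. The sketch below makes two assumptions. First, it reads "bytes per read base" as total physical memory divided by the total number of read bases. Second, it takes the genome size as 3 Gb, from the "3GB" in the original report. `estimate_coverage` is a hypothetical helper name.

```shell
# Estimate short-read coverage from values printed in the DiscovarExp log.
# total read bases = physical memory (bytes) / bytes per read base   (assumed reading of the log)
# coverage         = total read bases / genome size
estimate_coverage() {
    # $1 = total physical memory in bytes
    # $2 = bytes per read base
    # $3 = genome size in bases (assumed)
    awk -v mem="$1" -v bpb="$2" -v g="$3" 'BEGIN { printf "%.1f\n", mem / bpb / g }'
}

# Values taken from the log above; the 3 Gb genome size is an assumption.
estimate_coverage 6222716518400 37.65 3000000000
```

Under these assumptions the estimate is close to 55X. Dividing the same total by the 1,094,669,784 reads gives a mean read length of about 151 bases.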

adigenova commented 3 years ago

Hi Michal,

The recommended coverage for DiscovarDenovo is 60X; if you have more coverage, you need to subsample the short reads. Judging from the number of reads, you have about 56X of short-read coverage, which is fine. However, you need to set DiscovarDenovo's MAX_MEM_GB to 500 or 600 (GB); otherwise, it will try to use the maximum memory available on the machine, which seems to be 6 TB. To set the maximum memory, manually edit the makefile generated by Wengan (the prefix.mk file) and then run the pipeline with:

make -f prefix.mk all
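The manual edit can also be scripted. The sketch below assumes that MAX_MEM_GB is passed to DiscovarExp in the same KEY=value form as READS, OUT_DIR and NUM_THREADS in the generated makefile. `add_max_mem` is a hypothetical helper name, and the output file name `asm1.capped.mk` is only illustrative.

```shell
# Append MAX_MEM_GB=<cap> after NUM_THREADS=<n> on every line that calls DiscovarExp.
# Writes the edited makefile to standard output; the input is left untouched.
add_max_mem() {
    cap_gb=$1   # memory cap in GB, e.g. 500
    shift       # remaining arguments: makefile(s); standard input if none
    sed "/DiscovarExp/ s/ NUM_THREADS=\([0-9][0-9]*\)/ NUM_THREADS=\1 MAX_MEM_GB=${cap_gb}/" "$@"
}

# Demonstration on a single recipe line shaped like the one in asm1.mk.
printf '    /path/to/DiscovarExp READS="R1.fq.gz,R2.fq.gz" OUT_DIR=/tmp/asm1D NUM_THREADS=8 2> err > log\n' \
    | add_max_mem 500
```

On the real file this would be used as `add_max_mem 500 asm1.mk > asm1.capped.mk`, followed by `make -f asm1.capped.mk all`. The function is not idempotent: running it on an already edited makefile adds the argument a second time.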

Hope this helps,

Best, Alex
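For data sets above the recommended 60X, the fraction of reads to keep is the target coverage divided by the current coverage. The sketch below computes that fraction; `subsample_fraction` is a hypothetical helper name. The commented `seqtk sample` lines assume that the Wengan seqtk fork keeps the upstream `sample` subcommand, and the file names in them are placeholders.

```shell
# Fraction of reads to keep so that coverage drops from <current> to <target>.
# The result is capped at 1 (no subsampling needed when already at or below target).
subsample_fraction() {
    # $1 = current coverage (must be > 0), $2 = target coverage
    awk -v cur="$1" -v tgt="$2" 'BEGIN { f = tgt / cur; if (f > 1) f = 1; printf "%.3f\n", f }'
}

# Example: reducing a hypothetical 90X data set to 60X.
frac=$(subsample_fraction 90 60)
echo "$frac"

# Assumed usage with seqtk (the same seed keeps read pairs in sync):
# seqtk sample -s 11 reads_R1.fastq.gz "$frac" | gzip > sub_R1.fastq.gz
# seqtk sample -s 11 reads_R2.fastq.gz "$frac" | gzip > sub_R2.fastq.gz
```

With the roughly 56X estimated in this thread the fraction is capped at 1, so no subsampling is needed.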