alekseyzimin / masurca


k-unitigs with k=31 not finishing #179

Open olivierarmant opened 4 years ago

olivierarmant commented 4 years ago

Dear MaSuRCA developers, dear all. First of all, thank you very much for developing this very nice suite, which is very useful for the community :) I am seeing a very long lag of more than 15 days at the step creating k-unitigs. Is this normal, or should I kill this job and rerun it with less mate-pair data?

Your feedback would be much appreciated! Olivier A

Here is the log:

```
armant-oli@cadux-genomik:/TMPLOCAL/olivier_data/Seq_nottrimmed/masurca$ cat nohup.out
[Thu Jun 11 13:47:49 CEST 2020] Processing pe library reads
[Thu Jun 11 17:20:33 CEST 2020] Processing sj library reads
[Thu Jun 11 19:14:48 CEST 2020] Average PE read length 175
[Thu Jun 11 19:14:49 CEST 2020] Using kmer size of 67 for the graph
[Thu Jun 11 19:14:49 CEST 2020] MIN_Q_CHAR: 33
[Thu Jun 11 19:14:50 CEST 2020] Creating mer database for Quorum
[Thu Jun 11 21:58:17 CEST 2020] Error correct PE
[Fri Jun 12 03:56:09 CEST 2020] Error correct JUMP
[Fri Jun 12 06:22:37 CEST 2020] Estimating genome size
[Fri Jun 12 08:52:17 CEST 2020] Estimated genome size: 5938548948
[Fri Jun 12 08:52:17 CEST 2020] Creating k-unitigs with k=67
[Fri Jun 12 18:36:45 CEST 2020] Creating k-unitigs with k=31
```

=> the k-unitigs step has now been running for more than 15 days.
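Not part of the original report, but a quick way to tell whether the k=31 step is still working rather than stalled is to check that the create_k_unitigs_large_k process (and the Perl deduplication it pipes into) is still accumulating CPU time, e.g.:

```bash
# Hypothetical progress check, run on the assembly machine:
# a healthy step shows high %CPU and a steadily growing cumulative TIME.
ps aux | grep -E 'create_k_unitigs|perl' | grep -v grep
```

If %CPU sits near zero for hours, the step is likely stuck rather than merely slow.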

And here is the assemble.sh file, with the commands reached so far:

```bash
#!/bin/bash
# assemble.sh generated by masurca

CONFIG_PATH="/TMPLOCAL/olivier_data/Seq_nottrimmed/masurca/config.mascura"
CMD_PATH="/TMPLOCAL/MaSuRCA-3.4.1/bin/masurca"
set -o pipefail

# Test that we support <() redirection
(eval "cat <(echo test) >/dev/null" 2>/dev/null) || {
  echo >&2 "ERROR: The shell used is missing important features."
  echo >&2 "       Run the assembly script directly as './$0'"
  exit 1
}
```

```bash
# Parse command line switches
while getopts ":rc" o; do
  case "${o}" in
    c)
      echo "configuration file is '$CONFIG_PATH'"
      exit 0
      ;;
    r)
      echo "Rerunning configuration"
      exec perl "$CMD_PATH" "$CONFIG_PATH"
      echo "Failed to rerun configuration"
      exit 1
      ;;
    *)
      echo "Usage: $0 [-r] [-c]"
      exit 1
      ;;
  esac
done
set +e
```
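In other words (my reading of the switch parsing, not output from the thread): `-c` prints which config file the script was generated from, and `-r` re-runs the MaSuRCA configuration step to regenerate assemble.sh:

```bash
./assemble.sh -c   # report the config file path this script was generated from
./assemble.sh -r   # regenerate this script from config.mascura via the masurca Perl driver
```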

```bash
# Set some paths and prime system to save environment variables
save () {
  (echo -n "$1=\""; eval "echo -n \"\$$1\""; echo '"') >> environment.sh
}
GC=
RC=
NC=
if tty -s < /dev/fd/1 2> /dev/null; then
  GC='\e[0;32m'
  RC='\e[0;31m'
  NC='\e[0m'
fi
log () {
  d=$(date)
  echo -e "${GC}[$d]${NC} $@"
}
fail () {
  d=$(date)
  echo -e "${RC}[$d]${NC} $@"
  exit 1
}
signaled () {
  fail Interrupted
}
trap signaled TERM QUIT INT
rm -f environment.sh; touch environment.sh
```
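For context (my illustration): `save VAR` appends a quoted shell assignment to environment.sh, so later stages or reruns can source the recorded settings:

```bash
# Example of what `save` writes (assuming NUM_THREADS=96 is already set):
save NUM_THREADS
cat environment.sh
# -> NUM_THREADS="96"
```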

```bash
# To run tasks in parallel
run_bg () {
  semaphore -j $NUM_THREADS --id masurca_$$ -- "$@"
}

run_wait () {
  semaphore -j $NUM_THREADS --id masurca_$$ --wait
}
```
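A hypothetical illustration of how these helpers are used (not taken from this script): queue independent jobs on the shared semaphore with run_bg, then block until all of them have finished:

```bash
# Queue two independent checksum jobs, then wait for both to complete.
run_bg md5sum p1.renamed.fastq
run_bg md5sum p2.renamed.fastq
run_wait
```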

```bash
export PATH="/TMPLOCAL/MaSuRCA-3.4.1/bin/../CA8/Linux-amd64/bin:/TMPLOCAL/MaSuRCA-3.4.1/bin:$PATH"
save PATH
export PERL5LIB=/TMPLOCAL/MaSuRCA-3.4.1/bin/../lib/perl${PERL5LIB:+:$PERL5LIB}
save PERL5LIB
NUM_THREADS=96
save NUM_THREADS
log 'Processing pe library reads'
rm -rf meanAndStdevByPrefix.pe.txt
echo 'p1 350 150' >> meanAndStdevByPrefix.pe.txt
rename_filter_fastq 'p1' <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_350PE_250bp.1.fq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}') <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_350PE_250bp.2.fq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}' ) > 'p1.renamed.fastq'
echo 'p2 550 150' >> meanAndStdevByPrefix.pe.txt
rename_filter_fastq 'p2' <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_550PE_250bp.1.fq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}') <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_550PE_250bp.2.fq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}' ) > 'p2.renamed.fastq'
echo 'p3 350 150' >> meanAndStdevByPrefix.pe.txt
rename_filter_fastq 'p3' <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_350PE.R1.fastq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}') <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_350PE.R2.fastq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}' ) > 'p3.renamed.fastq'
echo 'p4 550 150' >> meanAndStdevByPrefix.pe.txt
rename_filter_fastq 'p4' <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_550PE.R1.fastq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}') <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_550PE.R2.fastq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}' ) > 'p4.renamed.fastq'
log 'Processing sj library reads'
rm -rf meanAndStdevByPrefix.sj.txt
echo 'm1 500 100' >> meanAndStdevByPrefix.sj.txt
rename_filter_fastq 'm1' <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_5kb.R1.fastq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}') <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_5kb.R2.fastq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}' ) > 'm1.renamed.fastq'
echo 'm2 500 100' >> meanAndStdevByPrefix.sj.txt
rename_filter_fastq 'm2' <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_8kb.R1.fastq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}') <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_8kb.R2.fastq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}' ) > 'm2.renamed.fastq'
echo 'm3 500 100' >> meanAndStdevByPrefix.sj.txt
rename_filter_fastq 'm3' <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_gelfree.R1.fastq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}') <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_gelfree.R2.fastq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}' ) > 'm3.renamed.fastq'
echo 'm4 500 100' >> meanAndStdevByPrefix.sj.txt
rename_filter_fastq 'm4' <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_5kb_250bp.1.fq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}') <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_5kb_250bp.2.fq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}' ) > 'm4.renamed.fastq'
echo 'm5 500 100' >> meanAndStdevByPrefix.sj.txt
rename_filter_fastq 'm5' <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_8kb_250bp.1.fq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}') <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_8kb_250bp.2.fq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}' ) > 'm5.renamed.fastq'
echo 'm6 500 100' >> meanAndStdevByPrefix.sj.txt
rename_filter_fastq 'm6' <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_gelfree_250bp.1.fq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}') <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_gelfree_250bp.2.fq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}' ) > 'm6.renamed.fastq'
```

```bash
head -q -n 40000 p1.renamed.fastq p2.renamed.fastq p3.renamed.fastq p4.renamed.fastq | grep --text -v '^+' | grep --text -v '^@' > pe_data.tmp
export PE_AVG_READ_LENGTH=`awk '{if(length($1)>31){n+=length($1);m++;}}END{print int(n/m)}' pe_data.tmp`
save PE_AVG_READ_LENGTH
log Average PE read length $PE_AVG_READ_LENGTH
KMER=`for f in p1.renamed.fastq p2.renamed.fastq p3.renamed.fastq p4.renamed.fastq;do head -n 80000 $f |tail -n 40000;done | perl -e 'while($line=<STDIN>){$line=<STDIN>;chomp($line);push(@lines,$line);for($i=0;$i<6;$i++){$line=<STDIN>;}}$min_len=100000;$base_count=0;foreach $l(@lines){$base_count+=length($l);push(@lengths,length($l));@f=split("",$l);foreach $base(@f){if(uc($base) eq "G" || uc($base) eq "C"){$gc_count++}}} @lengths =sort {$b <=> $a} @lengths; $min_len=$lengths[int($#lengths*.75)]; $gc_ratio=$gc_count/$base_count;$kmer=0;if($gc_ratio>=0.35 && $gc_ratio<=0.6){$kmer=int($min_len*.66);}else{$kmer=int($min_len*.33);} $kmer++ if($kmer%2==0); $kmer=31 if($kmer<31); $kmer=127 if($kmer>127); print $kmer'`
save KMER
log Using kmer size of $KMER for the graph
KMER_J=31
save KMER_J
MIN_Q_CHAR=`cat p1.renamed.fastq p2.renamed.fastq p3.renamed.fastq p4.renamed.fastq |head -n 50000 | awk 'BEGIN{flag=0}{if($0 ~ /^\+/){flag=1}else if(flag==1){print $0;flag=0}}' | perl -ne 'BEGIN{$q0_char="@";}{chomp;@f=split "";foreach $v(@f){if(ord($v)<ord($q0_char)){$q0_char=$v;}}}END{$ans=ord($q0_char);if($ans<64){print "33\n"}else{print "64\n"}}'`
save MIN_Q_CHAR
log MIN_Q_CHAR: $MIN_Q_CHAR
JF_SIZE=`ls -l *.fastq | awk '{n+=$5}END{s=int(n/50); if(s>60000000000)printf "%.0f",s;else print "60000000000";}'`
save JF_SIZE
perl -e '{if(int('$JF_SIZE')>60000000000){print "WARNING: JF_SIZE set too low, increasing JF_SIZE to at least '$JF_SIZE', this automatic increase may be not enough!\n"}}'
log Creating mer database for Quorum
awk '{print substr($0,1,200)}' p1.renamed.fastq p2.renamed.fastq p3.renamed.fastq p4.renamed.fastq | quorum_create_database -t 96 -s $JF_SIZE -b 7 -m 24 -q $((MIN_Q_CHAR + 5)) -o quorum_mer_db.jf.tmp /dev/stdin && mv quorum_mer_db.jf.tmp quorum_mer_db.jf
if [ 0 != 0 ]; then
  fail Increase JF_SIZE in config file, the recommendation is to set this to genome_size*coverage/2
fi
```
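To unpack the KMER heuristic above (my arithmetic, using values inferred from the log rather than measured from the data): when the sample's GC ratio falls in [0.35, 0.6], k = int(0.66 * min_len), where min_len is the 75th-percentile read length; the logged k=67 is consistent with min_len around 101 bp, since int(101 * 0.66) = 66 is even and gets bumped to 67, then clamped to [31, 127]:

```bash
# Worked example of the k selection rule (min_len=101 is an assumption):
perl -e '$min_len=101; $k=int($min_len*0.66); $k++ if $k%2==0; $k=31 if $k<31; $k=127 if $k>127; print "$k\n"'
# prints 67
```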

```bash
log Error correct PE
quorum_error_correct_reads -q $(($MIN_Q_CHAR + 40)) --contaminant=/TMPLOCAL/MaSuRCA-3.4.1/bin/../share/adapter.jf -m 1 -s 1 -g 1 -a 3 -t 96 -w 10 -e 3 -M quorum_mer_db.jf p1.renamed.fastq p2.renamed.fastq p3.renamed.fastq p4.renamed.fastq --no-discard -o pe.cor.tmp --verbose 1>quorum.err 2>&1 && mv pe.cor.tmp.fa pe.cor.fa || fail Error correction of PE reads failed. Check pe.cor.log.

log Error correct JUMP
quorum_error_correct_reads -q $(($MIN_Q_CHAR + 40)) --contaminant=/TMPLOCAL/MaSuRCA-3.4.1/bin/../share/adapter.jf -m 1 -s 1 -g 2 -a 3 -t 96 -w 10 -e 3 -M quorum_mer_db.jf m1.renamed.fastq m2.renamed.fastq m3.renamed.fastq m4.renamed.fastq m5.renamed.fastq m6.renamed.fastq --no-discard -o sj.cor.tmp --verbose 1>quorum.err 2>&1 && mv sj.cor.tmp.fa sj.cor.fa || fail Error correction of JUMP reads failed. Check sj.cor.log.
```

```bash
if [ -s ESTIMATED_GENOME_SIZE.txt ];then
  ESTIMATED_GENOME_SIZE=`head -n 1 ESTIMATED_GENOME_SIZE.txt`
else
  log Estimating genome size
  jellyfish count -m 31 -t 96 -C -s $JF_SIZE -o k_u_hash_0 pe.cor.fa sj.cor.fa
  export ESTIMATED_GENOME_SIZE=`jellyfish histo -t 96 -h 1 k_u_hash_0 | tail -n 1 |awk '{print $2}'`
  echo $ESTIMATED_GENOME_SIZE > ESTIMATED_GENOME_SIZE.txt
fi
save ESTIMATED_GENOME_SIZE
log "Estimated genome size: $ESTIMATED_GENOME_SIZE"
```
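For context (my reading of the step above, not a statement from the thread): with `-h 1` the jellyfish histogram is collapsed so that its last line's second column is essentially the number of distinct canonical 31-mers in the corrected reads, which MaSuRCA takes as the genome size estimate, here 5,938,548,948 (~5.9 Gbp). Residual sequencing errors or contamination add spurious distinct k-mers and can inflate this figure, so it may be worth eyeballing a fuller spectrum:

```bash
# Sanity check (my suggestion): a full k-mer frequency histogram from the same hash.
jellyfish histo -t 96 k_u_hash_0 | head -n 30
```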

```bash
log Creating k-unitigs with k=$KMER
create_k_unitigs_large_k -c $(($KMER-1)) -t 96 -m $KMER -n $(($ESTIMATED_GENOME_SIZE*2)) -l $KMER -f `perl -e 'print 1/'$KMER'/1e5'` pe.cor.fa sj.cor.fa | grep --text -v '^>' | perl -ane '{$seq=$F[0]; $F[0]=~tr/ACTGactg/TGACtgac/;$revseq=reverse($F[0]); $h{($seq ge $revseq)?$seq:$revseq}=1;}END{$n=0;foreach $k(keys %h){print ">",$n++," length:",length($k),"\n$k\n"}}' > guillaumeKUnitigsAtLeast32bases_all.fasta.tmp && mv guillaumeKUnitigsAtLeast32bases_all.fasta.tmp guillaumeKUnitigsAtLeast32bases_all.fasta
if [[ $KMER -eq $KMER_J ]];then
  ln -s guillaumeKUnitigsAtLeast32bases_all.fasta guillaumeKUnitigsAtLeast32bases_all.jump.fasta
else
  log Creating k-unitigs with k=$KMER_J
  create_k_unitigs_large_k -c $(($KMER_J-1)) -t 96 -m $KMER_J -n $(($ESTIMATED_GENOME_SIZE*2)) -l $KMER_J -f `perl -e 'print 1/'$KMER_J'/1e5'` pe.cor.fa sj.cor.fa | grep --text -v '^>' | perl -ane '{$seq=$F[0]; $F[0]=~tr/ACTGactg/TGACtgac/;$revseq=reverse($F[0]); $h{($seq ge $revseq)?$seq:$revseq}=1;}END{$n=0;foreach $k(keys %h){print ">",$n++," length:",length($k),"\n$k\n"}}' > guillaumeKUnitigsAtLeast32bases_all.jump.fasta.tmp && mv guillaumeKUnitigsAtLeast32bases_all.jump.fasta.tmp guillaumeKUnitigsAtLeast32bases_all.jump.fasta
fi
```
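One observation about this step (mine, not from the thread): the Perl deduplication holds every k-unitig sequence in an in-memory hash and prints nothing until its END block, so for an estimated ~6 Gbp genome at k=31 the pipeline can need a very large amount of RAM, and the .jump.fasta.tmp output file stays empty the whole time. If the machine has started swapping, the step can look alive while making almost no progress; a quick check:

```bash
# Check for memory pressure / swapping on the assembly machine (my suggestion):
free -h        # overall RAM and swap usage
vmstat 5 3     # nonzero si/so columns across samples indicate active swapping
```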

alekseyzimin commented 3 years ago

What is your genome size expectation? MaSuRCA estimated 6Gbp. Please make sure you are not using too much coverage in Illumina data -- over 100x is excessive.
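A quick way to check that (my sketch, not from the thread): total the sequenced bases in the renamed FASTQ files and divide by the estimated genome size:

```bash
# Rough Illumina coverage estimate; assumes 4-line records in *.renamed.fastq
# and uses the 5,938,548,948 bp estimate from the log.
awk 'NR%4==2{n+=length($0)}END{printf "%.1f Gbp total, %.0fx coverage\n", n/1e9, n/5938548948}' *.renamed.fastq
```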