Dear masurca developpers, dear all
First of all, thank you very much for developing this very nice suite, very usefull for the community :)
I encounter a very long lag time of more that 15 days at the step creating unitgs, and would like to know if this is normal or not. Is it normal, or should I kill this job with less mate pairs data?
Your feedback would be much appreciated!
Olivier A
Here is the log:
armant-oli@cadux-genomik:/TMPLOCAL/olivier_data/Seq_nottrimmed/masurca$ cat nohup.out
[Thu Jun 11 13:47:49 CEST 2020] Processing pe library reads
[Thu Jun 11 17:20:33 CEST 2020] Processing sj library reads
[Thu Jun 11 19:14:48 CEST 2020] Average PE read length 175
[Thu Jun 11 19:14:49 CEST 2020] Using kmer size of 67 for the graph
[Thu Jun 11 19:14:49 CEST 2020] MIN_Q_CHAR: 33
[Thu Jun 11 19:14:50 CEST 2020] Creating mer database for Quorum
[Thu Jun 11 21:58:17 CEST 2020] Error correct PE
[Fri Jun 12 03:56:09 CEST 2020] Error correct JUMP
[Fri Jun 12 06:22:37 CEST 2020] Estimating genome size
[Fri Jun 12 08:52:17 CEST 2020] Estimated genome size: 5938548948
[Fri Jun 12 08:52:17 CEST 2020] Creating k-unitigs with k=67
[Fri Jun 12 18:36:45 CEST 2020] Creating k-unitigs with k=31
=> k-unitgs running for more than 15 days
And here is the assembl.sh file, and the commands reached for now
!/bin/bash
assemble.sh generated by masurca
CONFIG_PATH="/TMPLOCAL/olivier_data/Seq_nottrimmed/masurca/config.mascura"
CMD_PATH="/TMPLOCAL/MaSuRCA-3.4.1/bin/masurca"
set -o pipefail
Test that we support <() redirection
(eval "cat <(echo test) >/dev/null" 2>/dev/null) || {
echo >&2 "ERROR: The shell used is missing important features."
echo >&2 " Run the assembly script directly as './$0'"
exit 1
}
Parse command line switches
while getopts ":rc" o; do
case "${o}" in
c)
echo "configuration file is '$CONFIG_PATH'"
exit 0
;;
r)
echo "Rerunning configuration"
exec perl "$CMD_PATH" "$CONFIG_PATH"
echo "Failed to rerun configuration"
exit 1
;;
*)
echo "Usage: $0 [-r] [-c]"
exit 1
;;
esac
done
set +e
Set some paths and prime system to save environment variables
What is your genome size expectation? MaSuRCA estimated 6Gbp. Please make sure you are not using too much coverage in Illumina data -- over 100x is excessive.
Dear masurca developpers, dear all First of all, thank you very much for developing this very nice suite, very usefull for the community :) I encounter a very long lag time of more that 15 days at the step creating unitgs, and would like to know if this is normal or not. Is it normal, or should I kill this job with less mate pairs data?
Your feedback would be much appreciated! Olivier A
Here is the log: armant-oli@cadux-genomik:/TMPLOCAL/olivier_data/Seq_nottrimmed/masurca$ cat nohup.out [Thu Jun 11 13:47:49 CEST 2020] Processing pe library reads [Thu Jun 11 17:20:33 CEST 2020] Processing sj library reads [Thu Jun 11 19:14:48 CEST 2020] Average PE read length 175 [Thu Jun 11 19:14:49 CEST 2020] Using kmer size of 67 for the graph [Thu Jun 11 19:14:49 CEST 2020] MIN_Q_CHAR: 33 [Thu Jun 11 19:14:50 CEST 2020] Creating mer database for Quorum [Thu Jun 11 21:58:17 CEST 2020] Error correct PE [Fri Jun 12 03:56:09 CEST 2020] Error correct JUMP [Fri Jun 12 06:22:37 CEST 2020] Estimating genome size [Fri Jun 12 08:52:17 CEST 2020] Estimated genome size: 5938548948 [Fri Jun 12 08:52:17 CEST 2020] Creating k-unitigs with k=67 [Fri Jun 12 18:36:45 CEST 2020] Creating k-unitigs with k=31 => k-unitgs running for more than 15 days
And here is the assembl.sh file, and the commands reached for now
!/bin/bash
assemble.sh generated by masurca
CONFIG_PATH="/TMPLOCAL/olivier_data/Seq_nottrimmed/masurca/config.mascura" CMD_PATH="/TMPLOCAL/MaSuRCA-3.4.1/bin/masurca" set -o pipefail
Test that we support <() redirection
(eval "cat <(echo test) >/dev/null" 2>/dev/null) || { echo >&2 "ERROR: The shell used is missing important features." echo >&2 " Run the assembly script directly as './$0'" exit 1 }
Parse command line switches
while getopts ":rc" o; do case "${o}" in c) echo "configuration file is '$CONFIG_PATH'" exit 0 ;; r) echo "Rerunning configuration" exec perl "$CMD_PATH" "$CONFIG_PATH" echo "Failed to rerun configuration" exit 1 ;; *) echo "Usage: $0 [-r] [-c]" exit 1 ;; esac done set +e
Set some paths and prime system to save environment variables
save () { (echo -n "$1=\""; eval "echo -n \"\$$1\""; echo '"') >> environment.sh } GC= RC= NC= if tty -s < /dev/fd/1 2> /dev/null; then GC='\e[0;32m' RC='\e[0;31m' NC='\e[0m' fi log () { d=$(date) echo -e "${GC}[$d]${NC} $@" } fail () { d=$(date) echo -e "${RC}[$d]${NC} $@" exit 1 } signaled () { fail Interrupted } trap signaled TERM QUIT INT rm -f environment.sh; touch environment.sh
To run tasks in parallel
run_bg () {
semaphore -j $NUMTHREADS --id masurca$$ -- "$@"
}
run_wait () {
semaphore -j $NUMTHREADS --id masurca$$ --wait
}
export PATH="/TMPLOCAL/MaSuRCA-3.4.1/bin/../CA8/Linux-amd64/bin:/TMPLOCAL/MaSuRCA-3.4.1/bin:$PATH" save PATH export PERL5LIB=/TMPLOCAL/MaSuRCA-3.4.1/bin/../lib/perl${PERL5LIB:+:$PERL5LIB} save PERL5LIB NUM_THREADS=96 save NUM_THREADS log 'Processing pe library reads' rm -rf meanAndStdevByPrefix.pe.txt echo 'p1 350 150' >> meanAndStdevByPrefix.pe.txt rename_filter_fastq 'p1' <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_350PE_250bp.1.fq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}') <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_350PE_250bp.2.fq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}' ) > 'p1.renamed.fastq' echo 'p2 550 150' >> meanAndStdevByPrefix.pe.txt rename_filter_fastq 'p2' <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_550PE_250bp.1.fq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}') <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_550PE_250bp.2.fq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}' ) > 'p2.renamed.fastq' echo 'p3 350 150' >> meanAndStdevByPrefix.pe.txt rename_filter_fastq 'p3' <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_350PE.R1.fastq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}') <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_350PE.R2.fastq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}' ) > 'p3.renamed.fastq' echo 'p4 550 150' >> meanAndStdevByPrefix.pe.txt rename_filter_fastq 'p4' <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_550PE.R1.fastq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}') <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_550PE.R2.fastq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}' ) > 'p4.renamed.fastq' log 'Processing sj library reads' rm -rf meanAndStdevByPrefix.sj.txt echo 'm1 500 100' >> meanAndStdevByPrefix.sj.txt rename_filter_fastq 'm1' <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_5kb.R1.fastq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}') <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_5kb.R2.fastq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}' ) > 'm1.renamed.fastq' echo 'm2 500 100' >> meanAndStdevByPrefix.sj.txt rename_filter_fastq 'm2' <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_8kb.R1.fastq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}') <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_8kb.R2.fastq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}' ) > 'm2.renamed.fastq' echo 'm3 500 100' >> meanAndStdevByPrefix.sj.txt rename_filter_fastq 'm3' <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_gelfree.R1.fastq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}') <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_gelfree.R2.fastq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}' ) > 'm3.renamed.fastq' echo 'm4 500 100' >> meanAndStdevByPrefix.sj.txt rename_filter_fastq 'm4' <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_5kb_250bp.1.fq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}') <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_5kb_250bp.2.fq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}' ) > 'm4.renamed.fastq' echo 'm5 500 100' >> meanAndStdevByPrefix.sj.txt rename_filter_fastq 'm5' <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_8kb_250bp.1.fq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}') <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_8kb_250bp.2.fq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}' ) > 'm5.renamed.fastq' echo 'm6 500 100' >> meanAndStdevByPrefix.sj.txt rename_filter_fastq 'm6' <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_gelfree_250bp.1.fq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}') <(exec expand_fastq '/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_gelfree_250bp.2.fq.gz' | awk '{if(length($0>250)) print substr($0,1,250); else print $0;}' ) > 'm6.renamed.fastq'
head -q -n 40000 p1.renamed.fastq p2.renamed.fastq p3.renamed.fastq p4.renamed.fastq | grep --text -v '^+' | grep --text -v '^@' > pe_data.tmp export PE_AVG_READ_LENGTH=
awk '{if(length($1)>31){n+=length($1);m++;}}END{print int(n/m)}' pe_data.tmp
save PE_AVG_READ_LENGTH log Average PE read length $PE_AVG_READ_LENGTH KMER=for f in p1.renamed.fastq p2.renamed.fastq p3.renamed.fastq p4.renamed.fastq;do head -n 80000 $f |tail -n 40000;done | perl -e 'while($line=<STDIN>){$line=<STDIN>;chomp($line);push(@lines,$line);for($i=0;$i<6;$i++){$line=<STDIN>;}}$min_len=100000;$base_count=0;foreach $l(@lines){$base_count+=length($l);push(@lengths,length($l));@f=split("",$l);foreach $base(@f){if(uc($base) eq "G" || uc($base) eq "C"){$gc_count++}}} @lengths =sort {$b <=> $a} @lengths; $min_len=$lengths[int($#lengths*.75)]; $gc_ratio=$gc_count/$base_count;$kmer=0;if($gc_ratio>=0.35 && $gc_ratio<=0.6){$kmer=int($min_len*.66);}else{$kmer=int($min_len*.33);} $kmer++ if($kmer%2==0); $kmer=31 if($kmer<31); $kmer=127 if($kmer>127); print $kmer'
save KMER log Using kmer size of $KMER for the graph KMER_J=31 save KMER_J MIN_Q_CHAR=cat p1.renamed.fastq p2.renamed.fastq p3.renamed.fastq p4.renamed.fastq |head -n 50000 | awk 'BEGIN{flag=0}{if($0 ~ /^\+/){flag=1}else if(flag==1){print $0;flag=0}}' | perl -ne 'BEGIN{$q0_char="@";}{chomp;@f=split "";foreach $v(@f){if(ord($v)<ord($q0_char)){$q0_char=$v;}}}END{$ans=ord($q0_char);if($ans<64){print "33\n"}else{print "64\n"}}'
save MIN_Q_CHAR log MIN_Q_CHAR: $MIN_Q_CHAR JF_SIZE=ls -l *.fastq | awk '{n+=$5}END{s=int(n/50); if(s>60000000000)printf "%.0f",s;else print "60000000000";}'
save JF_SIZE perl -e '{if(int('$JF_SIZE')>60000000000){print "WARNING: JF_SIZE set too low, increasing JF_SIZE to at least '$JF_SIZE', this automatic increase may be not enough!\n"}}' log Creating mer database for Quorum awk '{print substr($0,1,200)}' p1.renamed.fastq p2.renamed.fastq p3.renamed.fastq p4.renamed.fastq | quorum_create_database -t 96 -s $JF_SIZE -b 7 -m 24 -q $((MIN_Q_CHAR + 5)) -o quorum_mer_db.jf.tmp /dev/stdin && mv quorum_mer_db.jf.tmp quorum_mer_db.jf if [ 0 != 0 ]; then fail Increase JF_SIZE in config file, the recommendation is to set this to genome_size*coverage/2 filog Error correct PE
quorum_error_correct_reads -q $(($MIN_Q_CHAR + 40)) --contaminant=/TMPLOCAL/MaSuRCA-3.4.1/bin/../share/adapter.jf -m 1 -s 1 -g 1 -a 3 -t 96 -w 10 -e 3 -M quorum_mer_db.jf p1.renamed.fastq p2.renamed.fastq p3.renamed.fastq p4.renamed.fastq --no-discard -o pe.cor.tmp --verbose 1>quorum.err 2>&1 && mv pe.cor.tmp.fa pe.cor.fa || fail Error correction of PE reads failed. Check pe.cor.log.
log Error correct JUMP
quorum_error_correct_reads -q $(($MIN_Q_CHAR + 40)) --contaminant=/TMPLOCAL/MaSuRCA-3.4.1/bin/../share/adapter.jf -m 1 -s 1 -g 2 -a 3 -t 96 -w 10 -e 3 -M quorum_mer_db.jf m1.renamed.fastq m2.renamed.fastq m3.renamed.fastq m4.renamed.fastq m5.renamed.fastq m6.renamed.fastq --no-discard -o sj.cor.tmp --verbose 1>quorum.err 2>&1 && mv sj.cor.tmp.fa sj.cor.fa || fail Error correction of JUMP reads failed. Check sj.cor.log.
if [ -s ESTIMATED_GENOME_SIZE.txt ];then ESTIMATED_GENOME_SIZE=
head -n 1 ESTIMATED_GENOME_SIZE.txt
else log Estimating genome size jellyfish count -m 31 -t 96 -C -s $JF_SIZE -o k_u_hash_0 pe.cor.fa sj.cor.fa export ESTIMATED_GENOME_SIZE=jellyfish histo -t 96 -h 1 k_u_hash_0 | tail -n 1 |awk '{print $2}'
echo $ESTIMATED_GENOME_SIZE > ESTIMATED_GENOME_SIZE.txt fi save ESTIMATED_GENOME_SIZE log "Estimated genome size: $ESTIMATED_GENOME_SIZE"log Creating k-unitigs with k=$KMER create_k_unitigs_large_k -c $(($KMER-1)) -t 96 -m $KMER -n $(($ESTIMATED_GENOME_SIZE2)) -l $KMER -f
perl -e 'print 1/'$KMER'/1e5'
pe.cor.fa sj.cor.fa | grep --text -v '^>' | perl -ane '{$seq=$F[0]; $F[0]=~tr/ACTGactg/TGACtgac/;$revseq=reverse($F[0]); $h{($seq ge $revseq)?$seq:$revseq}=1;}END{$n=0;foreach $k(keys %h){print ">",$n++," length:",length($k),"\n$k\n"}}' > guillaumeKUnitigsAtLeast32bases_all.fasta.tmp && mv guillaumeKUnitigsAtLeast32bases_all.fasta.tmp guillaumeKUnitigsAtLeast32bases_all.fasta if [[ $KMER -eq $KMER_J ]];then ln -s guillaumeKUnitigsAtLeast32bases_all.fasta guillaumeKUnitigsAtLeast32bases_all.jump.fasta else log Creating k-unitigs with k=$KMER_J create_k_unitigs_large_k -c $(($KMER_J-1)) -t 96 -m $KMER_J -n $(($ESTIMATED_GENOME_SIZE2)) -l $KMER_J -fperl -e 'print 1/'$KMER_J'/1e5'
pe.cor.fa sj.cor.fa | grep --text -v '^>' | perl -ane '{$seq=$F[0]; $F[0]=~tr/ACTGactg/TGACtgac/;$revseq=reverse($F[0]); $h{($seq ge $revseq)?$seq:$revseq}=1;}END{$n=0;foreach $k(keys %h){print ">",$n++," length:",length($k),"\n$k\n"}}' > guillaumeKUnitigsAtLeast32bases_all.jump.fasta.tmp && mv guillaumeKUnitigsAtLeast32bases_all.jump.fasta.tmp guillaumeKUnitigsAtLeast32bases_all.jump.fasta fi