alekseyzimin / masurca

restarting during consensus step #21

Closed EarlyEvol closed 6 years ago

EarlyEvol commented 6 years ago

I'm wondering how robust MaSuRCA is to being restarted after getting killed partway through an assembly. My cluster has a preemptable queue, so jobs can be killed and restarted if a higher-priority job gets assigned to the node they are running on.

I have three samples I'm assembling. The consensus step seems to take a ton of time, thus it has been preempted in 2/3 assemblies. The 5-consensus/ from the assembly that was not preempted has outputs like this:

[user@login3 assemblies]$ ls SAMPLE1_masurca.nxtrim/CA.mr.41.15.17.0.029/5-consensus|tail -n 20
genome.129.iid
genome_129.success
genome_130.cns.err
genome.130.fa
genome_130.fix.err
genome_130.fixes
genome.130.iid
genome_130.success
genome_131.cns.err
genome.131.fa
genome_131.fix.err
genome_131.fixes
genome.131.iid
genome_131.success
genome.fixes
genome.fixes.err
genome.partitioned
genome.partitioned.err
genome.sampling
genome.sampling.dat

Another is missing .fa .iid .lay files for the last 2 iterations and has some extra files for earlier iterations:

[user@login3 assemblies]$ ls SAMPLE2_masurca.nxtrim/CA.mr.41.15.17.0.029/5-consensus|tail -n 40
genome.130.iid
genome.130.lay
genome_130.success
genome.130.tmp.layout
genome_131.cns.err
genome.131.fa
genome.131.fasta
genome.131.fasta.qual
genome.131.fasta.qv
genome_131.fix.err
genome_131.fixes
genome.131.iid
genome.131.lay
genome_131.success
genome.131.tmp.layout
genome_132.cns.err
genome.132.fa
genome.132.fasta
genome.132.fasta.qual
genome.132.fasta.qv
genome_132.fix.err
genome_132.fixes
genome.132.iid
genome.132.lay
genome_132.success
genome.132.tmp.layout
genome_133.cns.err
genome_133.fix.err
genome_133.fixes
genome_133.success
genome_134.cns.err
genome_134.fix.err
genome_134.fixes
genome_134.success
genome.fixes
genome.fixes.err
genome.partitioned
genome.partitioned.err
genome.sampling
genome.sampling.dat

Finally, the third assembly is still running and has been stuck on making the final .cns.err for 40 hrs.

[earlm1@login3 assemblies]$ ll SAMPLE3_masurca.nxtrim/CA.mr.41.15.17.0.029/5-consensus|tail -n 20
-rw-r--r-- 1 earlm1 schlenke       0 May  1 16:49 genome_129.success
-rw-r--r-- 1 earlm1 schlenke   47970 May  1 16:49 genome_130.cns.err
-rw-r--r-- 1 earlm1 schlenke     269 May  1 16:49 genome_130.fix.err
-rw-r--r-- 1 earlm1 schlenke       0 May  1 16:49 genome_130.fixes
-rw-r--r-- 1 earlm1 schlenke       0 May  1 16:49 genome_130.success
-rw-r--r-- 1 earlm1 schlenke   43216 May  1 16:49 genome_131.cns.err
-rw-r--r-- 1 earlm1 schlenke     269 May  1 16:49 genome_131.fix.err
-rw-r--r-- 1 earlm1 schlenke       0 May  1 16:49 genome_131.fixes
-rw-r--r-- 1 earlm1 schlenke       0 May  1 16:49 genome_131.success
-rw-r--r-- 1 earlm1 schlenke   41547 May  1 16:49 genome_132.cns.err
-rw-r--r-- 1 earlm1 schlenke     269 May  1 16:49 genome_132.fix.err
-rw-r--r-- 1 earlm1 schlenke       0 May  1 16:49 genome_132.fixes
-rw-r--r-- 1 earlm1 schlenke       0 May  1 16:49 genome_132.success
-rw-r--r-- 1 earlm1 schlenke   18943 May  1 17:06 genome_133.cns.err
-rw-r--r-- 1 earlm1 schlenke    1972 May  1 17:44 genome_133.fix.err
-rw-r--r-- 1 earlm1 schlenke 7951113 May  1 17:44 genome_133.fixes
-rw-r--r-- 1 earlm1 schlenke       0 May  1 17:44 genome_133.success
-rw-r--r-- 1 earlm1 schlenke 1142416 May  3 14:27 genome_134.cns.err
-rw-r--r-- 1 earlm1 schlenke       0 Apr 24 17:37 genome.partitioned
-rw-r--r-- 1 earlm1 schlenke       0 Apr 24 16:59 genome.partitioned.err
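For anyone comparing several runs like this, a small script can flag partitions that have a .cns.err file but no matching .success file, which is what genome_134 looks like in the SAMPLE3 listing. This is only a rough sketch: the file-name pattern is taken from the listings in this thread, and reading a missing .success file as "not finished" is an assumption based on the name, not something confirmed by the MaSuRCA documentation. The function names are made up for the example.

```python
import re
from pathlib import Path

# Assumption: per-partition files are named as in the listings above,
# e.g. genome_134.cns.err and genome_134.success.
PARTITION_FILE = re.compile(r"^genome_(\d+)\.(cns\.err|success)$")


def unfinished_partitions(names):
    """Return partition numbers that have a .cns.err file but no .success file."""
    started, finished = set(), set()
    for name in names:
        match = PARTITION_FILE.match(name)
        if match is None:
            continue  # not a per-partition log or marker file
        number = int(match.group(1))
        if match.group(2) == "success":
            finished.add(number)
        else:
            started.add(number)
    return sorted(started - finished)


def scan_consensus_dir(path):
    """Apply unfinished_partitions() to the file names in a 5-consensus directory."""
    return unfinished_partitions(entry.name for entry in Path(path).iterdir())
```

Calling scan_consensus_dir() on a 5-consensus directory returns a sorted list of partition numbers; an empty list means every partition that wrote a .cns.err also has a .success file.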

The file is largely just a list of alignment failures.

[user@login3 assemblies]$ tail SAMPLE3_masurca.nxtrim/CA.mr.41.15.17.0.029/5-consensus/genome_134.cns.err
MultiAlignUnitig()-- failed to align fragment 59091605 in unitig 6690241.
MultiAlignUnitig()-- failed to align fragment 45901675 in unitig 6690241.
MultiAlignUnitig()-- failed to align fragment 43236611 in unitig 6690241.
MultiAlignUnitig()-- failed to align fragment 10799714 in unitig 6690241.
MultiAlignUnitig()-- failed to align fragment 8139984 in unitig 6690241.
MultiAlignUnitig()-- failed to align fragment 49529163 in unitig 6690241.
MultiAlignUnitig()-- failed to align fragment 36254081 in unitig 6690241.
MultiAlignUnitig()-- failed to align fragment 30386433 in unitig 6690241.
MultiAlignUnitig()-- failed to align fragment 13041716 in unitig 6690241.
MultiAlignUnitig()-- failed to align fragment 39923494 in unitig 6690241.
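Since tail only shows the last ten lines, it can help to tally the failures per unitig over the whole file, to see whether they are spread out or concentrated in a few unitigs. A possible sketch is below. The regular expression is written against the message format shown in this thread and may not cover other messages in the file; the function names are just for illustration.

```python
import re
from collections import Counter

# Message format as it appears in genome_134.cns.err in this thread.
FAILURE = re.compile(
    r"MultiAlignUnitig\(\)-- failed to align fragment (\d+) in unitig (\d+)\."
)


def failures_per_unitig(text):
    """Count 'failed to align fragment' messages for each unitig ID."""
    counts = Counter()
    for match in FAILURE.finditer(text):
        counts[int(match.group(2))] += 1
    return counts


def failures_in_file(path):
    """Read a .cns.err file and tally its alignment failures per unitig."""
    with open(path, errors="replace") as handle:
        return failures_per_unitig(handle.read())
```

failures_in_file(...).most_common(5) would then list the five unitigs with the most failed fragments.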

Question 1: Should I kill the SAMPLE3 assembly, remove the 5-consensus dir, and restart (and prevent it from getting preempted)?

Question 2: Should I trust the assembly of SAMPLE2? It eventually finished the consensus step and continued through to produce a scaffold file that doesn't seem obviously messed up.

Thanks, Earl

alekseyzimin commented 6 years ago

Consensus is the most sensitive part of the assembly. It is disk-intensive, so if possible it should be run on local storage. Consensus is generally run twice. The first pass uses pbutgcns, which computes consensus for long contigs with long reads; it is fast, but it does not support short reads. The second pass uses utgcns, which supports everything and skips the unitigs where consensus has already been done by pbutgcns. If the assembly is killed during the first pass, you may be stuck running the slow utgcns. In general it is not a good idea to kill the assembly during consensus. I advise removing 4-, 5- and genome.tigstore and re-generating/re-running assemble.sh.
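The cleanup described above could be scripted roughly as follows, with a dry run as the default so nothing is deleted by accident. This is a sketch under assumptions: it treats "4-" and "5-" as name prefixes of entries inside the CA.* directory, and it matches genome.tigstore case-insensitively in case the name on disk is capitalized differently. The function names are hypothetical and not part of MaSuRCA.

```python
import shutil
from pathlib import Path


def consensus_restart_targets(ca_dir):
    """List entries in ca_dir named 4-*, 5-* or genome.tigstore (any case)."""
    targets = []
    for entry in sorted(Path(ca_dir).iterdir()):
        name = entry.name
        if name.startswith(("4-", "5-")) or name.lower() == "genome.tigstore":
            targets.append(entry)
    return targets


def remove_targets(targets, dry_run=True):
    """Print what would be removed; delete only when dry_run is False."""
    for path in targets:
        if dry_run:
            print(f"would remove {path}")
        elif path.is_dir():
            shutil.rmtree(path)
        else:
            path.unlink()
```

Check the dry-run output first, then call remove_targets(targets, dry_run=False) and re-generate and re-run assemble.sh as described above.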

EarlyEvol commented 6 years ago

Ah, I see. Ok, I removed those directories and restarted the assemblies. Just curious, for the one that got killed and restarted during the consensus step but did end up finishing, is the output trustworthy? I restarted it anyway just in case. Thanks for all the info!

alekseyzimin commented 6 years ago

The output is fine, if the assembly finished.


EarlyEvol commented 6 years ago

Awesome, thanks so much!