alekseyzimin / masurca

GNU General Public License v3.0

ERROR: failed to merge alignments at position 290 #194

Open dcopetti opened 3 years ago

dcopetti commented 3 years ago

Hello, I am running MaSuRCA on a plant genome with PacBio and Illumina data as input. I just noticed this error message:

$ ./assemble.sh
[Mon Sep 21 16:13:10 CEST 2020] Processing pe library reads
[Mon Sep 21 17:05:30 CEST 2020] Average PE read length 100
[Mon Sep 21 17:05:30 CEST 2020] Using kmer size of 67 for the graph
[Mon Sep 21 17:05:31 CEST 2020] MIN_Q_CHAR: 33
[Mon Sep 21 17:05:31 CEST 2020] Creating mer database for Quorum
[Mon Sep 21 17:46:52 CEST 2020] Error correct PE
[Mon Sep 21 19:20:01 CEST 2020] Estimating genome size
[Mon Sep 21 19:47:25 CEST 2020] Estimated genome size: 1027848933
[Mon Sep 21 19:47:25 CEST 2020] Creating k-unitigs with k=67
[Mon Sep 21 21:22:01 CEST 2020] Computing super reads from PE
[Tue Sep 22 06:28:07 CEST 2020] Using CABOG from /home/copettid/anaconda3/envs/masurca_env/bin/../CA8/Linux-amd64/bin
[Tue Sep 22 06:28:07 CEST 2020] Running mega-reads correction/assembly
[Tue Sep 22 06:28:07 CEST 2020] Using mer size 15 for mapping, B=17, d=0.029
[Tue Sep 22 06:28:07 CEST 2020] Estimated Genome Size 1027848933
[Tue Sep 22 06:28:07 CEST 2020] Estimated Ploidy 1
[Tue Sep 22 06:28:07 CEST 2020] Using 40 threads
[Tue Sep 22 06:28:07 CEST 2020] Output prefix mr.41.15.17.0.029
[Tue Sep 22 06:28:07 CEST 2020] Pacbio coverage <25x, using the longest subreads
[Tue Sep 22 06:38:05 CEST 2020] Reducing super-read k-mer size
[Tue Sep 22 07:47:00 CEST 2020] Mega-reads pass 1
[Tue Sep 22 07:47:00 CEST 2020] Running locally in 1 batch
Processed 500000 super reads, irreducible 465700, processing 336 super reads per second
Processed 1000000 super reads, irreducible 894191, processing 719 super reads per second
Processed 1500000 super reads, irreducible 1277225, processing 965 super reads per second
Processed 2000000 super reads, irreducible 1617361, processing 1138 super reads per second
Processed 2500000 super reads, irreducible 1948162, processing 1259 super reads per second
Processed 3000000 super reads, irreducible 2308270, processing 1302 super reads per second
Processed 3500000 super reads, irreducible 2703721, processing 1160 super reads per second
Processed 4000000 super reads, irreducible 3128906, processing 1030 super reads per second
Processed 4500000 super reads, irreducible 3577895, processing 866 super reads per second
Processed 5000000 super reads, irreducible 4044018, processing 974 super reads per second
Processed 5500000 super reads, irreducible 4519517, processing 1340 super reads per second
Processed 6000000 super reads, irreducible 4994571, processing 1779 super reads per second
[Sat Oct  3 12:45:59 CEST 2020] Mega-reads pass 2
[Sat Oct  3 12:45:59 CEST 2020] Running locally in 1 batch
[Sun Oct 11 23:32:01 CEST 2020] Refining alignments
[Mon Oct 12 01:35:56 CEST 2020] Joining
[Mon Oct 12 02:16:48 CEST 2020] Gap consensus
ERROR: failed to merge alignments at position 290
       Please file a bug report

Is this something I should worry about at the moment? I would like to make sure that at least the mega-reads are produced correctly. Thanks, Dario

dcopetti commented 3 years ago

Hello,

The job eventually completed the mega-reads step, and it now appears to be running the Celera Assembler. The stdout shows this:

ERROR: failed to merge alignments at position 3355
       Please file a bug report
ERROR: failed to merge alignments at position 327
       Please file a bug report
ERROR: failed to merge alignments at position 468
       Please file a bug report
cat: merges.1.txt: No such file or directory
cat: merges.2.txt: No such file or directory
cat: merges.4.txt: No such file or directory
cat: merges.7.txt: No such file or directory
[...]
cat: merges.57.txt: No such file or directory
cat: merges.58.txt: No such file or directory
cat: merges.62.txt: No such file or directory
cat: merges.64.txt: No such file or directory
cat: merges.66.txt: No such file or directory
cat: merges.70.txt: No such file or directory
cat: merges.74.txt: No such file or directory
cat: merges.77.txt: No such file or directory
cat: merges.78.txt: No such file or directory
[Wed Oct 28 17:42:09 CET 2020] Warning! Some or all gap consensus jobs failed, see files in mr.41.15.17.0.029.join_consensus.tmp, proceeding anyway, to rerun gap consensus erase mr.41.15.17.0.029.1.fa and re-run assemble.sh
[Wed Oct 28 17:44:26 CET 2020] Generating assembly input files
[Wed Oct 28 22:30:19 CET 2020] Coverage threshold for splitting unitigs is 35 minimum ovl 250
[Wed Oct 28 22:30:19 CET 2020] Running assembly

Is there something here I should worry about? During the mega-reads step there were many lines like the ERROR messages at the top of this output.
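To put a number on "many lines", the two message types quoted above can be tallied from a saved copy of the stdout. This is only a minimal sketch: the patterns are copied from the messages in this thread, while `summarize_log` and the idea of saving the stdout to a file are assumptions, not part of MaSuRCA.

```python
import re

# Patterns copied from the messages quoted in this thread.
MERGE_ERROR = re.compile(r"ERROR: failed to merge alignments at position (\d+)")
MISSING_FILE = re.compile(r"cat: (merges\.\d+\.txt): No such file or directory")


def summarize_log(lines):
    """Return (error positions, missing merges files) found in the log lines.

    `summarize_log` is a hypothetical helper for illustration only.
    """
    positions = []
    missing = []
    for line in lines:
        match = MERGE_ERROR.search(line)
        if match:
            positions.append(int(match.group(1)))
            continue
        match = MISSING_FILE.search(line)
        if match:
            missing.append(match.group(1))
    return positions, missing


if __name__ == "__main__":
    # A few lines taken from the output above; in practice, pass an open
    # file handle for the saved stdout instead of this list.
    sample = [
        "ERROR: failed to merge alignments at position 3355",
        "       Please file a bug report",
        "ERROR: failed to merge alignments at position 327",
        "cat: merges.1.txt: No such file or directory",
        "cat: merges.2.txt: No such file or directory",
    ]
    positions, missing = summarize_log(sample)
    print(len(positions), "merge errors;", len(missing), "missing merges files")
```

The counts only describe how often the messages occur; they do not say whether the resulting assembly is affected.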

Also, I noticed that both now and earlier (while nucmer was running for the mega-reads, I guess) the job was using about double the number of CPUs I gave it (70-80 when I gave it 40). Is that normal? The sysadmins are not happy to see such a high load (the machine has 48 cores).

Could the file called mr.41.15.17.0.029.1.fa contain the mega-reads? Given its size and header structure, I would guess so. I was expecting Flye to be running by now; does it come later? I am using MaSuRCA version 3.3.9.

Thanks,

Dario
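For the question above about mr.41.15.17.0.029.1.fa, a quick way to inspect "size and header structure" is to count the records and bases and print the first few headers. This is a rough sketch that assumes the file is plain, uncompressed FASTA; `fasta_summary` is a hypothetical helper, and the summary alone does not confirm what role the file plays in the pipeline.

```python
import sys


def fasta_summary(handle, n_headers=3):
    """Count records and bases in a FASTA stream and keep the first few headers."""
    records = 0
    bases = 0
    headers = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            records += 1
            if len(headers) < n_headers:
                headers.append(line[1:])
        else:
            bases += len(line)
    return records, bases, headers


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python fasta_summary.py mr.41.15.17.0.029.1.fa
    with open(sys.argv[1]) as fh:
        records, bases, headers = fasta_summary(fh)
    print("records:", records)
    print("total bases:", bases)
    for header in headers:
        print("header:", header)
```

Comparing the total base count with the estimated genome size printed in the log gives a rough sense of how much coverage the file represents.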

sarahshah commented 3 years ago

I am getting the same kind of error with MaSuRCA v4.0.1, running it on a large eukaryotic genome (dataset is a combo of PacBio, Nanopore, and Illumina reads):

[Wed Feb 10 09:39:10 AEST 2021] Processing pe library reads
[Wed Feb 10 09:39:10 AEST 2021] Average PE read length 148
[Wed Feb 10 09:39:10 AEST 2021] Using kmer size of 99 for the graph
cat: write error: Broken pipe
[Wed Feb 10 09:39:10 AEST 2021] MIN_Q_CHAR: 33
[Wed Feb 10 09:39:10 AEST 2021] Estimated genome size: 1485293479
[Wed Feb 10 09:39:10 AEST 2021] Creating k-unitigs with k=99
[Wed Feb 10 15:49:52 AEST 2021] Computing super reads from PE 
[Thu Feb 11 05:24:03 AEST 2021] Using CABOG from /gpfs1/homes/s4255161/MaSuRCA-4.0.1/bin/../CA8/Linux-amd64/bin
[Thu Feb 11 05:24:03 AEST 2021] Running mega-reads correction/assembly
[Thu Feb 11 05:24:03 AEST 2021] Using mer size 17 for mapping, B=12, d=0.02
[Thu Feb 11 05:24:03 AEST 2021] Estimated Genome Size 1485293479
[Thu Feb 11 05:24:03 AEST 2021] Estimated Ploidy 1
[Thu Feb 11 05:24:03 AEST 2021] Using 24 threads
[Thu Feb 11 05:24:03 AEST 2021] Output prefix mr.41.17.12.0.02

gzip: stdout: Broken pipe
[Thu Feb 11 05:24:03 AEST 2021] Pre-correction and initial filtering of the long reads
[Thu Feb 11 20:49:47 AEST 2021] Reducing super-read k-mer size
[Thu Feb 11 22:00:41 AEST 2021] Computing mega-reads
[Thu Feb 11 22:00:41 AEST 2021] Running locally in 1 batch
[Mon Feb 15 03:09:59 AEST 2021] Refining alignments
ERROR: failed to merge alignments at position 482
       Please file a bug report
[Mon Feb 15 04:24:38 AEST 2021] Computing allowed merges
[Mon Feb 15 04:27:58 AEST 2021] Joining
[Mon Feb 15 04:38:02 AEST 2021] Gap consensus
[Mon Feb 15 05:10:12 AEST 2021] Generating assembly input files
[Mon Feb 15 08:01:48 AEST 2021] Coverage threshold for splitting unitigs is 37 minimum ovl 499
[Mon Feb 15 08:01:48 AEST 2021] Running assembly

My job hasn't completed yet, but just as in @dcopetti's run, it moved on from the error and started running the assembly. Should we be concerned?