Open pb-cdunn opened 8 years ago
@pb-cdunn, point me to the data, I hope to take a look.
I am not sure whether there is some overlooked off-by-one error in the code. The dynamical programming to look for a path for generating consensus may have more error-prone ends. If I recall correctly (again, I need to check the code for sure), fc_consensus did not trim the results explicitly. Namely, the errors at the end can be still the "natural" output of such boundary effect of the dynamic programming optimization process.
Run FALCON-examples/run/synth0
. The input is available via git-sym, but also internally at /pbi/dept/secondary/siv/testdata/hgap/synth5k/subreads.fasta
.
I am unable to reproduce this anymore. LAshow reports perfect alignements, and LA4Falcon -mo reports 100.00
instead of 99.45
. Maybe the other fix also fixed this, or maybe there was something fixed in daligner or LAshow.
The only thing that is slightly odd is in ctg_paths
(also seen when I first posted this report). When I use dazcon:
0 ctg_circular 000000005:B~000000001:B~000000005:B 000000005:B 5000 94050 000000005:B~000000001:B~000000005:B
1 ctg_circular 000000005:E~000000001:E~000000005:E 000000005:E 5000 94050 000000005:E~000000001:E~000000005:E
But when I use fc_consensus.py:
000000F ctg_linear 000000045:B~NA~000000007:B 000000045:B 5000 86086 000000045:B~NA~000000007:B|000000007:B~000000016:B~000000045:B
000000R ctg_linear 000000045:E~000000016:E~000000007:E 000000045:E 4873 86086 000000045:E~000000016:E~000000007:E|000000007:E~NA~000000045:E
There is one circular and one non-circular path. The wrong one is chosen for the opposite direction (2nd line), and 4873
should be 5000
.
I'll leave this open to investigate that further someday.
(UPDATE: Unable to repro, but see comment further down.)
I now know why one of my perfect contigs is shorter than the other in my 5k test. Some alignments have a 1-base deletion, due to polishing. Those differences can accumulate along a path. In fact, it is only by chance that my error-free subreads produce a perfect draft-assembly.
Real DNA has repeats, but a 5k random genome should have none. So polishing should be a no-op. However, consensus chops off a few bases from each read. That's fine, and in this case, fc_consensus chops off fewer than dazcon. However, fc_consensus then appends an erroneous base. dazcon does not exhibit this particular problem.
This error is on the order of 1 base per read, so it's below our accuracy threshold. I will look into polishing problems later. Not a priority.