Look into possible error

pb-cdunn commented 8 years ago

(UPDATE: Unable to repro, but see comment further down.)

I now know why one of my perfect contigs is shorter than the other in my 5k test. Some alignments have a 1-base deletion, due to polishing. Those differences can accumulate along a path. In fact, it is only by chance that my error-free subreads produce a perfect draft-assembly.

Real DNA has repeats, but a 5k random genome should have none. So polishing should be a no-op. However, consensus chops off a few bases from each read. That's fine, and in this case, fc_consensus chops off fewer than dazcon. However, fc_consensus then appends an erroneous base. dazcon does not exhibit this particular problem.

This error is on the order of 1 base per read, so it's below our accuracy threshold. I will look into polishing problems later. Not a priority.

$ cat 2-asm-falcon/ctg_paths
     0 ctg_circular 000000007:B~000000034:B~000000007:B 000000007:B 5000 94449 000000007:B~000000034:B~000000007:B
     1 ctg_circular 000000007:E~000000034:E~000000007:E 000000007:E 4994 94449 000000007:E~000000034:E~000000007:E

$ LA4Falcon -mo ...
000000007 000000009 -183 99.45  0  1807  1989  1989     0     0   183  1989 overlap
000000009 000000007 -182 99.45  0     0   183  1989     0  1807  1989  1989 overlap
...

  8   10 n   [1,807..1,989] x [    0..  183] :   =     1 diffs  (1 trace pts)

     1797 caaaaaagccagagcaatccccatcttgcgggttgtattgcgctagagccaccacacgcaaaacgagcctggtttcgtactgtggtggaggctgactgcc
                    ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
        0 ..........agagcaatccccatcttgcgggttgtattgcgctagagccaccacacgcaaaacgagcctggtttcgtactgtggtggaggctgactgcc   0.0%

     1897 tagcctgttggttactcggcctcaacaagcgatccggatgttgatatttctcacccgccgcggtcatttcgaacctgattcttctctctcc-a.......
          |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||*|
       90 tagcctgttggttactcggcctcaacaagcgatccggatgttgatatttctcacccgccgcggtcatttcgaacctgattcttctctctccgattggact   1.1%

          ...

          tta
 10    8 n   [    0..  183] x [1,807..1,989] :   =     1 diffs  (1 trace pts)

        0 ..........agagcaatccccatcttgcgggttgtattgcgctagagccaccacacgcaaaacgagcctggtttcgtactgtggtggaggctgactgcc
                    ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
     1797 caaaaaagccagagcaatccccatcttgcgggttgtattgcgctagagccaccacacgcaaaacgagcctggtttcgtactgtggtggaggctgactgcc   0.0%

       90 tagcctgttggttactcggcctcaacaagcgatccggatgttgatatttctcacccgccgcggtcatttcgaacctgattcttctctctccgattggact
          |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||*|
     1897 tagcctgttggttactcggcctcaacaagcgatccggatgttgatatttctcacccgccgcggtcatttcgaacctgattcttctctctcc-a.......   1.1%

          tta

          ...

pb-jchin commented 8 years ago

@pb-cdunn, point me to the data, I hope to take a look.

I am not sure whether there is some overlooked off-by-one error in the code. The dynamical programming to look for a path for generating consensus may have more error-prone ends. If I recall correctly (again, I need to check the code for sure), fc_consensus did not trim the results explicitly. Namely, the errors at the end can be still the "natural" output of such boundary effect of the dynamic programming optimization process.

pb-cdunn commented 8 years ago

Run FALCON-examples/run/synth0. The input is available via git-sym, but also internally at /pbi/dept/secondary/siv/testdata/hgap/synth5k/subreads.fasta.

pb-cdunn commented 8 years ago

I am unable to reproduce this anymore. LAshow reports perfect alignements, and LA4Falcon -mo reports 100.00 instead of 99.45. Maybe the other fix also fixed this, or maybe there was something fixed in daligner or LAshow.

The only thing that is slightly odd is in ctg_paths (also seen when I first posted this report). When I use dazcon:

     0 ctg_circular 000000005:B~000000001:B~000000005:B 000000005:B 5000 94050 000000005:B~000000001:B~000000005:B
     1 ctg_circular 000000005:E~000000001:E~000000005:E 000000005:E 5000 94050 000000005:E~000000001:E~000000005:E

But when I use fc_consensus.py:

000000F ctg_linear 000000045:B~NA~000000007:B 000000045:B 5000 86086 000000045:B~NA~000000007:B|000000007:B~000000016:B~000000045:B
000000R ctg_linear 000000045:E~000000016:E~000000007:E 000000045:E 4873 86086 000000045:E~000000016:E~000000007:E|000000007:E~NA~000000045:E

There is one circular and one non-circular path. The wrong one is chosen for the opposite direction (2nd line), and 4873 should be 5000.

I'll leave this open to investigate that further someday.

PacificBiosciences / FALCON

Look into possible error #331