a-ludi closed this issue 6 years ago
See issue #45.
The `.dam` files still contain scaffolding information. `DBshow -n db.dam` will reveal it in the FASTA headers. The following example shows two scaffolds: one with nine contigs and the other with one contig. The number of `n`s in the original sequence can be derived from the given coordinates.
>reference_mod/1/0_837550 RQ=0.850 :: Contig 0[0,8300]
>reference_mod/1/0_837550 RQ=0.850 :: Contig 1[12400,20750]
>reference_mod/1/0_837550 RQ=0.850 :: Contig 2[29200,154900]
>reference_mod/1/0_837550 RQ=0.850 :: Contig 3[159900,169900]
>reference_mod/1/0_837550 RQ=0.850 :: Contig 4[174900,200650]
>reference_mod/1/0_837550 RQ=0.850 :: Contig 5[203650,216400]
>reference_mod/1/0_837550 RQ=0.850 :: Contig 6[218900,235150]
>reference_mod/1/0_837550 RQ=0.850 :: Contig 7[238750,260150]
>reference_mod/1/0_837550 RQ=0.850 :: Contig 8[263650,837500]
>reference_mod/2/0_837550 RQ=0.850 :: Contig 0[0,1450]
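As a rough illustration of how the gap sizes could be recovered from such headers, here is a minimal Python sketch. It assumes the header layout seen in the example above and that the coordinates are half-open, so the gap between two neighbouring contigs is the next contig's begin minus the current contig's end. The names `HEADER_RE`, `parse_headers` and `gap_lengths` are hypothetical and not part of the dazzler tools.

```python
import re
from collections import defaultdict

# Assumed header layout, taken from the example headers:
#   >name/scaffold/begin_end RQ=... :: Contig <index>[<begin>,<end>]
HEADER_RE = re.compile(
    r"^>(?P<name>[^/\s]+)/(?P<scaffold>\d+)/\d+_\d+\b.*"
    r"::\s*Contig\s+(?P<contig>\d+)\[(?P<begin>\d+),(?P<end>\d+)\]"
)


def parse_headers(lines):
    """Group (contig index, begin, end) tuples by scaffold id."""
    scaffolds = defaultdict(list)
    for line in lines:
        m = HEADER_RE.match(line.strip())
        if m is None:
            continue  # not a header in the assumed layout
        scaffolds[int(m["scaffold"])].append(
            (int(m["contig"]), int(m["begin"]), int(m["end"]))
        )
    for contigs in scaffolds.values():
        contigs.sort()  # order contigs by their index within the scaffold
    return dict(scaffolds)


def gap_lengths(contigs):
    """Gap between consecutive contigs, assuming half-open coordinates."""
    return [nxt[1] - cur[2] for cur, nxt in zip(contigs, contigs[1:])]


HEADERS = [
    ">reference_mod/1/0_837550 RQ=0.850 :: Contig 0[0,8300]",
    ">reference_mod/1/0_837550 RQ=0.850 :: Contig 1[12400,20750]",
    ">reference_mod/1/0_837550 RQ=0.850 :: Contig 2[29200,154900]",
    ">reference_mod/2/0_837550 RQ=0.850 :: Contig 0[0,1450]",
]

scaffolds = parse_headers(HEADERS)
print(gap_lengths(scaffolds[1]))  # gaps between contigs 0-1 and 1-2
```

With the first three headers of scaffold 1, the gaps come out as 4100 and 8450 bases; scaffold 2 has a single contig and therefore no gaps.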
The input to this algorithm should be a set of scaffolds. However, the dazzler tools automatically split an input sequence wherever they encounter N's, so scaffolds are broken into contigs.
Expected Behaviour
This algorithm should preserve the scaffold structure of the input data, i.e. if contig X is known to be just before contig Y, then they should either be joined in that order or left as is.
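One possible reading of this requirement, sketched in Python: a proposed join between two contigs of the same scaffold is acceptable only if the right contig directly follows the left one in the input order. The function name `violates_scaffold_order` and the representation of joins as `(left, right)` pairs are assumptions made for illustration, not part of the existing code.

```python
def violates_scaffold_order(scaffold, joins):
    """Return the proposed joins that contradict the input scaffold order.

    scaffold: list of contig ids in their original order
    joins:    iterable of (left, right) contig ids to be joined
    """
    position = {contig: i for i, contig in enumerate(scaffold)}
    bad = []
    for left, right in joins:
        # Joins touching contigs outside this scaffold are not constrained here.
        if left in position and right in position:
            # Only "right directly follows left" matches the known order.
            if position[right] - position[left] != 1:
                bad.append((left, right))
    return bad


# Example: contigs 0..3 of one scaffold, with two joins that break the order.
print(violates_scaffold_order([0, 1, 2, 3], [(0, 1), (2, 1), (1, 3)]))
```

Here `(0, 1)` agrees with the scaffold, while `(2, 1)` reverses the order and `(1, 3)` skips contig 2, so both are reported.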
Bonus
If this algorithm has strong evidence that a scaffold is erroneous, that information should be emitted in a form that can be used for further processing.