a-ludi / djunctor

Close assembly gaps using long-reads with focus on correctness.
MIT License

Preserve input scaffolds #9

Closed a-ludi closed 6 years ago

a-ludi commented 6 years ago

The input to this algorithm should be a set of scaffolds. However, the dazzler tools automatically split an input sequence wherever they encounter N's, so scaffolds are broken into contigs.

Expected Behaviour

This algorithm should preserve the scaffold structure of the input data, i.e. if contig X is known to be just before contig Y, then they should be joined in that order or left as is.

Bonus

If this algorithm has strong evidence for an erroneous scaffold, that information should be emitted so that it can be used for further processing.

a-ludi commented 6 years ago

See issue #45.

a-ludi commented 6 years ago

The .dam files still contain the scaffolding information. DBshow -n db.dam will reveal it in the FASTA headers. The following example shows two scaffolds: one with nine contigs and the other with one contig. The number of N's in the original sequence can be derived from the given coordinates.

>reference_mod/1/0_837550 RQ=0.850 :: Contig 0[0,8300]
>reference_mod/1/0_837550 RQ=0.850 :: Contig 1[12400,20750]
>reference_mod/1/0_837550 RQ=0.850 :: Contig 2[29200,154900]
>reference_mod/1/0_837550 RQ=0.850 :: Contig 3[159900,169900]
>reference_mod/1/0_837550 RQ=0.850 :: Contig 4[174900,200650]
>reference_mod/1/0_837550 RQ=0.850 :: Contig 5[203650,216400]
>reference_mod/1/0_837550 RQ=0.850 :: Contig 6[218900,235150]
>reference_mod/1/0_837550 RQ=0.850 :: Contig 7[238750,260150]
>reference_mod/1/0_837550 RQ=0.850 :: Contig 8[263650,837500]
>reference_mod/2/0_837550 RQ=0.850 :: Contig 0[0,1450]
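
To illustrate how the gap sizes could be recovered from these headers, here is a minimal Python sketch. It is not part of djunctor; the names `HEADER_RE`, `parse_scaffolds` and `gap_lengths` are made up for this example. It assumes that the header text before the first space identifies the scaffold and that the coordinates are 0-based, half-open intervals, so the number of N's between two neighbouring contigs is the begin of the later one minus the end of the earlier one.

```python
import re
from collections import defaultdict

# Headers copied from the DBshow -n output above.
EXAMPLE_HEADERS = """\
>reference_mod/1/0_837550 RQ=0.850 :: Contig 0[0,8300]
>reference_mod/1/0_837550 RQ=0.850 :: Contig 1[12400,20750]
>reference_mod/1/0_837550 RQ=0.850 :: Contig 2[29200,154900]
>reference_mod/1/0_837550 RQ=0.850 :: Contig 3[159900,169900]
>reference_mod/1/0_837550 RQ=0.850 :: Contig 4[174900,200650]
>reference_mod/1/0_837550 RQ=0.850 :: Contig 5[203650,216400]
>reference_mod/1/0_837550 RQ=0.850 :: Contig 6[218900,235150]
>reference_mod/1/0_837550 RQ=0.850 :: Contig 7[238750,260150]
>reference_mod/1/0_837550 RQ=0.850 :: Contig 8[263650,837500]
>reference_mod/2/0_837550 RQ=0.850 :: Contig 0[0,1450]
""".splitlines()

# Assumed header layout: ">SCAFFOLD_ID ... :: Contig INDEX[BEGIN,END]"
HEADER_RE = re.compile(
    r"^>(?P<scaffold>\S+)\s.*::\s*Contig\s+(?P<index>\d+)"
    r"\[(?P<begin>\d+),(?P<end>\d+)\]\s*$"
)


def parse_scaffolds(header_lines):
    """Group contigs by scaffold ID as sorted (index, begin, end) tuples."""
    scaffolds = defaultdict(list)
    for line in header_lines:
        match = HEADER_RE.match(line.strip())
        if match is None:
            continue  # skip lines that are not contig headers
        scaffolds[match["scaffold"]].append(
            (int(match["index"]), int(match["begin"]), int(match["end"]))
        )
    for contigs in scaffolds.values():
        contigs.sort()  # order contigs by their index within the scaffold
    return dict(scaffolds)


def gap_lengths(contigs):
    """Return the gap size between each pair of neighbouring contigs.

    Assumes half-open [begin, end) coordinates.
    """
    return [nxt[1] - cur[2] for cur, nxt in zip(contigs, contigs[1:])]


scaffolds = parse_scaffolds(EXAMPLE_HEADERS)
```

Under these assumptions, `gap_lengths(scaffolds["reference_mod/1/0_837550"])` gives 4100, 8450, 5000, 5000, 3000, 2500, 3600 and 3500 for the eight gaps of the first scaffold, and an empty list for the single-contig scaffold.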