bcgsc / tigmint

⛓ Correct misassemblies using linked AND long reads
https://bcgsc.github.io/tigmint/
GNU General Public License v3.0
54 stars 13 forks source link

Tigmint's principle to break sequences #9

Closed dcopetti closed 6 years ago

dcopetti commented 6 years ago

Hello,

I am looking at the output of Tigmint and I would like to know better the rationale used to split sequences. In the attached example, tigmint broke the scaffold at 965,520 (red vertical line). The MP data (green and gray lines) is also iffy, so I agree with splitting it. But why not break it at the nearby gap (black line)? Contig construction and gap closing are more reliable steps than scaffolding, so to me the modifications should happen at the gaps (when possible). Blue bars are repeat, red are genes.

capture2

Then I have some more general questions/considerations. First, the output .bed file:

scaffold3872    0       965417  scaffold3872-1
scaffold3872    965417  965520  scaffold3872-2
scaffold3872    965520  3294149 scaffold3872-3

I see that the coordinates are 0-based, but why are some bases present on both broken scaffolds (pos 965417 and 965520 here)? It would be nice to have a 1-based, non overlapping output to use it to make subsets of the intervals (when working manually on a subset of sequences I mean).

Would it be possible to avoid having terminal Ns? E.g. adapting tigmint's breaking point to the nearest gap (if present within e.g. 1 kb) and exclude gap-only contigs in the output? I am not sure if this happens, but it would be a problem when submitting the broken scaffolds to NCBI - they don't like terminal Ns.

Is it possible to have a parameter for minimum contig size? E.g. with minctgsize=1000, if tigmint has evidence to break a scaffold in two regions 522 bp away, break only one (the one that has more support for misassembly). Or something like that. I often see many small contigs, even down to 1 bp!

In the broken scaffolds, what does the lowercase first and last base mean? thanks for the support, Dario

lcoombe commented 6 years ago

Hi Dario,

Tigmint scans the scaffolds with fixed windows to find areas where there are a small number of spanning molecules (ie. few chromium molecules span from the beginning to the end of the window). For regions that are correct, there will be a long string of windows where there are many spanning molecules ('hits'). Then, where there is a misassembly, there will be a string of windows where there are few spanning molecules ('misses'), followed by more hits when the window is past the area of concern. Our approach is to cut at the end of the last window before a string of 'misses', and then at the beginning of the first window after the string of 'misses'. The general idea is to ensure that the sequences flanking the cuts have good barcode support. A good way to visualize this would be to add a track in IGV for the chromium molecule extents (represented in BED format).

For the BED format question, take a look at the documentation describing the chromEnd, which should answer your concern about overlapping bases: https://genome.ucsc.edu/FAQ/FAQformat.html#format1

Terminal N's are stripped from the scaffolds. The only exception would be if the cuts are such that a sequence is empty - then a single N is added for the sequence.

Yes, the small contigs are expected from the cutting scheme that I've outlined above. I'll make a note of your suggestion to automatically filter out small contigs - in the meantime, though, a simple seqtk command could do that job for you.

I'm not sure what you are meaning by the lowercase first and last bases -- are the corresponding bases lowercase in your input assembly?

Hope that helps, and let me know if you have more questions! Lauren

dcopetti commented 6 years ago

Hi Lauren, Thanks for the informative answers, things are more clear now. The issue with the lowercase first and last base is actually how my assembly was written, I saw that in the small untouched scaffolds. The only part I still don't understand is this: Terminal N's are stripped from the scaffolds. The only exception would be if the cuts are such that a sequence is empty - then a single N is added for the sequence. can you make me an example for this case? Thanks,

Dario

lcoombe commented 6 years ago

Hi Dario, No problem!

So say this scaffold:

>1
ACTGTGTGCCGGGTTTGGGACNNNATCATGCATCAGCGCAGTCAGCATCAGC

was cut twice:

>1-1
ACTGTGTGCCGGGTTTGGGACNNN
>1-2
ATCA
>1-3
TGCATCAGCGCAGTCAGCATCAGC

There are terminal Ns in the first sequence, so tigmint will strip these off the sequence:

>1-1
ACTGTGTGCCGGGTTTGGGAC
>1-2
ATCA
>1-3
TGCATCAGCGCAGTCAGCATCAGC

We have seen corner cases where the cutting coordinates are such that we end up with an empty sequence: Ex:

>1-1
ACTGTGTGCCGGGTTTGGGAC
>1-2
NNN
>1-3
ATCATGCATCAGCGCAGTCAGCATCAGC

If we just strip the terminal N's as above, the sequence will be empty. Our workaround is to still include all sequence headers, but just print a single N to prevent any sequences from being empty. Empty sequences can break downstream tools (such as abyss-fac).

>1-1
ACTGTGTGCCGGGTTTGGGACNNNATCA
>1-2
N
>1-3
TGCATCAGCGCAGTCAGCATCAGC

Hope that clears things up! Lauren

lcoombe commented 6 years ago

Hi Dario -- I'll close this for now, but feel free to re-open if you have any further questions about how Tigmint breaks sequences!