Understanding tigmint-long outputs

TromboneEngineer commented 1 year ago

Why does the output from tigmint-long contain more sequences in total than the input draft assembly? I counted FASTA headers with grep ">" draft.fa and grep ">" tigmint-long-out.fa, and was surprised that the count was significantly higher for the latter. I had expected contig reordering and gap closing to produce a more contiguous assembly, meaning fewer contigs. Is this expected behavior?

lcoombe commented 1 year ago

Hi @TromboneEngineer,

Tigmint-long on its own detects putative misassemblies and breaks the input assembly at those points. It doesn't do any contig scaffolding or gap-closing, so more sequences in the output than in the input is expected. For scaffolding, we suggest using our LongStitch pipeline (https://github.com/bcgsc/longstitch) for your overall assembly project: it runs Tigmint-long first, then uses the same long read data for scaffolding (with ntLink and optionally arks-long). The scaffolding step sounds like what you're thinking of in terms of joining contigs and closing gaps (ntLink can perform these functions).
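To see why cutting raises the sequence count, here is a toy sketch in Python. It is not Tigmint's code: the function break_contig, the cut coordinates, and the "-1", "-2" name suffixes are all hypothetical and chosen only for illustration. It shows that each cut inside a contig adds one sequence to the output.

```python
def break_contig(name, seq, cut_positions):
    """Split seq at each 0-based cut position and return (name, piece) pairs.

    Hypothetical helper for illustration; positions outside the contig
    (or at its ends) are ignored, and piece names get a numeric suffix.
    """
    inner = sorted({p for p in cut_positions if 0 < p < len(seq)})
    bounds = [0] + inner + [len(seq)]
    return [
        (f"{name}-{i}", seq[start:end])
        for i, (start, end) in enumerate(zip(bounds, bounds[1:]), start=1)
    ]


if __name__ == "__main__":
    # One draft contig with two made-up cut points becomes three sequences.
    for piece_name, piece in break_contig("contig1", "AAAACCCCGG", [4, 8]):
        print(piece_name, len(piece))
```

With k cuts inside a contig you get k + 1 pieces, so the header count can only stay the same or go up until a scaffolding step joins sequences again.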

Another thing to keep in mind is that, depending on the cuts made, you can end up with very small pieces in the output. So it's best to use a tool like abyss-fac to get a sense of the number of sequences above a length threshold (like 500 bp) as well as the overall change in contiguity (e.g. NG50).
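If it helps to see what those numbers mean, below is a minimal Python sketch of the two statistics mentioned above: the count of sequences at or above a length threshold, and N50/NG50. It is not abyss-fac's implementation; the function names, the 500 bp default, and the toy lengths are assumptions for illustration. NG50 is taken here as the length of the sequence at which the running total (longest first) reaches half of a supplied genome size; without a genome size the same loop gives N50 over the kept sequences.

```python
def fasta_lengths(lines):
    """Yield the length of each record from an iterable of FASTA lines."""
    length = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif length is not None:
            length += len(line)
    if length is not None:
        yield length


def summarize(lengths, min_len=500, genome_size=None):
    """Count sequences >= min_len and compute N50 (or NG50 if genome_size is given)."""
    kept = sorted((n for n in lengths if n >= min_len), reverse=True)
    total = sum(kept)
    target = (genome_size if genome_size else total) / 2
    running, nx50 = 0, None
    for n in kept:
        running += n
        if running >= target:
            nx50 = n
            break
    return {"n": len(kept), "sum": total, "nx50": nx50}


if __name__ == "__main__":
    draft = [1000, 600]          # toy contig lengths before cutting
    after_cut = [700, 300, 600]  # the 1000 bp contig cut into 700 + 300
    print(summarize(draft, genome_size=1600))
    print(summarize(after_cut, genome_size=1600))
```

In this toy case the raw count goes from 2 to 3 sequences, but the number at or above 500 bp stays at 2, which is why a thresholded count together with NG50 gives a fairer before/after comparison than counting headers. To run it on real files, pass an open file handle to fasta_lengths.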

TromboneEngineer commented 1 year ago

Thank you very much, I will take a look at the LongStitch pipeline. I do understand that contig count alone is not a comprehensive metric, and that NG50 and the number of contigs over a threshold length are just as important. It was probably my misunderstanding of what tigmint-long does that caused the confusion above.

lcoombe commented 1 year ago

Sounds good! Just let me know if you have any further questions about LongStitch - I handle the issues over at that repo as well.

bcgsc/tigmint, issue #120