bcgsc / ntJoin

đź”—Genome assembly scaffolder using minimizer graphs
GNU General Public License v3.0
81 stars 15 forks source link

Overlap detection and edge weight #115

Open tbusschau opened 3 weeks ago

tbusschau commented 3 weeks ago

Hi I'm working on scaffolding assembled contigs using ntLink and ntJoin. For this I have a set of long reads and four draft contig-level assemblies that were assembled using combinations of linked, short, and long reads. These assemblies are relatively complete ranging from 85% to 92% BUSCOs and contig N50 from 100kb to 800kb. I'm scaffolding the most contiguous draft using ntLink and then ntJoin with the remaining drafts as the references.

I like the overlap detection features in these tools, but I'm unsure how they differ between ntLink and ntJoin. My understanding is that ntLink detects overlaps that are supported by the reads, at least in most cases. In ntJoin it seems overlaps could be detected without additional support. For example, if I have a target assembly with weight=1 and keep the default edge weight (n=1), repeats at contig ends could cause some contigs to be joined eroneously. Is this assumption valid for my case? To prevent that, n needs to be >1 even when using draft assemblies as the reference?

If I wanted to take a conservative approach and condition scaffolding to joins that are supported by at least two references, does it make sense to do this? reference_weights='2 2 2' n=4

I hope that clear enough. Best, Theo

lcoombe commented 2 weeks ago

Hi Theo,

Yes, how overlaps themselves are detected in ntLink and ntJoin differ, since they use different input data, but how the overlaps are resolved is very similar under the hood for the tools. For ntLink, a scaffold graph is constructed (where nodes are contigs, edges between potentially adjacent contigs) using the read mapping information. The read mappings also give evidence for the gap distance between these contigs - if that is less than 0, then indicates a potential overlap. For ntJoin, we use minimizer (particular subsets of k-mers, or substrings of length k) positions in the references to infer a putative overlap.

In both cases, if there is an overlap detected (based on the read mappings in the case of ntLink or mappings between the adjacent sequences in the case of ntJoin), then more granular minimizers are used to map the ends to one another. Once that mapping is determined, then trimming is performed to remove the duplicated overlapping sequences, and merge them together.

For your example, we would not expect that join to happen in ntJoin - that is because only minimizers that are unique in each individual genome are retained. So, minimizers that fall in repeat regions would be expected to be found in at least 2 copies in a given genome, leading them to be filtered out. Indeed, in the case of ntJoin, the edge weight refers to the number of assemblies that support that join. For ntLink, the edges weight is the number of supporting reads.

Yes, if you want to require that at least 2 references support each join, your parameterization makes sense. Just keep in mind that if the references are not chromosome-level, it is possible that you have joins already in your assemblies that may not be in the references. If that's the case, using no_cut=True could be useful to avoid any breaks in your existing contigs, although the downside would be that any potential misassemblies in comparison with your references would not be corrected.

I hope that makes sense - let me know if you have any follow-up questions. Thank you your for interest in our tools! Lauren

tbusschau commented 2 weeks ago

Hi Lauren,

Thank you for the detailed reply. I understand now. I'm actually getting much better results running ntJoin iteratively with different references and no_cut=True. When I use scaffolded assemblies as the references there are huge gaps with well over 100kb N's introduced that significantly increase the assembly size. If the max gap size (G) is set, does it affect where joins are kept or only the number of N's printed for a gap?

lcoombe commented 2 weeks ago

Hi Theo,

Thanks for the feedback! I have also found for some runs doing those iterative runs yields better results - so in my experience it does indeed depend on the assembly, references used (contiguity/completeness), etc. Definitely go with whichever approach is giving you better results!

If you set the max gap size, it will only impact the maximum numbers of N's introduced for given join - the same joins will still be made. I implemented this feature because I saw similar behaviour in terms of gap sizes when using no_cut=True - the reason is that if a given contig is distributed between multiple different locations/paths, no_cut=True will simply choose the 'best' or most confident location of those, and the full contig will be placed there. In the other locations, the length of the contig that would have been there if no_cut=False is changed to N's. Hence, you can sometimes get these rather large N regions, and setting the max gap size will offset that.

tbusschau commented 2 weeks ago

Great! Thank you. I'm quite happy with the results I'm getting now. Just one more question, pure curiosity, then we can close this issue. Is there a proportion of a given contig allowed to overlap? For example; if more than 50% of a contig overlaps a larger contig, will it still be trimmed?

Best, Theo

lcoombe commented 2 weeks ago

Hi Theo,

Awesome, so glad that you're getting good results!

For the overlap trimming, as long as an overlap is detected, it will attempt to do the trimming regardless of the overlap size or proportion. If it's a very small overlap, there's a chance that ntJoin won't be able to resolve it due to the anchoring minimizer approach we use, but in that case you'd still have a gap (Ns) there, so it wouldn't be a contig misassembly or anything like that.