Large scaffolds (parts of scaffolds) not assigned

eriled commented 3 years ago

Hi, I am trying to improve my de novo assembly with ntJoin v1.0.3. For the most part, it seems to be doing what I had hoped, but I see some results that I do not understand. I have 2 assemblies - 1 long-read only Oxford Nanopore assembled with flye and polished with polca (using Illumina reads), and one hybrid assembly using Masurca (and the same data). My contig/scaffold count was 1740 for the flye assembly and 2760 for the hybrid assembly. When I use the longer assembly as the reference with ntJoin, I get 252 scaffolds, but in the unassigned scaffolds, I have several very large scaffolds (or parts of scaffolds) that are not being assigned (12M, 8M, 5M). I am just using the default parameters: ntJoin assemble t=4 target=hybrid_oneline.fa target_weight=1 references=Flye_PolcaCorrected_oneline.fa // reference_weights=2 k=32 w=1000 agp=True prefix=hybVflye

I had compared both assemblies with each other as well as separately to a reference genome of a closely related species using nucmer. I can see by comparing the 2 nucmer outputs that these large scaffold parts that are being put into the unassigned category seem to have large parts that match well to scaffolds in the other assembly. So, I do not understand why this is happening. Could it be that these larger scaffolds are misassembled and ntJoin will not break a target scaffold and put it in separate places? Or, if a scaffold is duplicated and there is already one target scaffold matching the reference?

If I could see the names of the reference and the target that are merged, it may help figure this out, but I do not see any way to match the input reference names with the target names once they are joined (the agp just gives the target names and an ntJoin name).

If you have any insights, I would appreciate it.

Thanks, Erica

lcoombe commented 3 years ago

Hi Erica,

With the options that you have specified, ntJoin will make cuts where one assembly is 'misassembled' with respect to the other assembly - so it can break contigs and incorporate the pieces in different paths. (no_cut=True turns off this behaviour so the target contigs will not be cut, and are incorporated into the best path)

A scaffold being unassigned could possibly happen for a few reasons:

Too many differences in the sequences of the target/reference
Are the contigs that are unassigned repetitive contigs, or perhaps covering the same region of the genome? ntJoin will only use minimizers/anchors in a contig that are unique (i.e. if a minimizer/anchor is duplicated in the assembly or the reference, it won't use it to map the assemblies). So, if contigs overlap a good deal (for whatever reason), any minimizers within the overlap could be seen as repetitive, and not used.
It is also possible that a lot of small, local misassemblies in one of the assemblies could prevent a scaffold from being assigned. ntJoin uses the positions of the minimizers/anchors to decide how to orient a contig. It looks to see if the positions of the minimizers that map to the reference are largely increasing/decreasing. By default, it needs 90% of the positions to be consistently increasing/decreasing. That's controlled by the parameter m -- if that's an issue you could try lowering that value to make that check less stringent.
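
To make the orientation check concrete, here is a minimal sketch of the idea, not ntJoin's actual code. It assumes the check counts adjacent pairs of mapped positions; the function name `orientation` and the argument `min_fraction` are hypothetical, with `min_fraction` standing in for the 90% default that the parameter m controls.

```python
def orientation(positions, min_fraction=0.9):
    """Guess a contig's orientation from the reference positions of its anchors.

    `positions` lists the reference coordinate of each anchor, in the order the
    anchors occur along the target contig. Returns "+" if at least
    `min_fraction` of adjacent steps increase, "-" if at least that many
    decrease, and None if neither direction is consistent enough.
    """
    steps = [b - a for a, b in zip(positions, positions[1:])]
    if not steps:
        return None  # fewer than two anchors: no direction to measure
    increasing = sum(1 for s in steps if s > 0) / len(steps)
    decreasing = sum(1 for s in steps if s < 0) / len(steps)
    if increasing >= min_fraction:
        return "+"
    if decreasing >= min_fraction:
        return "-"
    return None


forward = orientation(list(range(100, 1200, 100)))  # steadily increasing
shuffled = orientation([1, 5, 2, 8, 3, 9])          # many local reversals
relaxed = orientation([1, 5, 2, 8, 3, 9], min_fraction=0.5)
```

In this toy example the shuffled positions fail the strict threshold but pass a relaxed one, which is the effect lowering m is meant to have.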

Unfortunately, we track the names of the target sequences that are joined in the .path file and the agp file, but we don't track the names of the reference sequences that induce the joins. The minimizer graph (.mx.dot) would have the information in it, but it wouldn't be quite as easy to parse. This is the graph that ntJoin uses to make joins in the target - the nodes are minimizers, with edges between adjacent minimizers, and it's in graphviz format. The nodes (minimizers) are annotated with the reference/target scaffold names, so you could potentially search for the contig you were wondering about in that file, and see what reference sequence it corresponds with?
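
As a rough illustration of searching the graph file this way, the sketch below collects the sequences that share a node with a given contig. It assumes each annotated line carries `('name', position)` tuples, as in grep output from an .mx.dot file; the helper `partners` and the sample lines are illustrative, and the real file may need a different pattern.

```python
import re
from collections import defaultdict

# Matches annotations such as ('contig_713', 146835); this layout is an assumption.
TUPLE_RE = re.compile(r"\('([^']+)',\s*(\d+)\)")


def partners(lines, name):
    """Map each co-annotated sequence to (position in `name`, its own position)."""
    hits = defaultdict(list)
    for line in lines:
        pairs = [(seq, int(pos)) for seq, pos in TUPLE_RE.findall(line)]
        own = [pos for seq, pos in pairs if seq == name]
        if not own:
            continue  # this node does not mention the contig of interest
        for seq, pos in pairs:
            if seq != name:
                hits[seq].append((own[0], pos))
    return dict(hits)


SAMPLE = """\
('contig_713', 146835) ('scf7180000009665', 136462)"]
('contig_1061', 2968455) ('scf7180000009665', 12152318)"]
('contig_701', 1620756) ('scf7180000009665', 3061218)"]
('contig_136', 2134533) ('scf7180000009665', 7065610)"]
""".splitlines()

found = partners(SAMPLE, "scf7180000009665")
for seq, coords in sorted(found.items(), key=lambda kv: kv[1][0]):
    print(seq, coords)
```

Any iterable of lines works as input, so an open file handle for the .mx.dot file can be passed in place of the sample.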

I hope that helps -- let me know if you have any more questions, and thanks for your interest in ntJoin! Lauren

eriled commented 3 years ago

Hi, Thanks for the fast response. Can you please recommend a way for me to visualize the graph? I could not seem to figure out which software to use. I can see something like this from the mx.dot file (using grep -B1 'scf7180000009665'). There are more matches from grep, but these cover the unassembled part.

('contig_713', 146835) ('scf7180000009665', 136462)"]

('contig_1061', 2968455) ('scf7180000009665', 12152318)"]

('contig_701', 1620756) ('scf7180000009665', 3061218)"]

('contig_136', 2134533) ('scf7180000009665', 7065610)"]

These 4 contigs match the first 12.7M bases when I use nucmer to compare my two assemblies and scf7180000009665:0-12760390 is the part of the target contig that doesn't assemble. They seem to be connected, but I suppose there could be other reasons they don't end up together. I still don't really understand this.

In this case, my target scaffold is much larger than the reference scaffolds, would that cause a problem?

I don't see evidence of this from my nucmer alignment, but if 2 contigs are matching to the same place, what happens? (e.g. if it is a highly variable region like MHC)

Thanks for your help, Erica

lcoombe commented 3 years ago

Hi Erica,

I use neato or dot to visualize dot files -- although you would want to do some filtering on the graph prior to doing that. For example, filtering out particular nodes or a neighbourhood of a particular node.
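
One way to do that kind of filtering is sketched below, assuming the edges have already been read into pairs of node IDs (parsing the .mx.dot file is not shown). The function names and the toy node IDs are hypothetical; the output is a plain undirected Graphviz graph.

```python
from collections import deque


def neighbourhood(edges, start, radius):
    """Return the nodes within `radius` edges of `start` (breadth-first search)."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == radius:
            continue  # do not expand past the requested radius
        for nxt in adjacency.get(node, ()):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return set(distance)


def to_dot(edges, keep):
    """Write only the edges whose two endpoints are both in `keep`."""
    lines = ["graph G {"]
    for u, v in edges:
        if u in keep and v in keep:
            lines.append(f'  "{u}" -- "{v}";')
    lines.append("}")
    return "\n".join(lines)


toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
kept = neighbourhood(toy_edges, "c", 1)
small_dot = to_dot(toy_edges, kept)
print(small_dot)
```

The reduced text can be saved to a file and rendered with something like `neato -Tpng small.dot -o small.png`, which is much more manageable than drawing the whole graph.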

If a target scaffold is larger than the reference scaffolds, that wouldn't cause a problem, but if the reference doesn't provide any additional information of how that large scaffold should be joined to others, then it would not be incorporated into a path, and classified as 'unassigned'. Maybe I should be more clear about that -- scaffolds will only end up in 'assigned' if they are joined to one or more other contigs. If there is no information suggesting additional joins, they will be 'unassigned', but the files are concatenated into the 'all' file, so nothing is lost.

If two contigs align to exactly the same location, it depends on how much sequence variability is between them. If they are true duplicates, then the minimizers in those contigs will not be unique, and will be filtered out by ntJoin. If there is some base variability, it would be possible for some minimizers to pass that filter and potentially be used as scaffolding evidence.
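
The uniqueness filter can be pictured with a small sketch. This is an illustration only, not ntJoin's implementation: minimizers are represented as plain strings, `usable_anchors` is a hypothetical name, and requiring a minimizer to appear exactly once in both assemblies is an assumption about how an anchor links the two.

```python
from collections import Counter


def usable_anchors(target_mx, reference_mx):
    """Keep minimizers that occur exactly once in the target and once in the reference."""
    target_counts = Counter(target_mx)
    reference_counts = Counter(reference_mx)
    return {
        mx
        for mx, count in target_counts.items()
        if count == 1 and reference_counts.get(mx) == 1
    }


# "B" is duplicated in the target and "C" in the reference, so only "A" survives.
anchors = usable_anchors(["A", "B", "B", "C"], ["A", "B", "C", "C", "D"])
```

Two contigs that are exact duplicates contribute every minimizer twice, so none pass; a few base differences yield some distinct minimizers that can pass.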

Hope that makes sense! Lauren
