Closed eriled closed 3 years ago
Hi Erica,
With the options that you have specified, ntJoin will make cuts where one assembly is 'misassembled' with respect to the other assembly, so it can break contigs and incorporate the pieces in different paths. (no_cut=True turns off this behaviour, so the target contigs are not cut and are incorporated into the best path.)
A scaffold being unassigned could possibly happen for a few reasons:
m
-- if that's an issue you could try lowering that value to make that check less stringent.
Unfortunately, we track the names of the target sequences that are joined in the .path file and the agp file, but we don't track the names of the reference sequences that induce the joins. The minimizer graph (.mx.dot) would have the information in it, but it wouldn't be quite as easy to parse. This is the graph that ntJoin uses to make joins in the target: the nodes are minimizers, with edges between adjacent minimizers, and it's in graphviz format. The nodes (minimizers) are annotated with the reference/target scaffold names, so you could search that file for the contig you were wondering about and see which reference sequence it corresponds to.
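As a rough illustration of that search, here is a minimal Python sketch. It assumes each node line carries name/position pairs written like ('contig_136', 2134533), which is the shape of the label fragment quoted in this thread; the actual .mx.dot layout may differ, and the function name partners_of is made up for this example.

```python
import re
from collections import Counter

# Assumed label fragment, modelled on the snippet quoted in this thread:
#   ('contig_136', 2134533) ('scf7180000009665', 7065610)
PAIR = re.compile(r"\('([^']+)',\s*(\d+)\)")


def partners_of(lines, query):
    """Count the sequence names that share a node line with `query`."""
    counts = Counter()
    for line in lines:
        names = [name for name, _pos in PAIR.findall(line)]
        if query in names:
            # Every other name on this node is a candidate partner sequence.
            counts.update(n for n in names if n != query)
    return counts
```

Passing an open file handle for the .mx.dot file as lines and a target scaffold name as query, then looking at most_common() on the result, would list the sequences that share the most minimizers with that scaffold.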
I hope that helps -- let me know if you have any more questions, and thanks for your interest in ntJoin! Lauren
('contig_136', 2134533) ('scf7180000009665', 7065610)"]
These 4 contigs match the first 12.7M bases when I use nucmer to compare my two assemblies and scf7180000009665:0-12760390 is the part of the target contig that doesn't assemble. They seem to be connected, but I suppose there could be other reasons they don't end up together. I still don't really understand this.
In this case, my target scaffold is much larger than the reference scaffolds, would that cause a problem?
I don't see evidence of this from my nucmer alignment, but if 2 contigs are matching to the same place, what happens? (e.g. if it is a highly variable region like MHC)
Thanks for your help, Erica
Hi Erica,
I use neato or dot to visualize dot files -- although you would want to do some filtering on the graph prior to doing that. For example, filtering out particular nodes or a neighbourhood of a particular node.
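One way to do that kind of neighbourhood filtering is sketched below in Python. It is only a sketch: it assumes the graph is undirected with one edge per line written as a -- b, it drops node labels and edge attributes, and the name subgraph_dot is made up for this example.

```python
import re
from collections import defaultdict, deque

# Matches an undirected edge line such as:  123 -- 456 [weight=1];
EDGE = re.compile(r'^\s*"?([^"\s\[]+)"?\s*--\s*"?([^"\s\[;]+)"?')


def subgraph_dot(dot_lines, start, radius=2):
    """Return a small dot graph holding the nodes within `radius` edges of `start`."""
    edges = []
    adj = defaultdict(set)
    for line in dot_lines:
        match = EDGE.match(line)
        if match:
            a, b = match.groups()
            edges.append((a, b))
            adj[a].add(b)
            adj[b].add(a)

    # Breadth-first search outward from the start node, stopping at `radius`.
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == radius:
            continue
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)

    # Keep only the edges whose two ends are both inside the neighbourhood.
    kept = [(a, b) for a, b in edges if a in dist and b in dist]
    body = "\n".join(f'  "{a}" -- "{b}";' for a, b in kept)
    return "graph neighbourhood {\n" + body + "\n}\n"
```

Writing the returned string to a file such as sub.dot and running neato -Tpng sub.dot -o sub.png should then give a picture small enough to read.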
If a target scaffold is larger than the reference scaffolds, that wouldn't cause a problem, but if the reference doesn't provide any additional information of how that large scaffold should be joined to others, then it would not be incorporated into a path, and classified as 'unassigned'. Maybe I should be more clear about that -- scaffolds will only end up in 'assigned' if they are joined to one or more other contigs. If there is no information suggesting additional joins, they will be 'unassigned', but the files are concatenated into the 'all' file, so nothing is lost.
If two contigs align to exactly the same location, it depends on how much sequence variability is between them. If they are true duplicates, then the minimizers in those contigs will not be unique, and will be filtered out by ntJoin. If there is some base variability, it would be possible for some minimizers to pass that filter and potentially be used as scaffolding evidence.
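To make the uniqueness filter concrete, here is a toy Python sketch. It is not ntJoin's implementation: it picks the lexicographically smallest k-mer in each window instead of using a hash, ignores reverse complements, and uses tiny k and w values, so treat it as an illustration of the idea only.

```python
from collections import Counter


def minimizers(seq, k, w):
    """Return the set of (kmer, position) minimizers, one pick per window of w k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(max(len(kmers) - w + 1, 0)):
        window = kmers[start:start + w]
        # Toy ordering: smallest k-mer alphabetically (a real tool would hash).
        offset = min(range(len(window)), key=window.__getitem__)
        picked.add((window[offset], start + offset))
    return picked


def unique_minimizers(contigs, k=4, w=3):
    """Keep only minimizers whose k-mer is picked exactly once across all contigs."""
    per_contig = {name: minimizers(seq, k, w) for name, seq in contigs.items()}
    counts = Counter(kmer for picks in per_contig.values() for kmer, _pos in picks)
    return {
        name: sorted(p for p in picks if counts[p[0]] == 1)
        for name, picks in per_contig.items()
    }
```

With two identical contigs, every minimizer is picked at least twice, so nothing survives the filter. Once the copies differ at some bases, minimizers from windows covering those bases can become unique and pass through.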
Hope that makes sense! Lauren
Hi, I am trying to improve my de novo assembly with ntJoin v1.0.3. For the most part, it seems to be doing what I had hoped, but I have some results that I do not understand. I have 2 assemblies: one long-read-only Oxford Nanopore assembly made with flye and polished with polca (using Illumina reads), and one hybrid assembly made with Masurca (from the same data). My contig/scaffold count was 1740 for the flye assembly and 2760 for the hybrid assembly. When I use the longer assembly as the reference with ntJoin, I get 252 scaffolds, but among the unassigned scaffolds there are several very large scaffolds (or parts of scaffolds) that are not being assigned (12M, 8M, 5M). I am just using the default parameters:
ntJoin assemble t=4 target=hybrid_oneline.fa target_weight=1 references=Flye_PolcaCorrected_oneline.fa // reference_weights=2 k=32 w=1000 agp=True prefix=hybVflye
I had compared both assemblies with each other as well as separately to a reference genome of a closely related species using nucmer. I can see by comparing the 2 nucmer outputs that these large scaffold parts that are being put into the unassigned category seem to have large parts that match well to scaffolds in the other assembly. So, I do not understand why this is happening. Could it be that these larger scaffolds are misassembled and ntJoin will not break a target scaffold and put it in separate places? Or, if a scaffold is duplicated and there is already one target scaffold matching the reference?
If I could see the names of the reference and the target that are merged, it may help figure this out, but I do not see any way to match the input reference names with the target names once they are joined (the agp just gives the target names and an ntJoin name).
If you have any insights, I would appreciate it.
Thanks, Erica