chanzuckerberg / shasta

[MOVED] Moved to paoloshasta/shasta. De novo assembly from Oxford Nanopore reads

Good assembly result but many <100bp contigs #234

Closed ASLeonard closed 3 years ago

ASLeonard commented 3 years ago

Hi, I've been getting some great results with Shasta, but I have some questions about the generation of many small contigs. This assembly was done on haplotype-binned reads of a 2.7 Gb mammal at about 30x coverage, with the Sep2020 config.

I've included a log-scaled histogram of the segment lengths present in Assembly.fasta. Nearly 97% of the contigs account for only 1% of the total sequence length, and many of them are between 16 and 100 bp long. [image: log-scaled histogram of segment lengths]
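For readers who want to reproduce this kind of summary on their own assembly, here is a minimal sketch. It is not the script used for the histogram above; the function names `fasta_lengths` and `short_segment_share` and the 100 bp cutoff are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple


def fasta_lengths(lines: Iterable[str]) -> List[int]:
    """Return the sequence length of every record in FASTA-formatted lines."""
    lengths: List[int] = []
    current = None  # length of the record being read, None before the first header
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def short_segment_share(lengths: List[int], cutoff: int = 100) -> Tuple[float, float]:
    """Return (fraction of records shorter than cutoff, fraction of bases they hold)."""
    total = sum(lengths)
    if not lengths or total == 0:
        return (0.0, 0.0)
    short = [n for n in lengths if n < cutoff]
    return (len(short) / len(lengths), sum(short) / total)


if __name__ == "__main__":
    # Assumes Assembly.fasta is in the current directory.
    with open("Assembly.fasta") as handle:
        by_count, by_bases = short_segment_share(fasta_lengths(handle))
    print(f"{by_count:.1%} of records hold {by_bases:.1%} of the bases")
```

Reporting both fractions side by side makes it clear when a large number of records contributes almost nothing to the total sequence.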

When looking through additional parameters to tweak, the main one looks like Assembly.pruneLength. The other option, ReadGraph.minComponentSize, says it is currently ignored. I certainly don't want to remove meaningful but small sequence, but contigs that are tiny fractions of even a single read feel reasonable to remove. However, I wasn't sure if there are better suggestions or if it is better to prune the marker graph more aggressively. I've included the AssemblySummary.json file for reference. AssemblySummary.txt

Thanks, Alex

paoloczi commented 3 years ago

The many short assembled segments ("contigs") probably represent repeats that were not resolved by the assembly. Even though these segments are useless on their own, they still contain information if used in conjunction with the assembly graph, which represents the possible orders in which assembled segments could follow each other in the genome you are assembling.

If you want to pursue this, I suggest that you use Bandage to look at AssemblyGraph-BothStrands.gfa for your assembly. Bandage is easy to install and use. After installing, just load the gfa file and click on "Draw graph". I also suggest setting "Arrowheads in single node style" under Tools/Settings. In the resulting display, each assembled segment is shown as a line with an arrow showing its direction, and followed or preceded by other assembled segments according to the assembly graph. This connectivity often contains lots of useful information. Consider, for example, a situation like the following:

image

This tells us that the 1.1 Mb segment follows the 154 Kb segment, with some intervening sequence in the middle that could not be meaningfully assembled. If you disregard the connectivity information in the assembly graph, you have no way of knowing that those two assembled segments are probably near each other, in the order shown, in the genome being assembled.

If you have no use for this type of information, it is entirely fine to do one of two things:

paoloczi commented 3 years ago

One more comment. Note that the assembly is actually more contiguous than the lengths of assembled segments tell you. For example, in the picture I posted, you could say you have a 1.3 Mb segment assembled contiguously, with some unresolved sequence in the middle, rather than two unrelated segments 154 Kb and 1.1 Mb long. It is very possible that, if you use this kind of information, the "effective" N50 for your assembly is actually better than 70 Mb.
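As a rough illustration of using connectivity this way, the sketch below sums segment lengths over each connected component of a GFA 1 file. It is not something Shasta computes, and `component_lengths` is a hypothetical helper. It ignores link orientation, so a component containing a tangled repeat gives only an upper bound on what could be laid out contiguously. A file holding both strands may also report each region twice, once per strand.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def component_lengths(gfa_lines: Iterable[str]) -> List[int]:
    """Sum segment lengths per connected component of a GFA 1 graph."""
    seg_len: Dict[str, int] = {}
    neighbors: Dict[str, Set[str]] = defaultdict(set)

    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            name, seq = fields[1], fields[2]
            length = len(seq)
            if seq == "*":
                # Sequence omitted: fall back to the optional LN:i: tag.
                length = 0
                for tag in fields[3:]:
                    if tag.startswith("LN:i:"):
                        length = int(tag[5:])
            seg_len[name] = length
        elif fields[0] == "L" and len(fields) >= 5:
            # Links are treated as undirected; orientation is ignored here.
            a, b = fields[1], fields[3]
            neighbors[a].add(b)
            neighbors[b].add(a)

    seen: Set[str] = set()
    totals: List[int] = []
    for start in seg_len:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        total = 0
        # Depth-first walk over everything reachable from this segment.
        while stack:
            node = stack.pop()
            total += seg_len[node]
            for nxt in neighbors[node]:
                if nxt in seg_len and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        totals.append(total)
    return sorted(totals, reverse=True)


if __name__ == "__main__":
    # Assumes the GFA file named earlier in the thread is in the current directory.
    with open("AssemblyGraph-BothStrands.gfa") as handle:
        print(component_lengths(handle)[:10])
```

Comparing the largest component totals with the largest individual segment lengths gives a quick feel for how much contiguity the links add, before inspecting specific regions in Bandage.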

For this reason, in Shasta I don't use the word "contig" and instead use the GFA term "segment". The word "contig" erroneously suggests that a segment represents the maximum length that the assembly was able to put together at each particular location, which is not the case if you use assembly graph connectivity information.

And I forgot to tell you that I had a look at the assembly summary you posted and everything looks sane, as expected given the nice assembly results.

ASLeonard commented 3 years ago

Great, thanks for the additional tips!