Local assembly using Shasta

ctl43 commented 2 years ago

Thanks for making this nice tool for the nanopore community. Currently, I am developing a web and dry lab protocol and would like to use Shasta as a major component in my dry lab part, assembling sequencing reads locally (within a region only).

However, I am experiencing some difficulties in that the assembly result does not fully recapitulate the alignment result (see attached figure).

The assembly does not extend to what it should be (see blue arrows).
Some assemblies are short and they do not exist in sequencing reads (see red box).

Here is the command I used shasta-Linux-0.8.0 --input nanopore.fq --config Nanopore-Oct2021 --Reads.minReadLength 1000 --Align.maxTrim 3 --MarkerGraph.minCoverage 5.

Would you mind pointing me to a direction that I can result in the expected sequence? Or it is the nature of Shasta that it is not suitable for this kind of task?

Thanks! shasta

paoloczi commented 2 years ago

Given that you are working under unusual conditions, it is likely that the standard assembly configurations (such as Nanopore-Oct2021) will not work "out of the box", and some optimization of assembly parameters will be necessary.

Can you give more information on your assembly conditions? How wide is the region that you are trying to assemble? How much coverage do you expect? Are you using Guppy 5 with "super" accuracy for basecalling? Is this a diploid genome? What is the scale of the region displayed in the picture you posted?

With more information I may be able to give some suggestions on assembly parameters. And if you are willing to share privately one of the datasets you are trying to assemble I will be happy to experiment with it and then propose an optimal assembly configuration. You can get my e-mail address from the 2020 paper listed in the Shasta README - that is, the top level page of the GitHub repository.

ctl43 commented 2 years ago

Thanks for your prompt reply and kind help. Here is more information.

How wide is the region that you are trying to assemble? The length of the regions varied in the genome ranging from 10kb to 100kb. The region that I showed here is the 3kb in a ~25kb region (zoom in for display).
How much coverage do you expect? at least 20x, the region I showed is ~80x.
Are you using Guppy 5 with "super" accuracy for basecalling? Unfortunately, I am just using some old data generated in 2021, at the time still using Guppy 4, that is why so many indels and missense bases. But you are right, it makes no sense to use an old basecaller. I will definitely basecall it again.
Is this a diploid genome? It is a human sample, so yes, it is diploid.
What is the scale of the region displayed in the picture you posted? Around 3 kb.

I am more than happy to privately share a subset of data for you to experiment with it once I have basecalled it again. Do you prefer Guppy 5 or 6, which was just released on 03/30?

paoloczi commented 2 years ago

Guppy 5 and 6 should give approximately equivalent results, so you might as well use Guppy 6. But make sure to use "super" accuracy because that is the option that delivers the substantial accuracy improvements relative to Guppy 4. It will make the base calling run slower, however. When you have the new reads contact me via e-mail - see my first comment for how to locate my e-mail address.

In the meantime, while still with Guppy 4 reads, you might be able to improve your assembly by just using --config Nanopore-Oct2021, without the additional options you used (--Reads.minReadLength 1000 --Align.maxTrim 3 --MarkerGraph.minCoverage 5). However, like you pointed out there is no value in optimizing your process for these old data.

ctl43 commented 2 years ago

Contacted through Email.

paoloczi commented 2 years ago

I prefer to keep this open for now. When it gets resolved we can post a quick summary.

paoloczi commented 2 years ago

@ctl43 explained that, in this application, the reads were selected in such a way that about half of the reads end immediately to the left of a small gap, and the remaining half begin immediately following that same gap. The gap was expected to be around 12-14 bases. Under these conditions one does not expect an assembly in which the two sides of the gap are assembled together. Rather, one expects two separate and disjoint assembled segments ("contigs"), with the goal of assembling as close as possible to the gap on both sides.

In @ctl43's original assembly, the gap between the two assembled sequences was too large, around 200 bases. I looked at the details of the assembly and found two causes for that:

The Shasta assembly has a step that prunes dead ends of the marker graph. The motivation for this is that in typical assemblies those sections of assembled sequence are unreliable. Suppressing that step via command line option --MarkerGraph.pruneIterationCount 0 reduced the gap down to about 30 bases.
The second cause for the larger gap was that Shasta uses markers for assembly. In the standard assembly parameters that @ctl43 used, the marker density was 10%, which means that on the average one marker occurs every 10 bases in RLE sequence or about 14 bases in raw base sequence. The assembly has to begin and end at one of these marker locations, so we can expect a loss of about 7 bases (with high variability) on each side. The solution was to increase the marker density. I suggested increasing the marker density to 20% via command line option --Kmers.probability 0.2, and that reduced the gap to 14 bases. That was satisfactory for the test case @ctl43 gave me, but @ctl43 found that to solve all of his cases he needed to use 100% marker density via --Kmers.probability 1 - a choice that would be prohibitive for performance in a full genome assembly, but that only had a small performance cost for these small assemblies.

The performance problems initially reported by @ctl43 were due to the initial selection of 14-base markers. This was taking about 15 seconds which would be negligible for a full genome assembly, but was dominant in these small assemblies. To solve that I suggested reducing marker length to 12 bases via command line option --Kmers.k 12. That reduced assembly time down to just a few seconds without negative effects on the assembly.

In summary, a satisfactory solution was achieved with the use of the following command line:

shasta --input sample_data.fa --config Nanopore-Oct2021 --Reads.minReadLength 1000 --Kmers.k 12 --Kmers.probability 1 --MarkerGraph.pruneIterationCount 0

I will leave this issue open for now for visibility and to facilitate discussion.

paoloczi commented 2 years ago

I am closing this due to lack of additional discussion.

chanzuckerberg / shasta

Local assembly using Shasta #286