very small assembly size - low coverage?

dcopetti commented 4 years ago

Hello, I tried assembling a haploid plant genome (estimated size 2.5 Gb) from raw ONT reads (28x, called with Guppy 4.0.11, N50 62 kb, mean QV 10.3, all reads >QV7 and >2 kb) with the following command:

shasta-Linux-0.5.1 --input reads.fq --threads 20 --assemblyDirectory kyuss_OS1 --config Nanopore-Jun2020.conf --Reads.minReadLength 2000 &>kyuss_OS1_stdout

The assembly I get is less than 2.7 Mb; the HTML summary file is attached. What could be the cause of such a small size? Thanks, Dario

kyuss_OS1_AssemblySummary.zip

paoloczi commented 4 years ago

Yes, the assembly summary shows that coverage is too low. There are under 20,000 alignments for over 2 million reads, and as a result over 98% of the reads are isolated in the read graph. This agrees with our experience that, at least with standard assembly parameters, we have not been able to obtain satisfactory assemblies under coverage 40X or so.
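For reference, coverage in this sense is the total number of read bases divided by the genome size. The following is a minimal sketch of that arithmetic, not something taken from Shasta: the function name `coverage` and its parameters are hypothetical, and treating the read length cutoff as inclusive is an assumption.

```python
def coverage(read_lengths, genome_size, min_read_length=0):
    """Return sequencing depth: kept read bases divided by genome size.

    read_lengths    -- iterable of read lengths in bases
    genome_size     -- estimated genome size in bases
    min_read_length -- reads shorter than this are ignored
                       (inclusive cutoff is an assumption of this sketch)
    """
    kept_bases = sum(n for n in read_lengths if n >= min_read_length)
    return kept_bases / genome_size


# Toy example: three reads against a 10 kb "genome".
reads = [5_000, 15_000, 20_000]
print(coverage(reads, 10_000))                          # all reads
print(coverage(reads, 10_000, min_read_length=10_000))  # long reads only
```

Running this on the real read length distribution with different cutoffs shows how much coverage each cutoff gives up before any assembly is started.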

A new release is imminent, and it includes various improvements plus new configuration files which might help in your case because they reduce sensitivity to coverage. If you don't want to wait for the release, you can follow the directions here to download a current test build. Then run using configuration file Nanopore-Sep2020.conf or Nanopore-OldGuppy-Sep2020.conf if you have reads created by a Guppy version older than 3.6.0.

paoloczi commented 4 years ago

Shasta 0.6.0 was released today, and you can use it according to the suggestions in the above comment.

paoloczi commented 4 years ago

I am closing this due to lack of discussion. Feel free to reopen it or create a new issue if needed.

dcopetti commented 3 years ago

Just FYI to complete the discussion: I was able to assemble the dataset with the 0.6.0 version - it was superfast compared to Flye! Stats are a bit lower though (Flye values in parentheses):

Total size: 2.13 Gb (2.28 Gb)
N50: 1.7 Mb (11.7 Mb)
N70: 1.1 Mb (7.3 Mb)
N90: 503 kb (3.3 Mb)
BUSCOs complete: 83.7% (90.4%, two internal rounds of polishing)
BUSCOs fragmented: 4.2% (3.5%)
BUSCOs missing: 12.1% (6.1%)

The input file was the same, Shasta parameters were --config Nanopore-Sep2020.conf --Reads.minReadLength 2000. Given that raw read coverage (~28x) may be the issue (and we are not producing more data for this), do you think there is something else I could tweak to bring stats up? Mostly to explore the potential of the software at this point, since I am quite happy with the Flye assembly. Thanks!
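For readers unfamiliar with the contiguity metrics quoted here, N50, N70 and N90 follow the usual definition: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches 50%, 70% or 90% of the assembly size. A minimal sketch of that definition (the function name `nx` is hypothetical, and this is not the code any assembler uses to report its statistics):

```python
def nx(lengths, x):
    """Return the Nx statistic (for example x=50 for N50) of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * x / 100
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running total to the threshold
        # defines the statistic.
        if running >= threshold:
            return length
    raise ValueError("no contig lengths given")


# Toy assembly of five contigs, 30 bases in total.
contigs = [10, 8, 6, 4, 2]
print(nx(contigs, 50), nx(contigs, 70), nx(contigs, 90))
```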

paoloczi commented 3 years ago

A few thoughts:

We have often seen Flye assemble more sequence, and with better contiguity, but at the cost of many additional assembly errors. See this comparison https://github.com/human-pangenomics/assembly-analysis for an example. This is on old data and with an old Shasta version, but will give you the idea.
Regarding contiguity: is the assembly messy or fragmented? You can find out by looking at Assembly-BothStrands.gfa in Bandage. "Messy" means that N50 is limited by branches in the graph, "fragmented" means that it is limited by dead ends. Potential cures are different depending on which of the two situations you are in.
We generally don't run BUSCO analysis on unpolished assemblies. Sequence quality is a bit below what is needed to give meaningful results with that, so BUSCO metrics on unpolished assemblies are a poor indicator of quality. After polishing, it's a different story (see our paper https://www.nature.com/articles/s41587-020-0503-6).
I would try increasing the read length cutoff to the default 10 Kb. Your reads are quite long, so you will not lose much coverage if you do that, and using the shorter reads can have a negative effect on contiguity.
If you send me AssemblySummary.html I may be able to give additional suggestions, but perhaps do an assembly with increased read length cutoff first?
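The "messy" versus "fragmented" distinction above can also be approximated numerically before opening Bandage. The sketch below is a rough heuristic and not a Shasta feature: it assumes a GFA 1 file with tab-separated `S` and `L` records, and counts segment ends that have no links (dead ends) and segment ends that have more than one link (branches). The function name `count_end_types` is hypothetical.

```python
from collections import Counter


def count_end_types(gfa_lines):
    """Count dead-end and branching segment ends in GFA 1 text lines."""
    segments = set()
    links_at_end = Counter()  # (segment name, "L" or "R") -> number of links
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 2:
            segments.add(fields[1])
        elif fields[0] == "L" and len(fields) >= 5:
            src, src_orient, dst, dst_orient = fields[1:5]
            # A link leaves the right end of a forward segment
            # (the left end if the segment is reversed) ...
            links_at_end[(src, "R" if src_orient == "+" else "L")] += 1
            # ... and enters the left end of a forward segment
            # (the right end if the segment is reversed).
            links_at_end[(dst, "L" if dst_orient == "+" else "R")] += 1
    ends = [(name, side) for name in segments for side in ("L", "R")]
    return {
        "segments": len(segments),
        "dead_ends": sum(1 for e in ends if links_at_end[e] == 0),
        "branches": sum(1 for e in ends if links_at_end[e] > 1),
    }


# Toy graph: segment a branches into b and c; segment d is isolated.
example = [
    "S\ta\t*", "S\tb\t*", "S\tc\t*", "S\td\t*",
    "L\ta\t+\tb\t+\t0M",
    "L\ta\t+\tc\t+\t0M",
]
print(count_end_types(example))
```

To use it on a real assembly, pass an open file handle, for example `count_end_types(open("Assembly-BothStrands.gfa"))`. Many dead ends relative to branches would point toward "fragmented", the reverse toward "messy"; the counts are only a hint and do not replace looking at the graph.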

Paolo


dcopetti commented 3 years ago

Hi, sorry for following up so late. Indeed, including short reads (and therefore more bases) is detrimental to Shasta's assemblies. I ran assemblies on read sets filtered at different minimum read lengths.

In my case it looks like, up to a cutoff of 14 kb, the fewer and longer the reads, the better. Beyond that the relationship flips: total assembly size and N50 both decrease as the minimum read length increases (following the line). So there is definitely a sweet spot for Shasta's input file (and it would be easy to find, since the assemblies are so fast), though the metrics are still very far away from Flye (after looking at gene collinearity with a close relative, I found only one misassembly). So now we know :-)
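A sweep like this is easy to script. The sketch below only builds the command lines and does not run them; it uses the same options as the command at the top of this thread, while the function name `sweep_commands`, the cutoff values, and the `minlen_<cutoff>` directory names are hypothetical choices for illustration.

```python
def sweep_commands(executable, reads, config, cutoffs, threads=20):
    """Build one Shasta command (as an argument list) per read length cutoff."""
    commands = []
    for cutoff in cutoffs:
        commands.append([
            executable,
            "--input", reads,
            "--threads", str(threads),
            # Hypothetical naming scheme: one output directory per cutoff.
            "--assemblyDirectory", f"minlen_{cutoff}",
            "--config", config,
            "--Reads.minReadLength", str(cutoff),
        ])
    return commands


# Example cutoffs are illustrative, not the ones used in this thread.
for cmd in sweep_commands("shasta", "reads.fq", "Nanopore-Sep2020.conf",
                          [2000, 10000, 14000]):
    print(" ".join(cmd))
```

Each argument list can be passed to `subprocess.run(cmd, check=True)` to launch the assembly once the printed commands look right.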

paoloczi commented 3 years ago

It is probably possible to get better assemblies with some tuning of assembly parameters. But first I would upgrade to the most recent Shasta version, 0.7.0, released in December, rather than release 0.6.0, which I think you are using. Release 0.7.0 includes a new configuration file, Nanopore-Sep2020.conf, which usually does a better job than configuration file Nanopore-Jun2020.conf, which I think you are using, based on the above conversation.

Even after you do that, keep in mind that Nanopore-Sep2020.conf works best for human assemblies at coverage around 60x, and when working under different conditions we usually have to change some of the options to get optimal assemblies. The complex curve you obtained is probably the result of the need to tune some other assembly parameters as you change the read length cutoff.

If you run an assembly with 0.7.0 and Nanopore-Sep2020.conf, please post AssemblySummary.html plus the log output of the assembly, and I may be able to suggest changes in assembly parameters.

dcopetti commented 3 years ago

Yes, those assemblies were made last fall and I just wanted to post these results to help others with similar limitations and organisms. Unfortunately I don't have time to work more on this now, but it would be nice if in the future we could use Shasta to assemble plant genomes at levels comparable to other software and with limited parameter sweeps. Thank you for developing it though, it uses a very nice principle.

paoloczi commented 3 years ago

If the reads are publicly available and/or you can post them somewhere, I could do some experimentation and report here.

paoloczi commented 3 years ago

@dcopetti kindly made his reads available, which allowed me to do some experimentation. It appears that these reads have lower accuracy than similar reads from human genomes. After some adjustments of assembly parameters to account for the lower accuracy I was able to obtain a 2.185 Gb assembly with 5.5 Mb N50. I created a new configuration file shasta/conf/Nanopore-Plants-Apr2021.conf with the parameters I used for this assembly. See the comments in that configuration file for more details.

I don't know how transportable this new configuration file is to other plant genomes, but it is a start.

chanzuckerberg / shasta

very small assembly size - low coverage? #200