short genome reconstruction

BioRB commented 4 years ago

Dear Developers, I'm trying to use Shasta to reconstruct a genome of 7500 bp (a viral genome). I did targeted sequencing on a MinION, so my data consist of highly redundant amplicons with an average size of 7500 bp. I ran Shasta with default parameters and got good results, but I didn't manage to obtain the whole genome, only several long contigs representing the viral genome. Do you have any suggestions for parameters to optimize this reconstruction with Shasta? So far I've used a minimum read length threshold of 5000 bp or 2000 bp, but I always get several contigs. I need to obtain a single draft genome. Thanks. Best, RB

BioRB commented 4 years ago

Additionally, we discovered that the whole genome is generated except for a large gap of 400 bp in a portion of the sequence that could correspond to the beginnings and ends of the reads (given our experimental conditions). Maybe this is due to lower base quality in those regions. Is it possible to lower the quality threshold? And if so, could this affect the average quality of the reconstruction? Thanks

BioRB commented 4 years ago

Using the parameter --Align.alignMethod 1 gives an error: Invalid option: option '--Align.alignMethod' is ambiguous and matches '--Align.alignMethodForMarkerGraph', and '--Align.alignMethodForReadGraph'

paoloczi commented 4 years ago

Shasta default parameters are optimized for coverage around 60x. In your case coverage is probably much higher, so you should increase the read length cutoff (--Reads.minReadLength) until the coverage used is around 60x. To compute that, look at AssemblySummary.html and find the field named Number of raw sequence bases. This is the number of bases actually used in that assembly. Divide that by the estimated length of your genome to obtain the actual coverage used. You mentioned that you experimented with --Reads.minReadLength, so you might already have done this. Just make sure Shasta is operating at the recommended coverage for default parameters.
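The coverage arithmetic described above can be sketched in a few lines of Python. This is only an illustration, not part of Shasta: the function names `coverage` and `pick_min_read_length` are hypothetical, and the read lengths are assumed to have been collected separately (for example from the input FASTA/FASTQ). The second helper estimates a starting value for --Reads.minReadLength by keeping the longest reads until the target coverage is reached.

```python
def coverage(total_bases, genome_length):
    """Coverage = bases used in the assembly / estimated genome length."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return total_bases / genome_length


def pick_min_read_length(read_lengths, genome_length, target=60.0):
    """Estimate a read length cutoff that reaches `target` coverage.

    Reads are accumulated longest-first; the length of the last read
    needed to reach the target is returned. Returns None if all reads
    together do not reach the target. If many reads share that length,
    the real coverage at this cutoff will be somewhat above the target.
    """
    running_bases = 0
    for length in sorted(read_lengths, reverse=True):
        running_bases += length
        if coverage(running_bases, genome_length) >= target:
            return length
    return None


# Hypothetical numbers: 450,000 bases used on a 7,500 bp genome is 60x.
print(coverage(450_000, 7_500))
```

The value returned by `pick_min_read_length` is only a first guess; the coverage actually used should still be checked against Number of raw sequence bases in AssemblySummary.html after the run.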

Alternatively, it is probably possible to optimize assembly parameters for high coverage (or for low coverage, if that is where you are operating). But this is not necessarily trivial and we have not done that. There is some discussion of this on Shasta issue #7 but no conclusion.

However your comment that you get a fragmented assembly may be an indication that you are actually operating at low coverage. If you are, use the same process as above - this time decreasing --Reads.minReadLength. If you don't have sufficient coverage, you may need to decrease --MarkerGraph.minCoverage and perhaps some of the other parameters mentioned in issue #7. Option --MarkerGraph.minCoverage controls the minimum number of reads for a marker graph vertex to be generated.

You may also be able to get better results using configuration file shasta/conf/Nanopore-Dec2019.conf, instead of default parameters. You can get that by downloading it from the repository. To specify it to the assembler, use --conf /absolutePathTo/Nanopore-Dec2019.conf - that is, you need to specify an absolute path. A relative path will not work.

The lack of assembled sequence at the beginning and end of your genome could be due to low read quality near the ends of the reads, and/or to low coverage in those regions. In addition, Shasta by default prunes a length of 6 markers at the ends of the assembly, which with default parameters corresponds to about 80 bases. This could explain part of the problem. To suppress this pruning, use --MarkerGraph.pruneIterationCount 0.
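The options mentioned so far can be combined into one command line. The following Python sketch only builds the argument list and does not run anything; the helper name `shasta_args`, the executable name `shasta`, the file name `reads.fasta`, and the cutoff value 5000 are assumptions for illustration (the actual executable name depends on the release you downloaded). It resolves the configuration file to an absolute path, since a relative path will not work.

```python
import os


def shasta_args(reads, conf, min_read_length, executable="shasta"):
    """Build a Shasta argument list with an absolute --conf path."""
    # --conf requires an absolute path, so resolve it here.
    conf_abs = os.path.abspath(os.path.expanduser(conf))
    return [
        executable,
        "--input", reads,
        "--conf", conf_abs,
        "--Reads.minReadLength", str(min_read_length),
        # Disable pruning of the ends of the assembly.
        "--MarkerGraph.pruneIterationCount", "0",
    ]


args = shasta_args("reads.fasta", "Nanopore-Dec2019.conf", 5000)
print(" ".join(args))
```

The resulting list can be passed to `subprocess.run(args, check=True)` once the paths have been verified.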

If none of the above helps, here are a couple of additional suggestions:

Post the following output files for one of your assemblies: AssemblySummary.html, LowHashBucketHistogram.csv, Assembly-BothStrands.gfa, plus the assembly log (stdout). This will make it easier for me to get an idea of what is happening in the assembly.
Use Shasta http server functionality to explore the details of your assembly. For this to be fruitful, some knowledge of Shasta computational methods is necessary.
Alternatively, if you are willing to share your data, I will be happy to experiment with it and make suggestions. We can do this privately if you prefer.

Regarding the --Align.alignMethod error: the documentation posted on GitHub Pages refers to the latest code in the repository, as the documentation itself explains. If you are using Shasta Release 0.4.0, the easiest way to get documentation that applies to that release is to expand one of the tar files that come with the release. You may also want to wait for Release 0.5.0, which is imminent. Alternatively, you can download a current test build as explained here. That will be in sync with the documentation you see on GitHub Pages. In the latest code, the two options --Align.alignMethodForMarkerGraph and --Align.alignMethodForReadGraph became a single option --Align.alignMethod.

bagashe commented 4 years ago

@BioRB : A new version of Shasta was just released. You can find it at https://github.com/chanzuckerberg/shasta/releases/tag/0.5.0

paoloczi commented 4 years ago

I am closing this issue due to lack of discussion in the last week. Please feel free to reopen it or create a new one if new topics of discussion emerge.


short genome reconstruction #150