chanzuckerberg / shasta

[MOVED] Moved to paoloshasta/shasta. De novo assembly from Oxford Nanopore reads
Other
272 stars 59 forks source link

heterozygous plant and k-mer list #190

Closed dcopetti closed 4 years ago

dcopetti commented 4 years ago

Hello, I would like to test Shasta on a highly heterozygous plant genome I am assembling. The estimated size is ~2.5 Gb, but the high heterozygosity make it look like it is a 5 Gb genome. The goal is then to have a diploid assembly, one fasta for each allele. I have 226 Gb of PromethION data (called with Guppy 3.0.3, about 45x of 5 Gb genome, N50 15 kb - if I keep e.g reads above 10 kb I still have 31x cov and 25 kb N50), and Flye gave me a 3.7 Gb assembly (N50 700 kb), with about 400 Mb of collapsed haplotypes - likely from homozygous regions that it can't disentangle. I am trying to see if Shasta can resolve those, as well as trying to get above the 4 Gb of assembly size - thus having most of the allelic regions resolved.

I saw you did a lot of work tweaking parameters for a tomato genome, and I was thinking to try a parameter sweep to see what works best. On which ones would you focus at first? Are there other tweaks I could test?

--Align.maxSkip 50
--MinHash.m 3
--MinHash.minBucketSize 3
--MinHash.maxBucketSize 30
--MinHash.minHashIterationCount 30
--MinHash.minFrequency 3

How important is the min read length parameter? Should I try to stay above/around 40x per allele, even at cost of including many reads <10 kb (that could be just contained into others)?

Lastly, I figured out a way to isolate only highly informative k-mers, and I am thinking to supply them as a list. Which format should that file be? just a sequence in each line? I see that you moved from 10 to 14 nt k-mers for the newer basecalled data - good. I have on average about 37 of my specific 21-mers per kb, would you still recommend to go down in size to ~14, or can I stay on larger k-mer sizes? I guess it is OK to have k-mer length of an even number. thanks,

Dario

PS: I see you have a UL version of the conf files and from a quick look it looks like the only difference is min read length. Are those configurations for ultralong reads? I will have other lines with a read N50 of 37 and >50 kb (guppy 4.0.1), would you switch to UL settings for those?

dcopetti commented 4 years ago

After reading the computational methods page, I see that you use compressed k-mers as markers. Do you think it will be an issue if my list of k-mers/markers is not compressed? Or better, do you know of a way to compress Illumina reads prior to k-mer counting?

I think I have a way to get you the "most efficient" set of markers (no repeats, single copy) to get the graph with less branches. The whole Shasta algorithm does not need to be touched, but I think we can improve the specificity of each edge. I'd love to discuss these steps with you, thanks

paoloczi commented 4 years ago

Many things to cover here. Let's address them one at a time.

I would like to test Shasta on a highly heterozygous plant genome I am assembling. The estimated size is ~2.5 Gb, but the high heterozygosity make it look like it is a 5 Gb genome. The goal is then to have a diploid assembly, one fasta for each allele.

Unless the haplotypes are highly diverged, the current Shasta implementation will not be able to separate them, and will assemble a single sequence with heterozygous "bubbles" at locations where the haplotypes diverge. All heterozygous bubbles are then removed by default, but you can use option --MarkerGraph.simplifyLength to keep them in the assembly. However that would not achieve your goal of a fully phased assembly.

Work is in progress to permit haplotype separation, even for the case where haplotypes have high similarity, like for the human genome. However that work is at a preliminary stage.

I have 226 Gb of PromethION data (called with Guppy 3.0.3, about 45x of 5 Gb genome, N50 15 kb - if I keep e.g reads above 10 kb I still have 31x cov and 25 kb N50), and Flye gave me a 3.7 Gb assembly (N50 700 kb), with about 400 Mb of collapsed haplotypes - likely from homozygous regions that it can't disentangle. I am trying to see if Shasta can resolve those, as well as trying to get above the 4 Gb of assembly size - thus having most of the allelic regions resolved.

I strongly recommend redoing the base calling using Guppy 3.6.0 or newer. Guppy 3.6.0 is a huge improvement from previous versions, and the resulting reads are going to be much more accurate.

I saw you did a lot of work tweaking parameters for a tomato genome, and I was thinking to try a parameter sweep to see what works best. On which ones would you focus at first? Are there other tweaks I could test?

--Align.maxSkip 50
--MinHash.m 3
--MinHash.minBucketSize 3
--MinHash.maxBucketSize 30
--MinHash.minHashIterationCount 30
--MinHash.minFrequency 3

This is a good start, but add to it at least --Align.minAlignedFraction and --Align.minAlignedMarkerCount . Work is in progress to automate the selection of some of these parameters. Config file shasta/conf/Nanopore-Dec2019.conf should be your starting point, or shasta/conf/Nanopore-Jun2020.conf if you switch to Guppy 3.6.0 or newer.

The latest code provides a new option (--ReadGraph.creationMethod 2) which uses alignment statistics to choose appropriate thresholds (see shasta/conf/Nanopore-Sep2020.conf in the latest code). This functionality is experimental and not available in a released version of Shasta, but you can easily download a current test build as described here.

How important is the min read length parameter? Should I try to stay above/around 40x per allele, even at cost of including many reads <10 kb (that could be just contained into others)?

40x per allele should be a comfortable place to be. If your genome does not have long repeats you could consider decreasing your read length cutoff to get you to around 60x per allele.

Lastly, I figured out a way to isolate only highly informative k-mers, and I am thinking to supply them as a list. Which format should that file be? just a sequence in each line?

Yes, one sequence per line, and use --Kmers.generationMethod 3 --Kmers.file absolutePathToFileName. The file name must be specified as an absolute path. Each line of the file should contain a Run-Length Encoded (RLE) k-mer (no repeated bases). Note that including a k-mer will result in the reverse complement also being used as a marker. I should warn you that we have never been able to improve assembly results by changing the choice of k-mers to be used as markers.

I see that you moved from 10 to 14 nt k-mers for the newer basecalled data - good. I have on average about 37 of my specific 21-mers per kb, would you still recommend to go down in size to ~14, or can I stay on larger k-mer sizes? I guess it is OK to have k-mer length of an even number.

Only for Guppy 3.6.0 (or newer). For older reads I recommend staying at 10. The maximum value supported is 15, and it does not need to be even. These are RLE k-mers. On average (for random sequence), each RLE base corresponds to 4/3 bases. So 10-mers in RLE correspond to 13-mers on average.

I see you have a UL version of the conf files and from a quick look it looks like the only difference is min read length. Are those configurations for ultralong reads? I will have other lines with a read N50 of 37 and >50 kb (guppy 4.0.1), would you switch to UL settings for those?

There are more differences than just the read length cutoff. I would not use UL settings for your reads.

After reading the computational methods page, I see that you use compressed k-mers as markers. Do you think it will be an issue if my list of k-mers/markers is not compressed?

I answered this above. The k-mers need to be RLE k-mers.

Or better, do you know of a way to compress Illumina reads prior to k-mer counting?

Compressing k-mers is trivial - just remove repeated bases. But perhaps I misunderstand your question here.

I think I have a way to get you the "most efficient" set of markers (no repeats, single copy) to get the graph with less branches. The whole Shasta algorithm does not need to be touched, but I think we can improve the specificity of each edge. I'd love to discuss these steps with you, thanks

We tried things like this a few times, always without seeing an improvements on assembly quality. But we were not trying to use k-mer selection for the purpose of haplotype separation, so your situation could well be different.

For private interaction, feel free to contact me at the e-mail address indicated on our paper (the Shasta README contains a link to the paper).

paoloczi commented 4 years ago

One more thought: you can use the Shasta http server functionality to explore your assembly results. In particular, looking at the read graph will help you assess the amount of haplotype separation achieved. See here for details. Given that you have a small assembly, I suggest running your assembly with --memoryMode filesystem --memoryBacking disk.

paoloczi commented 4 years ago

I am closing this due to lack of discussion. Feel free to reopen it, or create a new issue, if additional discussion topics or questions emerge.