chanzuckerberg / shasta

[MOVED] Moved to paoloshasta/shasta. De novo assembly from Oxford Nanopore reads

assembly half the size it should be and low N50 #295

Closed rob234king closed 1 year ago

rob234king commented 2 years ago

I have a fungal genome of 30.5 Mbp and nanopore reads filtered to an average quality of at least 16 and a read length of at least 3000 bp. For the assembly I raise the length cutoff to 5000 bp, which leaves around 70X coverage. The species is heterozygous between haplotypes and I want an assembly with both haplotypes represented.
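As a rough check on coverage figures like the one above, depth is just the total bases in the retained reads divided by the genome size. The sketch below uses hypothetical toy read lengths and a toy genome size (not the data from this issue); `coverage` is an illustrative helper, not part of Shasta.

```python
def coverage(read_lengths, genome_size_bp, min_length=0):
    """Return sequencing depth using only reads at least min_length long."""
    kept = [n for n in read_lengths if n >= min_length]
    return sum(kept) / genome_size_bp


# Hypothetical toy values, chosen only so the arithmetic is easy to follow.
toy_lengths = [2_000, 4_000, 6_000, 8_000, 10_000]
toy_genome = 1_000  # bp

print(coverage(toy_lengths, toy_genome))         # depth with all reads
print(coverage(toy_lengths, toy_genome, 5_000))  # depth with reads >= 5000 bp

# Total bases implied by the figures in this post: 70X over 30.5 Mbp.
print(70 * 30.5e6)
```

Note that if the 70X is measured against the 30.5 Mbp haploid size and the genome is diploid, each haplotype would see roughly half of that depth on average.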

A Flye assembly inspected in Bandage should give complete chromosomes if I can assign the correct haplotype sequence to each haplotig, but I was hoping Shasta would do better and produce haplotigs without me manually working in Bandage.

command:

shasta-Linux-0.10.0 --input ../raw_data/400K_Q16_3000.fastq --config Nanopore-May2022 --assemblyDirectory non_size_A_shasta_400K_min5kA --threads 20 --Reads.minReadLength 5000

output AssemblySummary.zip

Can you recommend which settings to change to get the expected genome size and, additionally, to get haplotigs? There are two configs that I could try and tweak, but I first wanted to sort out any basic option I am missing: Nanopore-UL-iterative-Sep2020.conf and Nanopore-UL-Phased-Jan2022.conf.

paoloczi commented 2 years ago

We usually don't work with reads that short and stay with the 10 Kb read length cutoff, but at 70X coverage it should be possible to get a better assembly.

From your description I understand that this is a diploid 30.5 Mb genome (that is, 30.5 Mb per haplotype). Please confirm. Do you have an estimate of heterozygosity? That affects the assembly strategy if you want to separate haplotypes.

Can you repost AssemblySummary.html? Zipped form is ok as GitHub does not allow posting html files, but the zip file you posted appears to be of zero length.

The configurations with UL in their name are for "Ultra-Long" reads with N50 of 50 Kb or more, so don't use those. I have experimented with fungal assemblies in the past, and if I remember correctly, some tweaks to the assembly configurations resulted in big improvements in the assembly. With more information I should be able to make suggestions.

paoloczi commented 2 years ago

One more thing. Shasta haploid assembly does aggressive bubble removal. If this genome has high heterozygosity, the assembly graph could contain large bubbles of which Shasta only keeps one side. This could be responsible for the decreased genome size, but I need more information to know if this is what is happening. There are options to tweak the bubble removal process.
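To make the effect on assembly size concrete, here is a toy sketch of how keeping only one side of each bubble shrinks the total assembled length. It is not Shasta's algorithm: the function name, the bubble representation as pairs of branch lengths, and the choice of always keeping the first branch are all assumptions made for illustration, and the numbers are made up.

```python
def assembled_length(shared_bp, bubbles, keep_both):
    """Total length of a toy assembly.

    shared_bp: bases outside any bubble (assembled once either way).
    bubbles:   list of (side_a_bp, side_b_bp) branch lengths.
    keep_both: True keeps both branches, False keeps only the first
               (an arbitrary choice made for this illustration).
    """
    total = shared_bp
    for side_a, side_b in bubbles:
        if keep_both:
            total += side_a + side_b
        else:
            total += side_a
    return total


# Hypothetical toy graph: 10 kb of shared sequence and two bubbles.
toy_bubbles = [(5_000, 5_200), (3_000, 2_900)]

print(assembled_length(10_000, toy_bubbles, keep_both=False))
print(assembled_length(10_000, toy_bubbles, keep_both=True))
```

The larger the share of the genome that sits inside heterozygous bubbles, the closer the collapsed assembly gets to a single haplotype's length rather than the sum of both.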

rob234king commented 2 years ago

AssemblySummary.zip

We reached our disk quota so I couldn't make a zip file, fixed it now.

rob234king commented 2 years ago

I don't have a metric for how heterozygous it is. Can you recommend a tool for that? So far I've just looked in IGV to see how different the two haplotypes are. There is a good reference available already for this species, so I can benchmark how well I do. igv_snapshot

rob234king commented 2 years ago

My reads should have an error rate of about 2%, and the genome appears very heterozygous, but it should be 30.5 Mbp per haplotype.

paoloczi commented 2 years ago

The assembly summary shows that 76% of the reads are isolated in the read graph, and under these conditions we cannot expect a good quality assembly. This is probably due to a combination of MinHash criteria and alignment criteria, which resulted in an insufficient number of alignments. This is not too surprising as the combination of shorter than usual reads of higher than usual accuracy is not something that we have experimented with in the past.
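For readers unfamiliar with the term, an isolated read here is one with no edges in the read graph. The minimal sketch below counts such reads in a toy graph where every alignment is treated as an edge. That is a simplifying assumption, since real read graph construction may drop some alignments, and the read names and helper function are hypothetical.

```python
def isolated_fraction(read_ids, alignments):
    """Fraction of reads that take part in no alignment pair."""
    connected = set()
    for read_a, read_b in alignments:
        connected.add(read_a)
        connected.add(read_b)
    isolated = [r for r in read_ids if r not in connected]
    return len(isolated) / len(read_ids)


# Hypothetical toy input: four reads, one alignment between two of them.
toy_reads = ["r0", "r1", "r2", "r3"]
toy_alignments = [("r0", "r1")]

print(isolated_fraction(toy_reads, toy_alignments))
```

Isolated reads cannot contribute to any contig that spans more than one read, so a high fraction leaves most of the data unused.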

Some tweaks to assembly parameters are going to be necessary. I could give some suggestions here, but I think this will require a few iterations, and it would be more efficient if I do it, provided you are in a position to share the reads with me privately. If you would like to do this, please e-mail me using the e-mail address in the Shasta paper linked from the top level README file in the Shasta repository.

Otherwise, let me know and I can give you some suggestions for the next assembly iteration.

Thank you for the IGV plot. It is informative and indicates that the assembly graph should contain at least some large heterozygous bubbles. So, once the other assembly parameters are fixed, it will probably be necessary to use a less aggressive than usual bubble removal process.

paoloczi commented 2 years ago

To confirm my above comment regarding an insufficient number of alignments: that assembly has about 279,000 alignments for 252,000 reads, so just more than one alignment per read. In a healthy assembly, we usually need at least 5 alignments per read.
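The arithmetic behind this comparison can be written out as follows, using the two counts quoted above and the rule of thumb of 5 alignments per read. The variable names are illustrative only.

```python
alignments = 279_000      # alignments reported for this assembly
reads = 252_000           # reads in this assembly
target_per_read = 5       # rule-of-thumb minimum quoted above

per_read = alignments / reads
needed = target_per_read * reads   # alignments needed to reach the target
shortfall = needed / alignments    # factor by which alignments fall short

print(round(per_read, 2))
print(needed)
print(round(shortfall, 1))
```

On these figures the assembly has roughly a fifth to a quarter of the alignments that a healthy run would be expected to have.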

The low number of alignments could be due to one of two causes:

rob234king commented 2 years ago

I'm working on getting permission to share the data, and hopefully I can provide a better data set too, but it is taking a little time to sort out. Thanks for your help.

paoloczi commented 1 year ago

Shasta development moved to a new repository (see the README for more information). I created a new issue in the new repository, paoloshasta/shasta#1, to reflect this request. If additional discussion is needed, let's continue it there.