Configure file on Q20 reads

zyj1729 commented 2 years ago

Hi, I'm trying to run Shasta on an ONT Q20 dataset. Is there another configuration for Q20 reads besides Nanopore-Oct2021? I used Nanopore-Oct2021, and the resulting Assembly.fasta file is only 2.6G (2704149439). Thanks.

paoloczi commented 2 years ago

Unfortunately we don't yet have a Shasta configuration for Q20 reads. I did run some assemblies with Q20 reads, but that was some time ago, so they are probably irrelevant for your purposes, as Q20 reads have been changing rapidly.

I assume this is for a human genome. How much coverage do you have? If you are using Nanopore-Oct2021, the assembly discards all reads shorter than 10 Kb, so this might have reduced your coverage. That is plausible, since the Q20 reads I have seen so far were on the short side.

Please post here AssemblySummary.html from your assembly. Among other things, that will tell us how much coverage the assembly was using, and it might give me other clues. The following additional files could also be useful to help diagnose what is going on in your assembly:

LowHashBucketHistogram.csv
DisjointSetsHistogram.csv
Binned-ReadLengthHistogram.csv

If you are at very low coverage, a smaller than usual amount of assembled sequence is expected. In that case, you could consider reducing --Reads.minReadLength a bit.
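To get a feel for how much coverage a read length cutoff costs, here is a minimal Python sketch. It is not part of Shasta: the function name, the toy read lengths, and the toy genome size are all made up for illustration, and it assumes reads exactly at the cutoff are kept.

```python
def retained_coverage(read_lengths, min_read_length, genome_size):
    """Coverage left after dropping reads shorter than min_read_length.

    Hypothetical helper, not a Shasta API. Reads with length equal to
    the cutoff are assumed to be kept.
    """
    kept_bases = sum(n for n in read_lengths if n >= min_read_length)
    return kept_bases / genome_size


# Toy data, not taken from this issue: six reads over a 10 kb "genome".
toy_reads = [4_000, 8_000, 9_000, 12_000, 15_000, 20_000]
toy_genome = 10_000

# Compare the 10 Kb cutoff with two lower candidate cutoffs.
for cutoff in (10_000, 8_000, 5_000):
    print(cutoff, retained_coverage(toy_reads, cutoff, toy_genome))
```

Running the same kind of sweep on your real read lengths shows how much coverage each candidate value of --Reads.minReadLength would recover.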

zyj1729 commented 2 years ago

Thanks for the feedback! Yes, it's a human genome. Our read depth is 30X. I saved AssemblySummary.html as a PDF file and posted it here. AssembleSummary.pdf

paoloczi commented 2 years ago

Actually from that assembly summary I see that the assembly is using 183 Gb of reads (after discarding 41 Gb of reads shorter than 10 Kb). So you are around 60X, not 30X. So coverage is not the explanation.
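The arithmetic behind that estimate can be sketched as follows. The 183 Gb and 41 Gb figures are the ones quoted above; the genome size of about 3.1 Gb is an assumption for an approximate haploid human genome, and the helper name is made up for illustration.

```python
# Assumed approximate haploid human genome size, in bases.
HUMAN_GENOME_BP = 3.1e9


def coverage(total_bases, genome_size=HUMAN_GENOME_BP):
    """Average coverage: total read bases divided by genome size."""
    return total_bases / genome_size


used_bases = 183e9       # reads kept by the assembly (10 Kb or longer)
discarded_bases = 41e9   # reads dropped for being shorter than 10 Kb

print("coverage used by the assembly:", round(coverage(used_bases)))
print("coverage before filtering:", round(coverage(used_bases + discarded_bases)))
print("fraction of bases discarded:",
      round(discarded_bases / (used_bases + discarded_bases), 3))
```

Under that assumed genome size, the assembly is working with roughly twice the 30X that was expected.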

There is a higher than usual fraction of isolated reads in the read graph, so something needs to be tweaked either in the MinHash parameters or in the alignment criteria. I might be able to get a better idea if you post the additional files I listed above. You can zip them in a single zip file, and GitHub will allow you to post that.

zyj1729 commented 2 years ago

Hi, sorry for the late reply. I put the zip file below. Thanks! additional.zip

paoloczi commented 2 years ago

It looks like at least some changes to MinHash parameters will be needed. The Nanopore-Oct2021 configuration has the following:

[MinHash]
minBucketSize = 5
maxBucketSize = 30

But from the LowHashBucketHistogram.csv you attached I see that the bucket population is shifted to higher bucket sizes.

Given this, I suggest trying the following options for your next assembly:

--MinHash.minBucketSize 30 --MinHash.maxBucketSize 70
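To illustrate why the window matters, here is a rough Python sketch that measures how much of a bucket histogram falls inside a given [minBucketSize, maxBucketSize] window. The histogram values are synthetic, not the ones from the attached file; the function name is hypothetical; and treating both bounds as inclusive is an assumption.

```python
def fraction_in_window(histogram, min_bucket_size, max_bucket_size):
    """Fraction of bucket entries that sit in buckets inside the window.

    histogram is an iterable of (bucket_size, bucket_count) pairs.
    Each bucket contributes bucket_size entries. Both bounds are
    treated as inclusive, which is an assumption of this sketch.
    """
    total = sum(size * count for size, count in histogram)
    if total == 0:
        return 0.0
    kept = sum(
        size * count
        for size, count in histogram
        if min_bucket_size <= size <= max_bucket_size
    )
    return kept / total


# Synthetic histogram whose population is shifted toward larger buckets.
synthetic = [(10, 100), (20, 200), (40, 500), (50, 600), (60, 400), (80, 50)]

print("window 5-30: ", fraction_in_window(synthetic, 5, 30))
print("window 30-70:", fraction_in_window(synthetic, 30, 70))
```

With a population shifted like this, the 5 to 30 window sees only a small share of the bucket entries, while a 30 to 70 window covers most of them.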

Further changes to assembly parameters may still be needed beyond the above, but it is hard to tell in advance given the significant differences between these Q20 reads and previous ONT reads. If you post the results of your next assembly I may be able to provide additional suggestions.

I anticipate that we will soon provide a Q20 config in an upcoming Shasta release. You may choose to wait for that. Alternatively, if you are able to obtain a satisfactory assembly, please post here the options you use, so we can use your experience as a starting point to create the Q20 assembly configuration.

paoloczi commented 2 years ago

I added the enhancement label as a reminder that this is a request for a config usable for Q20 reads.

paoloczi commented 1 year ago

Shasta development moved to a new repository (see the README for more information). I created a new issue in the new repository, paoloshasta/shasta#1, to reflect this request. If additional discussion is needed, let's continue it there.

chanzuckerberg / shasta

Configure file on Q20 reads #273