Closed EvoEpi closed 4 years ago
Did you adjust --Reads.minReadLength to make sure Shasta is using the correct amount of coverage?
Do you know if your assembly is fragmented or messy? That is, do most assembled segments begin/end at a dead end, or at a branch involving other segments? You can find out by looking at the assembly graph using Bandage. The parameter changes to be made will depend on the answer to this question.
It would also help if you could tell us the expected size of the genome you are assembling and post the following files from your assembly directory:
AssemblySummary.html
LowHashBucketHistogram.csv
Binned-ReadLengthHistogram.csv
stdout from the assembly
It is also possible that you are hitting limitations imposed by the length and quality of your reads, but we will know better once we have the above answers and information.
Thanks for getting back to me!
--Reads.minReadLength was set to 10000.
The input data looks pretty good:
I still need to look into Bandage.
Estimated genome size is 2.2-2.5 Gbp.
Here are the requested files: Archive.zip
From AssemblySummary.html I see that the Number of raw sequence bases is 56 Gb. And the section entitled Reads discarded on input shows that 72 Gb of coverage were discarded because the reads were too short, that is, shorter than the 10 Kb threshold you used.
Based on an estimated genome size of 2.4 Gb, this assembly is operating at about 23x coverage, which is too low to give good assembly results, at least with default assembly parameters. We know that at this coverage and using default parameters the assembly is generally fragmented. The contig lengths you are observing are entirely consistent with this.
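For reference, the coverage figure above is just the number of bases used divided by the genome size. A minimal sketch of that arithmetic, using the 56 Gb figure and the 2.2-2.5 Gb size range quoted in this thread (the function name is made up for illustration):

```python
def coverage(total_bases: float, genome_size: float) -> float:
    """Average depth of coverage: bases used in the assembly / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


# 56 Gb of reads kept after the 10 Kb length cutoff (from AssemblySummary.html)
used_bases = 56e9

# Genome size estimates mentioned in this thread
for genome_size in (2.2e9, 2.4e9, 2.5e9):
    depth = coverage(used_bases, genome_size)
    print(f"{genome_size / 1e9:.1f} Gb genome: {depth:.1f}x")
```

Anywhere in the estimated size range, the coverage stays in the low-to-mid 20s.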
So you will need to reduce the read length cutoff, probably down to 3-5 kb. Immediately after the reads load, you can look at line 2 of Binned-ReadLengthHistogram.csv, column CumulativeBases, without waiting for the assembly to complete, to see how much coverage is being used.
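If you want to script that check, something along these lines might work. It is only a sketch: it assumes the file is comma-separated with a header on line 1 (so line 2 is the first data row), and the sample text below is invented for illustration. Only the CumulativeBases column name comes from this thread; the Bin column and the numbers are made up.

```python
import csv
import io


def coverage_from_histogram(csv_text: str, genome_size: float) -> float:
    """Estimate coverage from the first data row (line 2) of the histogram.

    Assumes a comma-separated file whose header row contains a
    CumulativeBases column.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    first_row = next(reader)  # line 2 of the file
    return float(first_row["CumulativeBases"]) / genome_size


# Invented sample; the real file has its own columns and values.
sample = "Bin,CumulativeBases\n0,56000000000\n10000,30000000000\n"
print(f"{coverage_from_histogram(sample, 2.4e9):.1f}x")

# With a real assembly directory you would read the file instead, e.g.:
# with open("Binned-ReadLengthHistogram.csv") as f:
#     print(coverage_from_histogram(f.read(), 2.4e9))
```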
If the read length is reduced substantially you may also need to reduce --Align.minAlignedMarkerCount from its default 100 to perhaps 50 to get enough alignments. However, doing that reduces the assembler's ability to resolve repeats. But I would first try just reducing the read length cutoff and leaving this parameter at 100 initially.
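As a rough sketch of how the two settings discussed above could be varied between runs, here is a small helper that builds a command line. The executable name `shasta`, the file name `reads.fasta`, and the helper itself are placeholders, and `--input` is assumed to be the option your Shasta build uses for the reads file; any config file or other options from your original run are left out and would need to be added back.

```python
import shlex


def shasta_command(reads, min_read_length=10000,
                   min_aligned_marker_count=100, executable="shasta"):
    """Build an argument list for one assembly run (hypothetical helper).

    Defaults mirror the values mentioned in this thread: a 10000 base read
    length cutoff and a minimum aligned marker count of 100.
    """
    return [
        executable,
        "--input", reads,
        "--Reads.minReadLength", str(min_read_length),
        "--Align.minAlignedMarkerCount", str(min_aligned_marker_count),
    ]


# First attempt: lower only the read length cutoff, keep the marker count at 100.
print(shlex.join(shasta_command("reads.fasta", min_read_length=3000)))

# Only if alignments are still too sparse: also lower the marker count.
print(shlex.join(shasta_command("reads.fasta", min_read_length=3000,
                                min_aligned_marker_count=50)))
```

The list form can be passed to `subprocess.run` directly once the paths match your setup.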
If you still cannot get a satisfactory assembly please post an update here.
Reducing the read length cutoff to 3 kb and/or the Align.minAlignedMarkerCount from 100 to 50 does not improve contig length.
I am running into a 'read-only filesystem error' that has prevented me from using --memoryMode filesystem and --memoryBacking 2M. I am trying to solve this issue. Would setting these parameters to anonymous and 4K, respectively, be contributing to poor assembly?
No, these parameters will not affect assembly quality, but only performance. It is possible that you are hitting limitations imposed by the length and quality of your reads, but if you want to pursue this further please post the same files I requested earlier, for the assembly with 3 Kb read length cutoff.
I am closing this due to lack of discussion. Feel free to reopen it or create a new issue as needed.
Hi!
I am noticing a pattern of assembled segment lengths maxing out at 2-3 Mb regardless of PacBio SequelI CLR coverage (50-70X).
I am following the suggested parameters in the config file from git. However, I am using the Modal consensus caller, anonymous for memory mode, and 4K for memory backing.
Any suggestions on how to increase the length of assembled segments?
Thank you!