Assembled segment length maxed out at 2-3 Mb

EvoEpi commented 4 years ago

Hi!

I am noticing a pattern of assembled segment lengths maxing out at 2-3 Mb regardless of PacBio Sequel I CLR coverage (50-70X).

I am following the suggested parameters in the config file from git. However, I am using the Modal consensus caller, anonymous memory mode, and 4K memory backing.

Any suggestions on how to increase the length of assembled segments?

Thank you!

paoloczi commented 4 years ago

Did you adjust --Reads.minReadLength to make sure Shasta is using the correct amount of coverage?

Do you know if your assembly is fragmented or messy? That is, do most assembled segments begin/end at a dead end, or at a branch involving other segments? You can find out by looking at the assembly graph using Bandage. The parameter changes to be made will depend on the answer to this question.

It would also help if you could tell us the expected size of the genome you are assembling and post the following files from your assembly directory:

AssemblySummary.html
LowHashBucketHistogram.csv
Binned-ReadLengthHistogram.csv
The entire log output (stdout) from the assembly

It is also possible that you are hitting limitations imposed by the length and quality of your reads, but we will know better once we have the above answers and information.

EvoEpi commented 4 years ago

Thanks for getting back to me!

--Reads.minReadLength was set to 10000.

The input data looks pretty good:

Number reads=8064682
Number of bases sequenced=127375600448
Average read length and stdev=15794 and 15835
N50=30481

I still need to look into Bandage.

Estimated genome size is 2.2-2.5 Gbp.

Here are the requested files: Archive.zip

paoloczi commented 4 years ago

From AssemblySummary.html I see that the Number of raw sequence bases is 56 Gb. And the section entitled Reads discarded on input shows that 72 Gb of coverage were discarded because the reads were too short, that is, shorter than the 10 Kb threshold you used.

Based on an estimated genome size of 2.4 Gb, this assembly is operating at coverage 23x which is too low to give good assembly results - at least with default assembly parameters. We know that at this coverage and using default parameters the assembly is generally fragmented. The contig lengths you are observing are entirely consistent with this.
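As a rough sketch of the arithmetic behind that coverage figure: the numbers are the rounded values quoted in this thread (56 Gb kept, 72 Gb discarded, 2.4 Gb genome), and the function name is only illustrative.

```python
def coverage(bases, genome_size):
    """Return fold coverage: sequenced bases divided by genome size."""
    return bases / genome_size


# Rounded values quoted in this thread (not exact counts).
bases_kept = 56e9        # raw sequence bases used by the assembly
bases_discarded = 72e9   # bases in reads shorter than the 10 Kb cutoff
genome_size = 2.4e9      # estimated genome size (2.2-2.5 Gbp reported)

used = coverage(bases_kept, genome_size)
available = coverage(bases_kept + bases_discarded, genome_size)

print(f"coverage used by the assembly: {used:.1f}x")
print(f"coverage if no reads were discarded: {available:.1f}x")
```

The second figure is an upper bound: it assumes every discarded read would be kept, which only happens with no length cutoff at all.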

So you will need to reduce the read length cutoff, probably down to 3-5 kb. Immediately after the reads load, you can look at line 2 of Binned-ReadLengthHistogram.csv, column CumulativeBases, without waiting for the assembly to complete, to see how much coverage is being used.
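A minimal sketch of that check, assuming the file has a header row that names a CumulativeBases column and that line 2 is the first data row. The sample content below is made up for illustration and is not the real file layout; only the CumulativeBases column name comes from this thread.

```python
import csv
import io


def coverage_in_use(csv_file, genome_size):
    """Read CumulativeBases from the first data row (line 2 of the file)
    and convert it to fold coverage for the given genome size."""
    reader = csv.DictReader(csv_file)
    first_row = next(reader)
    return float(first_row["CumulativeBases"]) / genome_size


# Made-up stand-in for Binned-ReadLengthHistogram.csv. The "Bin" column
# is hypothetical; the real file may contain other columns.
sample = io.StringIO(
    "Bin,CumulativeBases\n"
    "0,56000000000\n"
    "1,40000000000\n"
)

estimate = coverage_in_use(sample, genome_size=2.4e9)
print(f"coverage in use: {estimate:.1f}x")
```

For a real run, open the file from the assembly directory (for example with `open("Binned-ReadLengthHistogram.csv", newline="")`) and pass the file object in place of the sample.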

If the read length is reduced substantially you may also need to reduce --Align.minAlignedMarkerCount from its default 100 to perhaps 50 to get enough alignments. However, doing that reduces the assembler's ability to resolve repeats. But I would first try just reducing the read length cutoff and leaving this parameter at 100 initially.

If you still cannot get a satisfactory assembly please post an update here.

EvoEpi commented 4 years ago

Reducing the read length cutoff to 3 kb and/or reducing --Align.minAlignedMarkerCount from 100 to 50 does not improve contig length.

I am running into a 'read-only filesystem error' that has prevented me from using --memoryMode filesystem and --memoryBacking 2M. I am trying to solve this issue. Would setting these parameters to anonymous and 4K, respectively, be contributing to poor assembly?

paoloczi commented 4 years ago

No, these parameters will not affect assembly quality, but only performance. It is possible that you are hitting limitations imposed by the length and quality of your reads, but if you want to pursue this further please post the same files I requested earlier, for the assembly with 3 Kb read length cutoff.

paoloczi commented 4 years ago

I am closing this due to lack of discussion. Feel free to reopen it or create a new issue as needed.

chanzuckerberg / shasta

Assembled segment length maxed out at 2-3 Mb #153