ekg closed this issue 4 years ago
Assembly parameters are optimized for typical coverage around 60x, and for high coverage you will probably need to adjust some parameters to make things sane. Issue #7 contains a list of parameters that should probably be scaled proportionally to coverage, but so far nobody has confirmed that this prescription works well for high coverage, and I have not had the time to check on that.
Can you attach AssemblySummary.html for your run, and also state the expected size of your genome? Also please specify if you used non-default values for any assembly parameters.
With default options, the number of "good" marker graph vertices should be roughly equal to 1/30 the length of your genome (for example, 1000 vertices for a 30 kb genome). If the number of marker graph vertices is much larger than that, most of your marker graph vertices correspond to errors. Use --MarkerGraph.lowCoverage and --MarkerGraph.highCoverage to control the coverage for which marker graph vertices should be generated. The default (10, 100) is reasonable for coverage 60x, but you should scale those for your expected coverage. If you are using that default and your coverage is much higher, you are getting only marker graph vertices that correspond to errors.
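The proportional scaling suggested in issue #7 can be sketched as a tiny helper. This is hypothetical and not part of Shasta; it just scales the (10, 100) defaults, which are tuned for roughly 60x coverage, linearly with the observed coverage:

```python
# Hypothetical helper (not part of Shasta): scale the default
# marker graph coverage thresholds, tuned for ~60x, proportionally
# to the coverage actually observed in the data.
DEFAULT_COVERAGE = 60
DEFAULT_MIN, DEFAULT_MAX = 10, 100

def scaled_marker_graph_thresholds(observed_coverage):
    """Return (minCoverage, maxCoverage) scaled linearly with coverage."""
    factor = observed_coverage / DEFAULT_COVERAGE
    return round(DEFAULT_MIN * factor), round(DEFAULT_MAX * factor)

# At 60x you get the defaults back; at 600x (high-coverage viral data)
# the same proportion suggests (100, 1000).
print(scaled_marker_graph_thresholds(60))   # (10, 100)
print(scaled_marker_graph_thresholds(600))  # (100, 1000)
```

As noted above, nobody has confirmed that plain linear scaling works well at very high coverage, so treat the output as a starting point for experimentation.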
Do you mean --MarkerGraph.minCoverage and --MarkerGraph.maxCoverage rather than --MarkerGraph.lowCoverageThreshold and --MarkerGraph.highCoverageThreshold?
Here's the summary.html (renamed to .txt) from a run that produced the expected number of output contigs.
Sorry, my comment had a couple of errors:
You are correct, I meant --MarkerGraph.minCoverage and --MarkerGraph.maxCoverage. These parameters control the minimum and maximum coverage for a marker graph vertex. Vertices with coverage outside this range are not generated, the idea being that vertices with very low coverage are the result of errors, and vertices with very high coverage are suspicious and could be the result of unresolved repeats. Each run creates a coverage histogram for the marker graph vertices that are actually created and writes it to MarkerGraphVertexCoverageHistogram.csv. In an ideal situation you would see a wide peak, with the tails cut off under control of those parameters. The other parameters with similar names that you mentioned, --MarkerGraph.lowCoverageThreshold and --MarkerGraph.highCoverageThreshold, control which marker graph edges are subject to removal during transitive reduction. Those might require changes too for your high coverage.
The ratio of 30 I quoted between genome length and number of vertices was incorrect. The correct value is 7. It can be estimated as follows. With the default --Kmers.probability 0.1, there is a marker every 10 RLE bases on average. Each RLE base corresponds to 1.4 "raw" bases on average, so this means a marker every 14 bases. The marker graph contains two strands, so on average each 14 bases will generate 2 vertices, which corresponds to 1 vertex for each 7 bases. As a sanity check, a typical human genome assembly creates around 400 million marker graph vertices. So in your case, if your genome size is 7 kb, you should expect around 1000 marker graph vertices.
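The arithmetic above can be written out directly. This is just a sketch of the estimate, using the same constants stated in the thread (one marker per 10 RLE bases, 1.4 raw bases per RLE base, two strands):

```python
# Back-of-the-envelope vertex count estimate from the reasoning above.
def expected_vertices(genome_size_bases,
                      kmer_probability=0.1,  # default --Kmers.probability
                      raw_per_rle=1.4,       # raw bases per RLE base
                      strands=2):            # marker graph covers both strands
    # One marker per (1/kmer_probability) RLE bases = 14 raw bases,
    # and each 14 raw bases yields one vertex per strand.
    bases_per_marker = (1 / kmer_probability) * raw_per_rle
    return round(genome_size_bases / bases_per_marker * strands)

print(expected_vertices(7_000))   # ~1000 vertices for a 7 kb genome
print(expected_vertices(28_000))  # ~4000 vertices for a 28 kb genome
```

The same formula reproduces the human-scale sanity check: around 430 million vertices for a 3 Gb genome, consistent with the "around 400 million" quoted above.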
From the output you attached, you have about 17000 marker graph vertices. This would be sane for a 120 kb genome. If the 28 kb of assembled sequence is a reasonable estimate of your genome size, you should expect around 4000 vertices, and you have 4 times too many. So you probably need to change the parameters discussed above. Looking at MarkerGraphVertexCoverageHistogram.csv may be useful for that. Adjusting those parameters to more suitable values should improve not only performance but also assembly quality.
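If it helps, here is a minimal sketch of loading that histogram for inspection. The column layout assumed here (one coverage value and one count per line, with a possible header) is a guess on my part, so adjust the parsing to the actual file:

```python
# Sketch of inspecting MarkerGraphVertexCoverageHistogram.csv to help
# choose --MarkerGraph.minCoverage / maxCoverage. The assumed format
# (coverage,count per line, optional header) may differ from reality.
import csv

def load_histogram(path):
    """Return {coverage: vertex_count}, skipping headers/malformed rows."""
    hist = {}
    with open(path) as f:
        for row in csv.reader(f):
            try:
                coverage, count = int(row[0]), int(row[1])
            except (ValueError, IndexError):
                continue  # header line or malformed row
            hist[coverage] = count
    return hist
```

With the histogram in hand, you can look for the wide central peak described above and place the min/max thresholds so they cut off only the low-coverage (error) and high-coverage (repeat) tails.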
Were you able to run a successful assembly at good performance for your high coverage viral data? If so, if it is possible for you to share the options you used, that could be useful information for other users, and I can turn it into a custom configuration file for people to use. Of course, you may prefer to wait for your publication before doing that.
If you were not able to run a successful assembly at good performance, please share what the stumbling blocks are and I can help.
I have been able to develop some assemblies. However, I'm finding it very hard to get stable results if I change my input data set. If the read length changes or the overall depth changes then I find that I need to change multiple parameters to get a similar result.
Part of the problem is that I am trying to build assembly graphs that contain the variation in my sample, not single sequences representing the haploid mode of the data. Given the algorithm's tuning toward generating a single haploid sequence, this seems to be a difficult and somewhat unstable target. I tend to either get a single output sequence or a messy graph. Sometimes changing parameters produces a completely different result, and it can be hard to appreciate the interactions between many of them. Some depend on read length, and some depend on expected depth. It would be great to be able to scale them all together with respect to the given input data and biological setting.
In any case, I'm using this kind of parameter set to decent effect:
shasta --threads $threads \
--Align.minAlignedMarkerCount 10 \
--Reads.minReadLength 3000 \
--Kmers.k 10 \
--Kmers.probability 0.1 \
--MinHash.maxBucketSize 100 \
--MarkerGraph.minCoverage 100 \
--MarkerGraph.maxCoverage 1000000 \
--MarkerGraph.lowCoverageThreshold 10 \
--MarkerGraph.highCoverageThreshold 100 \
--MarkerGraph.edgeMarkerSkipThreshold 100 \
--ReadGraph.maxAlignmentCount 6 \
--ReadGraph.minComponentSize 10 \
--MinHash.minHashIterationCount 20 \
--MarkerGraph.pruneIterationCount 6 \
--input $input \
--assemblyDirectory $x
I'll keep updating this thread as I make progress.
I do have a question, what defines a "bad disjoint set"?
Found 9146 bad disjoint sets with more than one marker on a single read.
2019-Nov-05 16:40:39.626062 Renumbering disjoint sets to remove the bad ones.
I also see this taking a long time:
2019-Nov-05 16:43:50.058850 assembleMarkerGraphVertices begins.
2019-Nov-05 16:43:50.934860 assembleMarkerGraphVertices ends.
2019-Nov-05 16:43:50.934911 assembleMarkerGraphEdges begins.
2019-Nov-05 16:43:50.935314 0/33434
Could it be run in parallel?
Thanks for the update! Yes, I don't have a good solution for highly variable data, and it boils down to manual experimentation like you are doing.
Keep in mind that --ReadGraph.minComponentSize currently has no effect. Sorry this is undocumented.
A "bad disjoint set" is a would-be marker graph vertex that contains more than one marker for the same read, which would cause a spurious loop in the marker graph. This situation happens occasionally due to errors and/or fluky alignments. The code does not generate these bad marker graph vertices because they could cause assembly artifacts.
The assembleMarkerGraphEdges step runs in parallel. It uses a batch size of 10000 marker graph edges, so if your marker graph is small, parallelism goes away. Normally this step is quite fast, but in your case the low performance could be due to your high coverage. I will reduce the batch size in that step. That should have no negative consequences even for large runs, as the start of a new batch happens in a lock-free way and so has very low overhead. I will post here when this change is in, and at that point you can download a new test build from GitHub so you don't have to rebuild it yourself.
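The effect of batch size on parallelism can be illustrated with simple arithmetic (this is an illustration, not Shasta's actual code): work is handed out in whole batches, so the number of threads that can ever be busy is capped by the number of batches:

```python
# Why a large batch size kills parallelism on small inputs:
# each batch is claimed by one thread, so at most
# ceil(total / batch_size) threads ever receive work.
import math

def busy_threads(total_items, batch_size, thread_count):
    batches = math.ceil(total_items / batch_size)
    return min(batches, thread_count)

# With the 33434 marker graph edges shown in the log above:
print(busy_threads(33434, 10000, 16))  # batch size 10000: only 4 threads busy
print(busy_threads(33434, 10, 16))     # batch size 10: all 16 threads busy
```

This is why shrinking the batch size restores parallelism for a small marker graph without hurting large runs, where the number of batches already far exceeds the thread count.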
I reduced the batch size for assembleMarkerGraphEdges from 10000 to 10. You can download a test build with this change by clicking on Artifacts near the top right of this page (https://github.com/chanzuckerberg/shasta/runs/289576882), then selecting the executable or tar file you want. You will have to unzip it. If you are downloading an executable you will also need to set its execute permission bit using chmod.
See https://chanzuckerberg.github.io/shasta/BuildingFromSource.html#DownloadTestBuild for more information on downloading test builds.
Thanks for the patch! I'll pull it in and build it for future tests. The static build will also help on the institutional system I do low-throughput testing on.
I'm finding that flagCrossStrandReadGraphEdges takes most of my assembly time, often even longer than the initial alignment step. It runs in parallel, so it seems that it's just a hard lift. Is it worth trying to reduce the effort spent on this step? Can I set --Assembly.crossEdgeCoverageThreshold 0 to skip it entirely? I find this step takes dramatically longer when I allow shorter reads (1 kb vs 3 kb) with --Reads.minReadLength.
The flagCrossStrandReadGraphEdges step tries to eliminate contacts between strands in the read graph. This is probably not an essential step for your data, and could be skipped. This step is normally not a performance bottleneck, but it was never tested at extremely high coverage like in your case.
However, I was disappointed to look at the code and realize that there is no way to suppress that step.
The option you mentioned, --Assembly.crossEdgeCoverageThreshold, controls a similar step for the assembly graph, but that code is under development and is currently never called unless you use experimental assembly options.
I will add an option today to suppress the flagCrossStrandReadGraphEdges step. In the meantime, if you don't mind rebuilding the code, you can do that yourself with a small change in src/AssemblerReadGraph.cpp, function void Assembler::flagCrossStrandReadGraphEdges(size_t threadCount), near line 941. You just need to add a return before the comment line // Store the maximum distance so all threads can see it.
It seems that 95% of my runtime on my (viral) application is in this step. The number of vertices is pretty small usually (thousands). But it can take tens of minutes. Coverage is high, so perhaps this is a problem.
Is there any way to speed it up? It runs in serial in my case. Is that expected? Happy to share my config files if you need to see what I'm doing.