Optimizing Shasta for short read lengths (<10kb)

000generic commented 5 years ago

Hi!

I'm super impressed with the speed of Shasta - and I'm taking advantage of its speed to explore parameters to optimize/maximize the size of genome assembly and longest assembled segment for ONT and PacBio data sets having read lengths skewed to <10kb in different marine invertebrates.

For ONT:

I'm wondering if you could recommend Shasta parameters to explore for working with ONT data sets that are skewed to shorter read lengths of 500-10000 bp.

I am assembling ~60x PromethION (~15 million) reads of an estimated 2-2.2 Gb pygmy squid (Idiosepius paradoxus) genome. Unfortunately, sequencing ran into pore clogging issues unless the DNA was sheared and size selected - so our reads are mostly smaller than 10 kb (~65% reads and ~33% bases) and even under 1 kb (~20% reads and ~0.5% bases). I read that default settings in Shasta are optimized for read lengths >10kb, so I'm interested in exploring different parameters to optimize Shasta for my data set.

To begin with I tested minimum read length cutoffs (--Reads.minReadLength) from 500-10000 bps. The results were interesting, in that having more short reads didn't always help - and max lengths in genome assembly and in longest assembled segment occurred at a cutoff of 6000 bp (see below). This is promising but at a 6000 bp cutoff over 45% of the reads are rejected - so I'd like to optimize other parameters to see if I can improve things and use more of the reads.

Also - here is the Assembly Summary for a 1000 bp cutoff, as an example.

Given my ONT read lengths skewed to smaller sizes, I'm unsure what additional Shasta parameters I should be exploring and was wondering if you have any suggestions.

Also, are there reasons to flat out reject short ONT reads - for instance, I'm wondering if they generally have even higher rates of sequencing error - or for other reasons are generally viewed as poor quality and to be avoided in assembly?

For PacBio:

I saw that use of Shasta with PacBio was discussed and closed for now in Issue #56 . I can move my PacBio questions to that thread, no problem - just let me know.

I'm preparing to do Shasta assembly of deep PacBio sequencing of a ctenophore (Bolinopsis species) with an estimated genome size of 200-300 Mb. Like my ONT data set above, the PacBio data skew to under 10 kb - at least in comparison to a typical human ONT dataset that Shasta is optimized for.

I plan to follow Issue #56 recommendations to explore read length and kmer length - and use the Modal consensus caller for repeat counts. Are there other parameters in Shasta that might be useful to explore, given PacBio datasets - and/or do you now have recommended Shasta settings for PacBio?

I'm happy to update on parameter exploration as things progress on both the ONT and PacBio data sets above, if there is interest.

paoloczi commented 5 years ago

Finding parameter sets that work well in the usage cases you described would be very useful, and I would ask that, if successful, you please report your findings so others can benefit (I could add to the repository configuration files optimized for these and other use cases). On my side, I will do my best to try and help with your experimentation.

Here are my thoughts:

For ONT: assembly parameter --Align.minAlignedMarkerCount controls the minimum number of aligned markers for an alignment to be used in the read graph. The default is 100 markers which typically corresponds to roughly 3 aligned Kb. This makes reads below 3 Kb completely unusable (even though the summary file will say they were used in the assembly!). And even reads in the range 3 to 6 Kb are only used if another read is found that overlaps them for at least 3 Kb (and remember the MinHash procedure does not guarantee that all overlapping pairs of reads will be found). Bottom line, I suggest that you decrease --Align.minAlignedMarkerCount to 50 markers or even less. Of course, the more you decrease it, the more you increase the possibility of incorrect assembly errors, so there is a trade-off there. Additional parameter changes may be necessary, but this is probably the one that affects you the most.
For PacBio: as an initial experiment, I would suggest the same two parameter changes (reducing --Reads.minReadLength and --Align.minAlignedMarkerCount. If you could report what you get with that, I may be able to give additional suggestions. This assumes that you are again at coverage 60x or so. If your coverage is lower, please see issue #7 for a suggested list of parameters to be scaled proportionally to coverage.

If you are using a recent version of Shasta, I also suggest looking at output file ReadSummary.csv in the assembly directory. It contains one line of information for each read, and some of that information might give you insight on how the reads are being used in the assembly. If you want to dig in more detail, you can also use --command explore to look at an assembly. See here for information on how to do that.

000generic commented 5 years ago

Lots of great advice - thank-you!

Over the weekend - based on patterns of longest assembled transcript per read length cutoff - I was thinking to split the reads at 4000 or 5000 bps and explore their assembly parameters independently - and then merge with QuickMerge - or even then run Shasta on the two assemblies plus reads.

Given this, one idea now is that in addition to testing the full set of reads, I can try splitting the reads and explore the --Align.minAlignedMarkerCount parameter that you suggest. My hope in this case would be that splitting the reads across a read length threshold / more generally tailoring the aligned marker count to read length can minimize the assembly errors and improve overall assembly.

Along these lines, if splitting reads for parallel subassemblies ends up being useful - I wonder - is it possible to use specific aligned marker count settings for specific read lengths - or even determine settings for optimal marker counts per read length per data set (for instance, based on global error rates of a given data set - if that makes sense - I don't fully understand the algorithm side of things yet) - allowing datasets to be inherently or internally optimized for aligned marker counts per read length?

A limitation in splitting the data in half for assembly is that coverage for either data set would be around 30x - but I can also follow low coverage suggestions in Issue #7 - thanks for pointing it out!

At least for some of the ctenophore species - maybe not the one I mentioned (have to check) - we have PacBio in triple-digit coverage. Is it best to keep things under some set amount of max coverage - or if we have 200x or more, use it? I saw the scaling per 60x defaults suggestion for high levels of coverage in Issue #7 and can follow it - but if there are new ideas/guidelines would be great to hear them.

I'll update, as things progress....!

paoloczi commented 5 years ago

I suspect splitting the reads is not a good idea. I would expect that it is best to present all reads to the assembler at the same time.

Keep in mind that the suggestions in issue #7 are tentative - nobody has ever verified that they are useful. So if I were you I would stay at 60x and not split the reads.

Regarding PacBio, I would also stay at 60x at least initially. The extra coverage can do more damage than good if the assembly parameters are not adjusted. You have two things to adjust for: the short read length and the extra coverage. It's best to work on one at a time...

000generic commented 5 years ago

I'm further testing different ideas to optimize Shasta for my data set / to specific data sets in general. These approaches may or may not make sense/be a good idea. But its been great so far in getting me into the weeds for working with Shasta - and has lead to the Chop-Combo pipeline I'm now exploring:

1) Optimize: Given my read N50 is short around 12 kb, I'm doing Shasta assemblies over parameter ranges of minimum read length (500, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000) x aligned marker count (10, 25, 50 100) values - and found what looks to be an optimum that maximizes assembly and assembled segment lengths for my data set (6000 x 10).

2) Chop: Although my raw reads are at ~60x coverage, as recommend for Shasta under default settings - I'm coming to recognized that functional coverage can be much lower due to minimal read length cutoffs and aligned marker count default values that are high relative to the many short read lengths (<10 kb) that are the majority in my data set.

Also, when I optimized minimum read length and aligned marker count for assembly of my full-length reads, I was surprised that rejecting reads less than 6 kb improved assembly length metrics over including large amounts of smaller reads for assembly. It seems a shame to not put to use 20-40% of my reads. So I wondered if Shasta might be optimized for short vs long read lengths.

For instance, small aligned marker counts in Shasta could pull in many smaller reads but I wondered if it might also cause increased misassemblies in longer reads - or if assembly interactions of longer reads with shorter reads might hinder shorter read assembly abilities under shorter vs longer read-length optimized parameter settings - or other assembly performance/quality issues. If so, it could be worth forcing the data set to be short or long in read length - and then optimize Shasta assembly parameters.

To force the data to be short, I can chop long reads into smaller pieces prior to assembling, such that all reads are <=10000, <=5000, <=1000, or <=500 - and then optimize Shasta assembly using the ranges of minimum read length and aligned marker count settings above. This is still ongoing and there are interesting shifts in the assembly length metrics that make it seem like there are distinct regions being assembled (misassembled?) relative to the full-length read assemblies.

Chop is an alternative to Split - which was my previous idea mentioned above to split the read data set based on size prior to assembly to enable separate short vs long read length assemblies. But like you pointed out, this would mean not all read/sequence data is being brought to bear on assembly - and splitting up the read data set into subsets would greatly change coverage, which could be detrimentral to Shasta performance, as its optimized for 60x coverage. Unlike Split, Chop lets me maintain coverage and bring potentially most/all sequences to bear on assembly but still tailor/optimize assemblies to read length. It has a loss in read contiguity but that is hopefully where Combo comes in.

3) Combo: Combining ( full-length reads ) + optimal ( full-length-read ) or ( chopped-read ) assemblies

To leverage gains in the short/chopped read assemblies with the longer reads - and to optimize Shasta for longer reads, I thought to combine all reads plus the optimal chopped reads assembly. Basically, determine optimal settings for short reads to get an assembly and then use that assembly in combination with the full read set - and optimize Shasta assembly per above. This is still ongoing but already some of the Chop-Combo assemblies have outperformed all assemblies to date in steps 1) and 2) - so it looks promising:

As you can see, longest assembled segment length and assembly N50 are increased - both are more than double that of default settings - while number of assembled segments is reduced by a third - all potential improvements over the default assembly and over all optimized parameter assemblies from Step 1) (not shown). Total assembly size increased only a small amount - but the above internals of the assembly seem to be much improved - though still far from chromosome scale.

I am wondering - are there are other parameters you would suggest I explore for optimization in this context?

One of my concerns in assembling with previously assembled segments as part of the read data set is that assembly artifacts might be amplified/increased - although if this were only a small amount, the other gains might make it worth it. In the case of pygmy squid, there is no reference for me to evaluate misassemblies and other artifacts against. So I am thinking to mirror my pygmy squid read length distribution into a human data set - and then run the Chop-Combo pipeline on human to assess things. I was thinking it would be great to follow the QC workflow you laid out in the BioRxiv paper for Shasta - and was wondering if you could make one or more of the ONT read data sets available for me to work with. No problem if not - I realize it is still pre-publication etc.

I'm not sure where all this is going - but maybe it will prove to be useful. Certainly, it looks like exploring the parameter space for a given data set pays off - and is reasonable to do, given how fast Shasta is. And maybe Chop and/or Combo is a good idea - or can be worked into one. Once I finish up in a few days, I can post a full characterization of all the assemblies based on the AssemblySummary file - plus, using NG50 instead of N50 to facilitate comparisons across assemblies.

Any ideas or suggestions would be greatly appreciated :)

paoloczi commented 5 years ago

Thank you for this summary. You are right that reducing the minimum number of markers in an alignment can result in assembly errors. There is definitely a trade off there, and the optimal point can depend on the amount and characteristics of repeats present in the genome being assembled.

I don't like the idea of splitting (or "chopping") reads because it results in loss of information. But if I understand correctly you are proposing that only for the purpose of investigating optimal parameters under different conditions?

Also, ideally you would want to optimize not just for assembly contiguity, but also take other accuracy metrics into consideration. It's very easy to create an assembly with a large N50 (e. g. you can just concatenate all of your assembled segments and get a very large N50!), but it is harder to get a large N50 while avoiding assembly errors.

You could also look at gfa output, for example using Bandage, to get an idea of what the assembly graph looks like. If the graph is fragmented, the N50 is limited by coverage and/or connectivity. If the graph is messy, the N50 is limited by your ability to resolve repeats. Each of these two situations requires a different strategy.

ekg commented 5 years ago

During processing of the read graph, aren't reads that are contained within longer ones collapsed into the longer ones?

This seems to make sense for getting a longer assembly in the case of a low-diversity sample. But it seems that it might cause information loss where the shorter reads have variation that's important, or if they are simply at a much higher depth than the long reads they're collapsed into.

Maybe a parameter to only induce this collapse when the reads are shorter than a certain length could be used to mitigate this effect. It would be preferable to control it at that level than by removing alignments from the input or chopping them up.

I could be confused if this is what is happening, but I'm running into similar issues in the phage data that we've been discussing in #70. The next thing I was going to try was to remove long reads from my assemblies. Before doing that, I looked at the other performance related threads in your issue tracker and found this one.

paoloczi commented 5 years ago

Reads that are contained in a longer read are not discarded, precisely for the reasons you mentioned. Creation of the read graph is currently simplistic: for each read I just keep the --ReadGraph.maxAlignmentCount best alignments (default 6), as measured by the number of aligned markers.

I hope to make substantial improvements in the creation of the read graph soon.

I suggest you use the Shasta http server to visualize portions of the read graph. See here for more information. This will probably give you a better idea of what is going on.

ekg commented 5 years ago

Reads that are contained in a longer read are not discarded, precisely for the reasons you mentioned.

Sorry, I thought I had read this in your documentation, but I can't find what made me think this. In any case, good.

for each read I just keep the --ReadGraph.maxAlignmentCount best alignments (default 6), as measured by the number of aligned markers.

Would this pull together sets of long reads while encouraging shorter ones to align together, possibly in smaller isolated subgraphs?

I might not have the right intuition, so I'll use the shasta server to check out the marker graph and read graph on one of my test cases.

paoloczi commented 5 years ago

The current technique works reasonably well for human assemblies at typical coverage, except in centromeres and in many segmental duplications. But I can easily imagine that it can have shortcomings in your situation. Hopefully there will be a new and better approach to read graph creation soon.

Exploring the assembly using the http server should help. Let me know if you have questions about that because that area of functionality is not well documented. It is actually possible to start from a location of assembled sequence and track it down to the originating location in the marker graph as well as the read graph.

If you are seeing isolated subgraphs in the read graph, it is possible that your MinHash parameters are not adequate. The http server will help you look at this too because it has a feature that allows you to compute alignments of a chosen read with all other reads. It is slow but useful.

ekg commented 5 years ago

Exploring the assembly using the http server should help. Let me know if you have questions about that because that area of functionality is not well documented.

Continued in #73.

paoloczi commented 4 years ago

I added some configuration files for both ONT and PacBio reads. See the conf directory. Make sure to read the comments in the configuration files before using them.

chanzuckerberg / shasta

Optimizing Shasta for short read lengths (<10kb) #61