bcgsc / abyss

:microscope: Assemble large genomes using short reads
http://www.bcgsc.ca/platform/bioinfo/software/abyss
Other
310 stars 108 forks source link

exome parameterization question #365

Closed devonorourke closed 3 years ago

devonorourke commented 3 years ago

Hi all,

I've seen a few whispers on the ABySS Google Groups pages indicating that it's possible to use Abyss for de novo assembly of whole exome (not genome) data. One thing Shaun has replied to a few times (back in 2010 and 2012!) was something along the lines of:

For transcriptome or exome capture data (or other non-genomic data sets), it is good to specify E=0

or

Yes, I have used ABySS to assembly exome data. I'd recommend setting the parameter E=0 to avoid trimming too much from the ends of exons.

So, at least for earlier Abyss versions, one thing that Shaun recommends is setting E to 0. This got me to wondering the following:

Thanks for any tips and tricks you can offer!

vlad0x00 commented 3 years ago

Hi Devon,

If you're using ABySS2, i.e. the Bloom filter version (enabled by passing the -B parameter) which is more recent and the one we recommend, the -E parameter is not used (@sjackman @jowong4 please correct me if I'm wrong here), so you don't have to worry about it.

As for what parameters to adjust, in addition to -k, we also standardly try a couple of --kc parameter values, usually between 1 and 4. This parameter determines the minimum k-mer multiplicity for assembly in order to get rid of erroneous k-mers.

And since I'm recommending the use of Bloom filter version, you'd also need to determine the correct Bloom filter size (-B parameter). This generally depends on how many reads you have (or what genome you're assembling, since those two generally correlate). For a ~3Gbp H. sapiens assembly, a 40GiB Bloom filter is common, i.e. set -B40G. For a ~20Gbp spruce genome assembly, we use -B500G whereas for smaller genomes like C. elegans perhaps as low as -B2G.

Hope this helps, :-) Cheers

devonorourke commented 3 years ago

Great tips @schutzekatze , thank you very much for the quick response.

A couple of quick follow ups:

Now, one final question that I'm happy to split into a separate issue, but it still falls into this topic of exome-specific data. As an example, say I had a bunch of 150 bp PE Illumina data from a human exome. I've tried to (1) use ABySS2 to create de novo assembled contigs, and additionally, (2) I've mapped those Illumina reads directly to the human reference assembly too. I've been struggling to understand why ABySS might be producing contigs that are slighly truncated in exonic regions where I have loads of reads that appear to map directly where those ABySS contig gaps persist. I keep thinking there is some sort of parameter I'm not setting properly. Here's a visual of what I mean - if I map the contigs to the human reference, I see instances where there's a contig that lines up exactly where it should, except there's a chunk of the expected exon missing from the ABySS-assembled contig (represented by ---):

h38ref: GCTGAGACTTCCTGGACGGGGGACA
contig: --------TTCCTGGACGGGGGACA

However, when I look at the pileup of reads that ABySS was supplied with, but instead those reads were mapped directly to the reference, it's clear that they span the reference:

h38ref: GCTGAGACTTCCTGGACGGGGGACA
pileup: GCTGAGACTT
             GACTTCCTGG
                  CCTGGACGGG
                       ACGGGGGACA

Maybe I'm thinking about this entirely wrong, as I'm framing my expectation in an overlap-consensus mindset, whereas ABySS is assembling contigs from a graph instead. Nevertheless, it would be wonderful to have any pointers on how to troubleshoot and improve ABySS exome assemblies in situations where direct mapping of reads suggest a contig should have been extended.

Appreciate any and all further tips you can offer!

vlad0x00 commented 3 years ago

You're right, I should've said that my tips are for genome assembly. I am not confident in whether they'd be appropriate for exome assembly and perhaps @sjackman might have more information.

Regarding the issue of truncated contigs, the first thing that comes to mind would be the --kc parameter, as it would filter out some k-mers. Did you try --kc 1? The default is --kc 2. Setting it to 1 would preserve all the k-mers.

Another issue that I can think of would be if the truncated region is a repeat, making a branching point in the assembly graph. If the assembler is unsure which path is correct, the contig would stop at the branching point.

devonorourke commented 3 years ago

Thanks once again @schutzekatze for the clarifications,

The first batch of tests of these exome data used ABySS2 (v.2.1.5) without a bloom filter, nor did it specify any --kc argument. However, in addition to a range of k values, the original tests did explore a range of l (40, 50, 60, 70) and s (800, 900, 1000, 1100, 1200) parameters.

I'll certainly explore lowering the --kc value next. If you or any other developers have further recommendations for parameterizing/testing exome data specifically, I'd greatly appreciate it.

Cheers

sjackman commented 3 years ago

I'm happy to steer clear of the -E parameter if you recommend invoking the Bloom filter. I wasn't sure whether these were mutually exclusive or not. Why not set that E value to zero with exome data?

When using the Bloom filter mode, setting E will cause an error abyss-bloom-dbg: invalid option -- E.

When assembling exome data and not using Bloom filter mode, set E=0 to avoid losing sequence at the ends of exons.

devonorourke commented 3 years ago

Thanks for the update @sjackman.

I'll launch some jobs with and without invoking the Bloom filter mode. Not sure whether you'd recommend the Bloom filter for exome data (or not)?

Another parameter I've been thinking that might be of use to adjust would be s. My sense was that unitig extension set to something like s=1000 was too long for target capture data like whole exome sequencing. Maybe I'm very mistaken...

Thanks for any other tips you can offer, and thanks to all the developers for maintaining a great piece of software.

sjackman commented 3 years ago

Another parameter I've been thinking that might be of use to adjust would be s. My sense was that unitig extension set to something like s=1000 was too long for target capture data like whole exome sequencing.

I agree that reducing s makes sense to me. A side effect may be that contig generation could be seeded off of polymorphic contigs, so you could end up with the two copies of the exon (or possibly more in unexpected circumstances), one from each haplotype. That may be a feature! depending on your downstream analysis. Just something to keep in mind.

csgenomics-admin commented 3 years ago

Hi @sjackman and @schutzekatze ,

I've finished up with a few tests exploring the effects of the s and E parameters, as well as applying a Bloom filter (or not) on this exome dataset. The original ABySS assembly used a parameter of s=800, and neither deployed a Bloom filter, nor did it set any E value. However, the new assemblies set s to either 200 or 500; and specified either E=0 without a Bloom filter, or did specify a Bloom filter with kc=1 or kc=2 (without specifying E). One thing that I can confirm is that I consistently see an increased contiguity in the new assemblies in exon regions - so it sounds like your suggestions were on point (as expected!) - thanks! However, there were some behaviors that arose that I'd like to get your feedback on if possible.

In the screenshot below, I mapped the raw reads used in these ABySS assemblies to a related species, and the pileup on top (red arrow) reflects the coverage depth. I then mapped each of the scaffolds from the various ABySS assemblies to this same reference genome. The original assembly often produced scaffolds (indicated by the black arrow), where the read pileup on top extends to the left, but the original scaffold doesn't extend across all the region where it seems like there should be sufficient read coverage to generate a longer scaffold. After running ABySS with the various parameters mentioned above, it seems like that region gets filled in, though different parameters seem to break the scaffold apart in some assemblies where the Bloom filter was applied:

abyss_dev_q1_paramMods

Any thoughts on why applying a Bloom filter breaks these scaffolds up, when an assembly without a Bloom filter used retains just a single scaffold in this region?

After a bit of pondering why ABySS would generate scaffolds that overlapped, I started to look for cases where an exon region had >2 scaffolds overlapping. I found some extreme cases, but it's worth noting that they are rare: I have something like ~200,000 distinct exon regions, and am finding just ~1,000 instances where there are at least 4 scaffolds that overlap somewhere within an exon. I figured ABySS would have joined these scaffolds together and am stuck wondering why it wouldn't. It almost always seems to be the case that exon regions with many overlapping scaffolds tend to have extremely high read depth: something like 500-1000X coverage of reads mapped to the region of interest. Additionally, the overlapping scaffolds are almost always quite short - less than 100 bp. Naturally, you might toss out these super short scaffolds in a whole genome assembly, but if I was to filter out these short scaffolds across the entire assembly, I might be missing out on many good scaffolds that don't have these extreme overlap scenarios. If only I had my hands on some longer reads to anchor these scaffolds to...

I'm curious if you've noticed this kind of behavior before? Perhaps it might be related to any one of the particular parameters I'm invoking when creating these assemblies. Here's a screenshot below for one the kind of situation I'm discussing. Just to reiterate this - you're looking at a set of aligned scaffolds below, not a set of aligned reads:

abyss_dev_q2_paramMods

sjackman commented 3 years ago

I wonder whether it'd be worth giving RNA-Bloom a shot on this data. Perhaps it could better handle the variation in depth of coverage over the length of the capture region, and variation in depth of coverage between capture regions. https://github.com/bcgsc/RNA-Bloom#readme

sjackman commented 3 years ago

In the first figure, the two Bloom filter assemblies with kc=2 look strictly better than the original s=200 assembly. The primary scaffold is longer, and it has one additional shorter scaffold that was not assembled in the original assembly. The two scaffolds are not merged due to that indel in the overlap region.

csgenomics-admin commented 3 years ago

Oh come on @sjackman - here I am finally getting somewhere with improved contiguity and you're throwing an entirely new program at at me! :)

Regarding RNA-Bloom, I take it the exome dataset would fall into the category of paired-end bulk RNA-seq agnostic, correct?

One other question: the kc=1 assemblies have two scaffolds that overlap without any indels that remain separated, but in the kc=2 assemblies they are unified into a single scaffold. Am I thinking of this backwards? - doesn't increasing the integer for the kc parameterincrease the required minimum kmer count? I'd have thought that as that value increased from 1 to 2, I'd get the opposite effect of what I'm seeing here (that is, the kc=1 value would require a lower kmer multiplicity and produce more contiguous scaffolds than kc=2).

Thanks again

sjackman commented 3 years ago

Oh come on @sjackman - here I am finally getting somewhere with improved contiguity and you're throwing an entirely new program at at me! :)

Hah! There is often an overabundance of tool options in bioinformatics.

Regarding RNA-Bloom, I take it the exome dataset would fall into the category of paired-end bulk RNA-seq agnostic, correct?

Sounds right to me. Ka Ming (@kmnip) could comment on whether he's tested RNA-Bloom with exome sequencing data.

One other question: the kc=1 assemblies have two scaffolds that overlap without any indels that remain separated, but in the kc=2 assemblies they are unified into a single scaffold. Am I thinking of this backwards? - doesn't increasing the integer for the kc parameterincrease the required minimum kmer count? I'd have thought that as that value increased from 1 to 2, I'd get the opposite effect of what I'm seeing here (that is, the kc=1 value would require a lower kmer multiplicity and produce more contiguous scaffolds than kc=2).

You may have to really go digging, but my expectation is that you'll find a small (~ 2*k-1 bp) contig that also aligns in that overlap region that shows a variant nucleotide. That k-mer is present and prevents unambiguous contig elongation at kc=1, but gets removed at kc=2, so contig elongation can proceed unambiguously.

kmnip commented 3 years ago

Regarding RNA-Bloom, I take it the exome dataset would fall into the category of paired-end bulk RNA-seq agnostic, correct?

@csgenomics-admin Yes, that is the correct option for RNA-Bloom. I haven't used it for exome before, but you can create an issue there if you run in any issues.

I think you could possibly use an even smaller k and a much lower s in ABySS, e.g. k=25 s=25.

csgenomics-admin commented 3 years ago

Thanks @kmnip , Just to clarify - would you similarly recommend specifying a lower k parameter in RNA-Bloom also (ex. k=25)? And secondly, I don't believe there is an equivalent s parameter in RNA-Bloom that exists in ABySS, correct? My thinking was that perhaps I would lower the -length parameter to something like 100, but it's my understanding that the length parameter is about filtering out transcripts in the output, and not part of any unitig joining action during the assembly itself. Cheers!

kmnip commented 3 years ago

RNA-Bloom's default k-mer size is 25 and it does not have an s parameter. So, you can just use the default settings. And, you are correct about the length cutoff. which is used to filter the final set of output sequences.

csgenomics-admin commented 3 years ago

Hi folks,

Circling back to one of the images shared earlier in this post: image

In that image, I'm showing an alignment of the ABySS scaffolds to a reference assembly of a related organism. The original questions in this thread were motivated to understand what parameters may help extend these scaffolds given that I suspected I had sufficient read coverage where scaffold ends were getting clipped.

However, in looking through more of the various parts of the ABySS workflow, I'm started to wonder why there were so many scaffolds stacked retained in these overlapping positions. For example, let's say I have two scaffolds who have identical sequence lengths (say 100bp) and nucleotide sequence composition, differing at just one position, like this:

seq1   ...................................................G................................................
seq2   ...................................................A................................................

I would have thought that the PathConsensus would merge these? They have more than the default 90% identity. I'm wondering if perhaps there are situations like this above where there are perhaps very many branches in the graph, and the --branches parameter needs to be increased? Or, perhaps the --dist-error value needs to be modified beyond the default value.

In particular, I'm wondering if I've missed something in the documentation where the merging is restricted to branches of a particular minimum length? In my de novo assemblies, after aligning to a related reference assembly, I see instances of short (60-150 bp) contigs that pileup neatly over each other, with very few substitutions or indels. I figured ABySS would have merged these, and am still scratching my head why it won't.

Thanks again for any tips you can offer, Devon

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your interest in ABySS!