RWilton / Arioc

Arioc: GPU-accelerated DNA short-read alignment
BSD 3-Clause "New" or "Revised" License
55 stars 8 forks source link

Disabling softclipping #34

Open jarrettrios opened 2 weeks ago

jarrettrios commented 2 weeks ago

Hello again! I was wondering if there was a way to disable soft clipping for alignment, or heavily penalize its use.

Im trying to find the differences in alignments between Arioc and Bowtie2, so that I can pitch its use to my team, but Arioc likes to softclip in our variable regions with the default recommended settings. I can give additional points awarded for correct matches to counteract this, but it ends up with very different results from how we use Bowtie2 (and from what I expect to see) and it adds a confounding variable to measuring how its performing.

We currently use global alignment with bowtie2 (Since our reference genomes and reads are almost the same length) and we utilize its ability to not penalize "N"s in the reference genome. If that's all out of scope I understand and will try and narrow down a scoring system that gets the results we expect to see!

jarrettrios commented 2 weeks ago

I would like to retract my statement about not with alignments not being what I expected to see, I am now sure that the issue is that our variant calling is not handling the soft clips. That would make that part my fault!

RWilton commented 1 week ago

Ok. It's good that you have resolved this problem.

Part of short-read DNA alignment folklore is that end-to-end alignment is superior in some way to soft-clipped alignment. This notion may stem from protein sequence alignment where related amino acid sequences are assumed to have comparable length and functionality. But these assumptions do not apply when mapping short DNA sequences to a reference genome. (Dan Gusfield's "Algorithms on Strings, Trees, and Sequences" describes this in more detail if you are interested.)

In fact, constraining the first and last bases in every read to map to a corresponding base in the reference genome degrades the mapping of the remaining bases in the read. Unless the entire read maps perfectly, the alignment algorithm introduces gaps and/or mismatches in order to comply with the end-to-end constraint.

Besides, even if you think you are excluding a few soft-clipped bases from the ends of some mappings, there should be plenty of other reads that cover those soft-clipped reference-genome positions (assuming, of course, that you have sufficient coverage of the reference genome).

And no, soft clipping cannot malignly force reads to map to the wrong place in a reference genome. That's not how DNA sequence alignment works. Sure, it's possible for a read sequence to be degenerate (or perhaps to contain significant structural variation) so that its highest-scoring mapping involves an unacceptably high number of soft-clipped bases, but AS and MAPQ filtering of your alignment results reliably excludes such mappings. (Felix Kreuger digs into this in some detail here).

In any event, variant callers rely only on POS and MAPQ to identify loci of potential variation and then do their own read (re)alignments to characterize variation at those loci, so the only aligner mappings you really care about are those that aren't mapped to the reference genome location you expect. I would be surprised to see significant differences between BWA, Bowtie 2, and Arioc in this regard, since we looked pretty closely at that just last year.

jarrettrios commented 1 week ago

Everything you said makes sense, though our pipeline is pretty different from others as our reference genome is only mildly larger than the reads themselves, and we know what each of the reads should generally be. We basically only amplify a very specific region of the genome of interest, and then use the sequencer to count how many times a variant of interest is seen. Essentially every reads is considered a variant that passed a given assay successfully (We're looking at proportions of reads to the overall total since we have to normalize).

We don't actually pay attention to the MAPQ score at all, we just take the best alignment and make sure there aren't any indels or low quality reads (Q-score from the sequencer). The problem is that if our region of variation gets soft-clipped, we basically lose the entire reads and that can obviously change the outcomes.

I was able to get around this mostly, by just upping the scoring of matches to far outweigh the penalties of a mis-match, but it would be easier to convince my team that we're getting the same alignments, just faster, if we could use the same scoring penalties/bonuses as the default settings. To do that we'd need to not penalize mismatches of against the N's in our sequences.

I do want to say that if it is not trivial at all, please just feel free to not worry about it! The results are still very close to what we were seeing before and the speedups are great even just using 1 GPU. I think that alone will be enough to get people to swap aligners in our pipeline!

RWilton commented 1 week ago

I agree that your pipeline is unusual. Each of the recent problems you have reported have stemmed from the use of very short reference sequences and very short reads. These are unusual cases in the broader context of DNA short-read alignment -- so thank you again for helping to make Arioc handle these situations correctly.

A few comments:

-- I think you really ought to consider using MAPQ to filter your results. Each of the variant callers we have looked at (GATK HaplotypeCaller, DeepVariant, FreeBayes) uses MAPQ to filter input and/or considers it when modeling variant call quality. Also, there can be an "impedence mismatch" between MAPQ values reported by an aligner and MAPQ values recognized by a variant caller. This is especially true with Bowtie, which uses a different range of MAPQ values than HaplotypeCaller (for example) expects by default.

Furthermore, by ignoring MAPQ you risk using reads whose sequences map ambiguously, i.e., equally well in two or more different places in your genome (or to two or more different reference sequences). Your results will then depend on the read aligner's arbitrary choice of which equivalent mapping to report as the primary or "best" alignment.

I'm not sure how BWA behaves, but Arioc tends to pile such mappings up in the same place whereas Bowtie distributes them randomly. (I think. I'm not 100% sure.) There's a good argument for either behavior -- but the point is that such mappings are untrustworthy in the first place because the aligner cannot unambiguously determine which POS to report.

By definition, the way to identify such mappings is to use MAPQ.

-- Soft-clipping by the aligner is recognized by variant callers like HaplotypeCaller and DeepVariant in the process of identifying regions of interest. What matters is that the read aligner report mappings whose POS is near those regions of interest, so that the variant caller can determine where to look for variation in the first place. You can control a variant caller's handling of soft-clipped bases to some extent (e.g., --dont-use-soft-clipped-bases in HaplotypeCaller), but I would experiment with this to be sure that the results are what you expect them to be.

-- You can indeed affect alignment scores (AS) by implementing a separate mismatch score for N in a read sequence. But it's hard to see how that will affect the aligner's ability to report POS correctly (unless, of course, there are so many Ns that the AS is too low to be usable). This is because the aligner uses only the non-N part of a read sequence to search for candidate mapping locations. The alignment scoring scheme does not affect this "seeding" part of the alignment process.

I agree that your somewhat unusual read and reference sequences are going to require you to override the "default" behaviors of your read aligner and your variant caller. These tools are tuned to handle longer short-read sequences and to genomes where the reference sequence is much longer than the reads. A bit of careful parameterization now may save you a lot of aggravation in the long run!