Bioconductor / pwalign

Perform pairwise sequence alignments
1 stars 0 forks source link

Unintuitive results with N's in pairwiseAlignment #6

Open LTLA opened 6 years ago

LTLA commented 6 years ago

The default behaviour of pairwiseAlignment is to perform a quality-weighted alignment, even in the absence of any quality scores in the two input sequences. This is normally fine, as sequences without qualities are given a constant quality of 22L across all bases, and presumably this is just as reasonable as using arbitrary match/mismatch scores in nucleotideSubstitutionMatrix.

However, the use of a high constant quality has odd effects when ambiguous N's are present. Consider:

library(Biostrings)
pairwiseAlignment(DNAString("TTTTT"), DNAString("NNNNN"))

... which gives a score of -29.5 (currently testing on version 2.46.0). This is an unusually low score given that I would consider there to be no mismatches at all - a score near zero would be more appropriate. Indeed, using a nucleotideSubstitutionMatrix gives me something a lot more sensible:

pairwiseAlignment(DNAString("TTTTT"), DNAString("NNNNN"),
     substitutionMatrix=nucleotideSubstitutionMatrix())
# score of 1.25

In short; is the default quality score choice of 22L appropriate for N's? Hacking around to assign low qualities to N's also gives something more reasonable than the default:

pairwiseAlignment(DNAString("TTTTT"), DNAString("NNNNN"), 
   subjectQuality=PhredQuality(0L))
# score of 2.06

... though obviously this would require more work when only a few bases are N.

hpages commented 6 years ago

I don't think that pairwiseAlignment() has built-in capabilities for handling IUPAC ambiguities out-of-the-box. My understanding is that, if no special arrangements are made, N is simply treated as any other letter. This means that the T / N substitution has the same cost as the T / A substitution:

> library(Biostrings)
> pairwiseAlignment(DNAString("T"), DNAString("N"))
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: T
subject: N
score: -5.899286 
> pairwiseAlignment(DNAString("T"), DNAString("A"))
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: T
subject: A
score: -5.899286 

If you want N to be treated as an ambiguity letter that stands for A, C, G, or T, you need to set the costs of the A / N, C / N, G / N, and T / N substitutions to the same as for the A / A, C / C, G / G, and T / T substitutions. For example by doing something like this:

> mat <- diag(15L, nrow=5, ncol=5) - 5L
> rownames(mat) <- colnames(mat) <- c("A", "C", "G", "T", "N")
> mat["N", ] <- mat[ , "N"] <- mat["A", "A"]
> mat
   A  C  G  T  N
A 10 -5 -5 -5 10
C -5 10 -5 -5 10
G -5 -5 10 -5 10
T -5 -5 -5 10 10
N 10 10 10 10 10
> pairwiseAlignment(DNAString("AAATTTTTCCC"), DNAString("TTNTT"), substitutionMatrix=mat)
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: AAATTTTTCCC
subject: ---TTNTT---
score: 6 
> pairwiseAlignment(DNAString("AAATTTTTCCC"), DNAString("TTTTT"), substitutionMatrix=mat)
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: AAATTTTTCCC
subject: ---TTTTT---
score: 6 
> pairwiseAlignment(DNAString("AAATTTTTCCC"), DNAString("TTGTT"), substitutionMatrix=mat)
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: AAATTTTTCCC
subject: ---TTGTT---
score: -9 

Does that make sense? H.

LTLA commented 6 years ago

Yes, indeed, if substitutionMatrix is set to the output of nucleotideSubstitutionMatrix(), everything works as expected, as I mentioned above. My problem is with the default settings, where nucleotideSubstitutionMatrix() is not used. Rather, each input sequence is assigned a quality of PhredQuality(22L) (if none was specified), and quality-based alignment is performed. This causes some problems when the input sequence contains ambiguous bases, in particular N.

To be clear, my beef is not with the use of the quality-based alignment algorithm itself. In "real world" data, any Ns should have very low qualities by definition (as the base caller could not figure out its identity), and thus should contribute little to the alignment score. The problem here is that, when input qualities are not specified, Ns and other ambiguous bases are automatically and silently assigned qualities of 22L, resulting in strong mismatches to other bases. This is very unintuitive; it took me days to figure out why my adaptor sequences containing a stretch of Ns were not aligning to my data.

I can think of two alternatives to the current defaults:

  1. Use a nucleotideSubstitutionMatrix() when two DNAString[Set] objects are supplied without qualities. Throw an error (or warning, if you use suggestion 2 below) if only one of the objects have qualities. This is probably the clearest and most intuitive/expected for a typical user.
  2. Set the default qualities so that any ambiguous bases have the lowest possible quality, thus avoiding strong mismatches. The quality-aware alignment does, in fact, handle IUPAC codes correctly when the quality scores are low (see the \gamma parameter in ?pairwiseAlignment).
hpages commented 6 years ago

Hi Aaron, The proposed changes sound reasonable but there is a lot of code around that uses pairwiseAlignment() and I want to take the time to investigate/evaluate how it would be impacted by these changes. We're probably too close from the next BioC release for that (I would need some time to fix the packages I break by making these changes and I don't have that time at the moment) so I'm postponing this to after the release. Thanks, H.

LTLA commented 6 years ago

Okay, fair enough, thanks.

LTLA commented 5 years ago

I eventually figured this out - the alignment scheme is working correctly, but fuzzyMatrix needs to be specified when dealing with quality-scaled DNA strings:

pairwiseAlignment(DNAString("TTTTT"), DNAString("NNNNN"), 
     fuzzyMatrix=nucleotideSubstitutionMatrix())
# gives a score of zero.

This seems like something that should be default when pairwiseAlignment sees DNA string inputs.

acvill commented 1 year ago

This is an unusually low score given that I would consider there to be no mismatches at all - a score near zero would be more appropriate.

I think PR https://github.com/Bioconductor/Biostrings/pull/77 addresses this. An asymmetric substitution matrix avoids penalizing alignments in the case of ambiguous subject sequences.