Open LTLA opened 6 years ago
I don't think that pairwiseAlignment()
has built-in capabilities for handling IUPAC ambiguities out-of-the-box. My understanding is that, if no special arrangements are made, N is simply treated as any other letter. This means that the T / N substitution has the same cost as the T / A substitution:
> library(Biostrings)
> pairwiseAlignment(DNAString("T"), DNAString("N"))
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: T
subject: N
score: -5.899286
> pairwiseAlignment(DNAString("T"), DNAString("A"))
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: T
subject: A
score: -5.899286
If you want N to be treated as an ambiguity letter that stands for A, C, G, or T, you need to set the costs of the A / N, C / N, G / N, and T / N substitutions to the same as for the A / A, C / C, G / G, and T / T substitutions. For example by doing something like this:
> mat <- diag(15L, nrow=5, ncol=5) - 5L
> rownames(mat) <- colnames(mat) <- c("A", "C", "G", "T", "N")
> mat["N", ] <- mat[ , "N"] <- mat["A", "A"]
> mat
A C G T N
A 10 -5 -5 -5 10
C -5 10 -5 -5 10
G -5 -5 10 -5 10
T -5 -5 -5 10 10
N 10 10 10 10 10
> pairwiseAlignment(DNAString("AAATTTTTCCC"), DNAString("TTNTT"), substitutionMatrix=mat)
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: AAATTTTTCCC
subject: ---TTNTT---
score: 6
> pairwiseAlignment(DNAString("AAATTTTTCCC"), DNAString("TTTTT"), substitutionMatrix=mat)
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: AAATTTTTCCC
subject: ---TTTTT---
score: 6
> pairwiseAlignment(DNAString("AAATTTTTCCC"), DNAString("TTGTT"), substitutionMatrix=mat)
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: AAATTTTTCCC
subject: ---TTGTT---
score: -9
Does that make sense? H.
Yes, indeed, if substitutionMatrix
is set to the output of nucleotideSubstitutionMatrix()
, everything works as expected, as I mentioned above. My problem is with the default settings, where nucleotideSubstitutionMatrix()
is not used. Rather, each input sequence is assigned a quality of PhredQuality(22L)
(if none was specified), and quality-based alignment is performed. This causes some problems when the input sequence contains ambiguous bases, in particular N
.
To be clear, my beef is not with the use of the quality-based alignment algorithm itself. In "real world" data, any N
s should have very low qualities by definition (as the base caller could not figure out its identity), and thus should contribute little to the alignment score. The problem here is that, when input qualities are not specified, N
s and other ambiguous bases are automatically and silently assigned qualities of 22L
, resulting in strong mismatches to other bases. This is very unintuitive; it took me days to figure out why my adaptor sequences containing a stretch of N
s were not aligning to my data.
I can think of two alternatives to the current defaults:
nucleotideSubstitutionMatrix()
when two DNAString[Set]
objects are supplied without qualities. Throw an error (or warning, if you use suggestion 2 below) if only one of the objects have qualities. This is probably the clearest and most intuitive/expected for a typical user.\gamma
parameter in ?pairwiseAlignment
).Hi Aaron,
The proposed changes sound reasonable but there is a lot of code around that uses pairwiseAlignment()
and I want to take the time to investigate/evaluate how it would be impacted by these changes. We're probably too close from the next BioC release for that (I would need some time to fix the packages I break by making these changes and I don't have that time at the moment) so I'm postponing this to after the release.
Thanks,
H.
Okay, fair enough, thanks.
I eventually figured this out - the alignment scheme is working correctly, but fuzzyMatrix
needs to be specified when dealing with quality-scaled DNA strings:
pairwiseAlignment(DNAString("TTTTT"), DNAString("NNNNN"),
fuzzyMatrix=nucleotideSubstitutionMatrix())
# gives a score of zero.
This seems like something that should be default when pairwiseAlignment
sees DNA string inputs.
This is an unusually low score given that I would consider there to be no mismatches at all - a score near zero would be more appropriate.
I think PR https://github.com/Bioconductor/Biostrings/pull/77 addresses this. An asymmetric substitution matrix avoids penalizing alignments in the case of ambiguous subject sequences.
The default behaviour of
pairwiseAlignment
is to perform a quality-weighted alignment, even in the absence of any quality scores in the two input sequences. This is normally fine, as sequences without qualities are given a constant quality of22L
across all bases, and presumably this is just as reasonable as using arbitrary match/mismatch scores innucleotideSubstitutionMatrix
.However, the use of a high constant quality has odd effects when ambiguous N's are present. Consider:
... which gives a score of -29.5 (currently testing on version 2.46.0). This is an unusually low score given that I would consider there to be no mismatches at all - a score near zero would be more appropriate. Indeed, using a
nucleotideSubstitutionMatrix
gives me something a lot more sensible:In short; is the default quality score choice of
22L
appropriate for N's? Hacking around to assign low qualities to N's also gives something more reasonable than the default:... though obviously this would require more work when only a few bases are N.