UBod / msa

R package for multiple sequence alignment
https://github.com/UBod/msa
17 stars 11 forks source link

How to interpret the conservation score calculated by msaConservationScore function #26

Closed Rohit-Satyam closed 1 year ago

Rohit-Satyam commented 1 year ago

Hi,

I have MSA of 899 DNV viral genomes and I import the .aln file obtained from Kalign3 in R. I now wish to calculate conservation score to find conserved blocks. I thought this score to range between 0 and 1 but instead I get the following results

763626 763601 763576 763576 763701 763701 763626 763701 763576 763576

        • T - ? - - - 763626 763565 799225 763601 1212141 728182 59905 799229 799225 799225
  • ? ? A C C G T C T 799229 -167138 -102509 776754 4363780 3580253 2077529 1619409 4674159 1510678 T G C C - C G - A - 1263829 734454 1435965 3412637 790277 4193487 3199483 799225 1206592 799224 C G - - T T - A A - 4023181 2927073 799226 728425 1746929 1722217 702010 1884985 1718604 719436 A G G - G ? - T G - 860874 2649958 3211022 799224 1322224 743085 710965 1579230 3209094 719531 T - - - - - - - - - 1399745 790281 790273 790293 790293 790273 790273 790273 790293 790281

  • 790293 790277 790277 790277 790273 790293 790281 790293 790293 790281

I don't understand how do I interpret these numbers?

Code used:

ali <- readDNAMultipleAlignment("output/file.aln",format = "fasta")
msaConservationScore(ali,substitutionMatrix=BLOSUM62)
Rohit-Satyam commented 1 year ago

I just realized I have been using AA substitution matrix above and I might need a nucleotide substitution matrix. So I decided to do the following. However, I get the following error

nmx <-nucleotideSubstitutionMatrix(match = 1, mismatch = 0, baseOnly = FALSE, type = "DNA")
msaConservationScore(ali,substitutionMatrix=nmx)
Error in msaConservationScore.matrix(mat, ...) : 
  some letters occurring in alignment 'x' are missing in substitution matrix

What to do? I think that the error is generated due to missing value of - (gap) from nucleotide substitution matrix.

EDIT1: Made DNA substitution matrix the following way and the error subsides. However, I wish to be reassured if this is the right way to do it?

nmx <-nucleotideSubstitutionMatrix(match = 2, mismatch = -1, baseOnly = FALSE, type = "DNA")
nmx <- cbind(nmx,`-`=nmx[,15])
nmx <- rbind(nmx,`-`=c(nmx[,15],-0.25))
msaConservationScore(ali,substitutionMatrix=nmx)
UBod commented 1 year ago

@Rohit-Satyam, as specified in the package vignette on p. 25, "[...] msaConservationScore() computes sums of pairwise scores for a given substitution/scoring matrix". These values, therefore, are numbers that might exceed the unit inteval. Negative values definitely hint a highly volatile sequence positions, while high positive values hint at high levels of conservation. So that explains the numbers that you observed. As you already noticed, the substitution matrix also needs to score bases versus gaps. It seems you decided to copy the 'N' column/row into a '-' column/row. I agree that this is a good solution that way. So I close this issue. Thanks for using our package and for contributing this discussion!