Bioconductor / VariantAnnotation

Annotation of Genetic Variants
https://bioconductor.org/packages/VariantAnnotation
27 stars 20 forks source link

nonsense mutations from DBS across two codons misclassified as nonsynonymous? #84

Open ChristofferFlensburg opened 4 months ago

ChristofferFlensburg commented 4 months ago

Hi! VariantAnnotation is a great package and I enjoy depending on it in my own package, thanks for that!

I'm looking at adding support for dibase substitutions, and seems there might be a little bug in the annotation there.

From

coding = VariantAnnotation::predictCoding(vcf, txdb, seqSource = fafile)

on a few hundred variants, I saw a few dibase variants that affected two amino acids, where one of them turned into stop, but CONSEQUENCE (incorrectly I think) classified it as a nonsynomous variant:

> coding[3,]
GRanges object with 1 range and 16 metadata columns:
      seqnames            ranges strand |            REF             ALT
         <Rle>         <IRanges>  <Rle> | <DNAStringSet> <CharacterList>
  [1]        1 27877576-27877577      - |             GG              AA
           QUAL      FILTER      varAllele    CDSLOC    PROTEINLOC   QUERYID
      <numeric> <character> <DNAStringSet> <IRanges> <IntegerList> <integer>
  [1]        40        PASS             TT 1050-1051       350,351        12
             TXID         CDSID      GENEID   CONSEQUENCE       REFCODON
      <character> <IntegerList> <character>      <factor> <DNAStringSet>
  [1]        4786   14376,14377       27245 nonsynonymous         CCCCAG
            VARCODON         REFAA         VARAA
      <DNAStringSet> <AAStringSet> <AAStringSet>
  [1]         CCTTAG            PQ            P*
  -------
  seqinfo: 93 sequences from an unspecified genome

While DBS that stayed in one codon were (correctly) annotated as nonsense.

> coding[coding$CONSEQUENCE=='nonsense',][1,]
GRanges object with 1 range and 16 metadata columns:
      seqnames              ranges strand |            REF             ALT
         <Rle>           <IRanges>  <Rle> | <DNAStringSet> <CharacterList>
  [1]        3 164739107-164739108      - |             AT              TA
           QUAL      FILTER      varAllele    CDSLOC    PROTEINLOC   QUERYID
      <numeric> <character> <DNAStringSet> <IRanges> <IntegerList> <integer>
  [1]        40        PASS             TA 3163-3164          1055        69
             TXID         CDSID      GENEID CONSEQUENCE       REFCODON
      <character> <IntegerList> <character>    <factor> <DNAStringSet>
  [1]       16983         52835        6476    nonsense            ATA
            VARCODON         REFAA         VARAA
      <DNAStringSet> <AAStringSet> <AAStringSet>
  [1]            TAA             I             *
  -------
  seqinfo: 93 sequences from an unspecified genome

Seems it's checking if VARAA is identical to "*", while it really should check if it contains "*", as that is enough to truncate the protein. Line 184 of methods-predictCoding.R

    consequence[nonsynonymous & (as.character(varAA) %in% "*")] <- "nonsense" 

Possibly should be something along the lines of

    consequence[nonsynonymous & grepl("\\*", as.character(varAA))] <- "nonsense" 

I'm adding a quick fix for it in my code

coding = VariantAnnotation::predictCoding(vcf, txdb, seqSource = fafile)
coding$CONSEQUENCE[grepl("\\*", as.character(coding$VARAA))] = 'nonsense'

so no hurry for me.