PharmGKB / PharmCAT

The Pharmacogenomic Clinical Annotation Tool
Mozilla Public License 2.0

Issue with rs746071566 variant representation formats in the source database #46

Open BinglanLi opened 3 years ago

BinglanLi commented 3 years ago

bug

PharmCAT reports an error related to a NUDT15 variant:

java.lang.IllegalStateException: Not an deletion: G >GGGAGTC @ g.48037796delGAGTCG
<...truncated...>

The root of the issue is how one specific structural variant, rs746071566, is represented. rs746071566 is a structural variant that includes both a deletion and an insertion/duplication. When the insertion and the deletion of rs746071566 are presented as separate records, VCF processing tools (such as bcftools) can be confused about what to call at this locus (explained in the following section).

Issue with rs746071566

rs746071566 is a structural variant with both deletion and insertion/duplication.

# in NUDT15_translation.json
     {
       "chromosome": "chr13",
       "position": 48037796,  <------ dbSNP suggests POS 48037783 instead
       "rsid": "rs746071566",
       "chromosomeHgvsName": "g.48037796delGAGTCG",
       "type": "DEL"
     },
     {
       "chromosome": "chr13",
       "position": 48037801,
       "rsid": "rs746071566",
       "chromosomeHgvsName": "g.48037801_48037802insGAGTCG",
       "type": "INS"
     },

As far as I understand, because the INS and the DEL are presented in separate records, they will eventually be interpreted as the following VCF records:

# consider the position issue fixed
# deletion
chr13   48037782        rs746071566     AGGAGTC A       .       PASS    .        GT      0/0
# insertion
chr13   48037782        rs746071566     A AGGAGTC       .       PASS    .        GT      0/0

The two records contradict each other. With GT 0/0, the DEL record says the person is AGGAGTC/AGGAGTC, not A/A, while the INS record says the same person is A/A, not AGGAGTC/AGGAGTC.
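The contradiction can be reproduced with a small sketch that resolves a GT field against a record's REF and ALT columns. The helper name `called_alleles` is hypothetical and is not part of PharmCAT or bcftools; it only illustrates how the same 0/0 call reads differently in each record.

```python
def called_alleles(ref, alt, gt):
    """Resolve a VCF GT string (e.g. "0/0") into allele sequences.

    Index 0 is REF; indices 1..n are the comma-separated ALT alleles.
    Hypothetical helper for illustration only.
    """
    alleles = [ref] + alt.split(",")
    separator = "|" if "|" in gt else "/"
    return [alleles[int(index)] for index in gt.split(separator)]


# Same CHROM, POS, and rsID, but REF and ALT are swapped between the records.
deletion_call = called_alleles("AGGAGTC", "A", "0/0")
insertion_call = called_alleles("A", "AGGAGTC", "0/0")

print("DEL record:", "/".join(deletion_call))
print("INS record:", "/".join(insertion_call))
print("Records agree:", deletion_call == insertion_call)
```

Because both records share the same position and sample, a reference call in one of them is a non-reference call in the other, so no single genotype satisfies both.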

I believe the correct format should be

# in NUDT15_translation.json
     {
       "chromosome": "chr13",
       "position": 48037782,
       "rsid": "rs746071566",
       "chromosomeHgvsName": "g.48037783",
       "resourceNote": "A(GGAGTC)3G A(GGAGTC)2G A(GGAGTC)4G A(GGAGTC)5G",
       "type": "DEL"/"SNP"
     },
# indel
chr13   48037782        rs746071566     AGGAGTC A,AGGAGTCGGAGTC        .       PASS    .        GT      0/0
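As a rough sketch of how the single multi-allelic record above can be derived from the two separate records, the function below pads every ALT with the trailing bases of the longest REF. The name `merge_biallelic` is hypothetical, and the sketch assumes all input records share the same CHROM and POS; it is not how PharmCAT or bcftools implement merging.

```python
def merge_biallelic(records):
    """Merge (REF, ALT) pairs at one position into a single REF and ALT list.

    Assumes every record starts at the same CHROM/POS, so each REF is a
    prefix of the longest REF. Hypothetical helper for illustration only.
    """
    longest_ref = max((ref for ref, _ in records), key=len)
    merged_alts = []
    for ref, alt in records:
        if not longest_ref.startswith(ref):
            raise ValueError(f"REF {ref!r} is not a prefix of {longest_ref!r}")
        # Append the reference bases this record did not cover to its ALT.
        padded_alt = alt + longest_ref[len(ref):]
        if padded_alt not in merged_alts:
            merged_alts.append(padded_alt)
    return longest_ref, merged_alts


ref, alts = merge_biallelic([("AGGAGTC", "A"), ("A", "AGGAGTC")])
print(ref, ",".join(alts))
```

With REF fixed to the longer allele, 0/0 has one meaning for both the deletion and the insertion, which removes the ambiguity described above.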

The information on rs746071566 is extracted from a specific source database that maintains the NUDT15 allele nomenclature, right? Should we report the issue?