Ensembl / VEP_plugins

Plugins for the Ensembl Variant Effect Predictor (VEP)
Apache License 2.0

SpliceAI plugin is not giving the most "severe" result among duplicates #638

Closed lacek closed 9 months ago

lacek commented 1 year ago

There are duplicate records in SpliceAI v1.3 (same variant and gene symbol, but different scores). For example, tabix spliceai_scores.masked.snv.hg38.vcf.gz 2:241813895-241813895 19:39885875-39885875 returns:

...
2   241813895   .   A   T   .   .   SpliceAI=T|NEU4|0.00|0.00|0.11|0.00|31|0|-2|33
2   241813895   .   A   T   .   .   SpliceAI=T|NEU4|0.00|0.00|0.70|0.00|-28|0|-2|33
...
19  39885875    .   G   C   .   .   SpliceAI=C|FCGBP|0.18|0.94|0.00|0.00|25|-3|25|-23
19  39885875    .   G   C   .   .   SpliceAI=C|FCGBP|0.25|0.00|0.00|0.00|25|-3|25|-23
...
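For anyone who wants to check their own copy of the file, here is a minimal Python sketch of how such duplicates could be detected. It assumes the INFO value follows the SpliceAI layout ALLELE|SYMBOL|DS_AG|DS_AL|DS_DG|DS_DL|DP_AG|DP_AL|DP_DG|DP_DL, and that lines are whitespace-separated as in the tabix output above. The function names `parse_line` and `find_duplicates` are made up for this illustration and are not part of VEP or SpliceAI.

```python
from collections import defaultdict

# Field order assumed from the SpliceAI INFO layout.
FIELDS = ["ALLELE", "SYMBOL", "DS_AG", "DS_AL", "DS_DG", "DS_DL",
          "DP_AG", "DP_AL", "DP_DG", "DP_DL"]


def parse_line(line):
    """Split one VCF data line into (chrom, pos, ref, alt, annotations)."""
    cols = line.split()
    chrom, pos, ref, alt, info = cols[0], int(cols[1]), cols[3], cols[4], cols[7]
    # Keep only the SpliceAI value, in case other INFO keys are present.
    value = info.split("SpliceAI=", 1)[1].split(";")[0]
    annotations = [dict(zip(FIELDS, entry.split("|")))
                   for entry in value.split(",")]
    return chrom, pos, ref, alt, annotations


def find_duplicates(lines):
    """Group annotations by variant and gene symbol; keep groups with more than one entry."""
    groups = defaultdict(list)
    for line in lines:
        chrom, pos, ref, alt, annotations = parse_line(line)
        for ann in annotations:
            groups[(chrom, pos, ref, alt, ann["SYMBOL"])].append(ann)
    return {key: anns for key, anns in groups.items() if len(anns) > 1}


lines = [
    "2 241813895 . A T . . SpliceAI=T|NEU4|0.00|0.00|0.11|0.00|31|0|-2|33",
    "2 241813895 . A T . . SpliceAI=T|NEU4|0.00|0.00|0.70|0.00|-28|0|-2|33",
    "19 39885875 . G C . . SpliceAI=C|FCGBP|0.18|0.94|0.00|0.00|25|-3|25|-23",
    "19 39885875 . G C . . SpliceAI=C|FCGBP|0.25|0.00|0.00|0.00|25|-3|25|-23",
]

duplicates = find_duplicates(lines)
for key, anns in duplicates.items():
    print(key, [(a["DS_AG"], a["DS_AL"], a["DS_DG"], a["DS_DL"]) for a in anns])
```

In practice the `lines` list would come from the tabix output rather than being typed in by hand.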

VEP web returns the following results for these two variants:

Location    Allele  SYMBOL  Feature SpliceAI_pred_DP_AG SpliceAI_pred_DP_AL SpliceAI_pred_DP_DG SpliceAI_pred_DP_DL SpliceAI_pred_DS_AG SpliceAI_pred_DS_AL SpliceAI_pred_DS_DG SpliceAI_pred_DS_DL SpliceAI_pred_SYMBOL
2:241813895-241813895   T   -   ENSR00001047391 -   -   -   -   -   -   -   -   -
2:241813895-241813895   T   -   ENST00000413820.1   -   -   -   -   -   -   -   -   -
2:241813895-241813895   T   -   ENST00000420272.2   -   -   -   -   -   -   -   -   -
2:241813895-241813895   T   -   ENST00000439270.1   -   -   -   -   -   -   -   -   -
2:241813895-241813895   T   LOC124905349    XR_007088398.1  -   -   -   -   -   -   -   -   -
2:241813895-241813895   T   NEU4    ENST00000325935.10  -28 0   -2  33  0.00    0.00    0.70    0.00    NEU4
2:241813895-241813895   T   NEU4    ENST00000391969.6   -28 0   -2  33  0.00    0.00    0.70    0.00    NEU4
2:241813895-241813895   T   NEU4    ENST00000404257.5   -28 0   -2  33  0.00    0.00    0.70    0.00    NEU4
2:241813895-241813895   T   NEU4    ENST00000405370.5   -28 0   -2  33  0.00    0.00    0.70    0.00    NEU4
2:241813895-241813895   T   NEU4    ENST00000407683.6   -28 0   -2  33  0.00    0.00    0.70    0.00    NEU4
2:241813895-241813895   T   NEU4    ENST00000415936.5   -28 0   -2  33  0.00    0.00    0.70    0.00    NEU4
2:241813895-241813895   T   NEU4    ENST00000420288.1   -28 0   -2  33  0.00    0.00    0.70    0.00    NEU4
2:241813895-241813895   T   NEU4    ENST00000423583.5   -28 0   -2  33  0.00    0.00    0.70    0.00    NEU4
2:241813895-241813895   T   NEU4    ENST00000426032.5   -28 0   -2  33  0.00    0.00    0.70    0.00    NEU4
2:241813895-241813895   T   NEU4    ENST00000428592.1   -28 0   -2  33  0.00    0.00    0.70    0.00    NEU4
2:241813895-241813895   T   NEU4    ENST00000435855.5   -28 0   -2  33  0.00    0.00    0.70    0.00    NEU4
2:241813895-241813895   T   NEU4    ENST00000435894.5   -28 0   -2  33  0.00    0.00    0.70    0.00    NEU4
2:241813895-241813895   T   NEU4    ENST00000435934.1   -28 0   -2  33  0.00    0.00    0.70    0.00    NEU4
2:241813895-241813895   T   NEU4    ENST00000476542.5   -28 0   -2  33  0.00    0.00    0.70    0.00    NEU4
2:241813895-241813895   T   NEU4    ENST00000488997.1   -28 0   -2  33  0.00    0.00    0.70    0.00    NEU4
2:241813895-241813895   T   NEU4    ENST00000494678.1   -28 0   -2  33  0.00    0.00    0.70    0.00    NEU4
2:241813895-241813895   T   NEU4    ENST00000618597.1   -28 0   -2  33  0.00    0.00    0.70    0.00    NEU4
2:241813895-241813895   T   NEU4    NM_001167599.3  -28 0   -2  33  0.00    0.00    0.70    0.00    NEU4
2:241813895-241813895   T   NEU4    NM_001167600.3  -28 0   -2  33  0.00    0.00    0.70    0.00    NEU4
2:241813895-241813895   T   NEU4    NM_001167601.3  -28 0   -2  33  0.00    0.00    0.70    0.00    NEU4
2:241813895-241813895   T   NEU4    NM_001167602.3  -28 0   -2  33  0.00    0.00    0.70    0.00    NEU4
2:241813895-241813895   T   NEU4    NM_080741.4 -28 0   -2  33  0.00    0.00    0.70    0.00    NEU4
19:39885875-39885875    C   FCGBP   ENST00000595713.1   25  -3  25  -23 0.25    0.00    0.00    0.00    FCGBP
19:39885875-39885875    C   FCGBP   ENST00000616721.6   25  -3  25  -23 0.25    0.00    0.00    0.00    FCGBP
19:39885875-39885875    C   FCGBP   NM_003890.2 25  -3  25  -23 0.25    0.00    0.00    0.00    FCGBP

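In short, VEP gives the second of the two duplicate records in both cases: DS_DG = 0.70 for 2-241813895-A-T and DS_AG = 0.25 for 19-39885875-G-C.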

For 19-39885875-G-C, I believe this is because the record with 0.25 comes after the one with 0.94, and the current implementation of the plugin loops over all records matching the variant and gene symbol, so the last matched one wins.

In terms of sensitivity, one would probably want the record with the maximum SpliceAI score (the more pathogenic prediction) instead, i.e. 0.70 for 2-241813895-A-T and 0.94 for 19-39885875-G-C.
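To make the difference concrete, here is a small Python sketch contrasting the two selection rules on the FCGBP duplicates. It is not the plugin's actual code (the plugin is written in Perl); `last_wins` only mimics the behaviour described above, and `most_severe` is a hypothetical alternative that keeps the record whose largest delta score is highest.

```python
DS_KEYS = ("DS_AG", "DS_AL", "DS_DG", "DS_DL")


def max_delta_score(record):
    """Largest of the four delta scores in one record."""
    return max(float(record[key]) for key in DS_KEYS)


def last_wins(records):
    """Mimics the reported behaviour: the last matching record is kept."""
    return records[-1]


def most_severe(records):
    """Hypothetical alternative: keep the record with the highest delta score."""
    return max(records, key=max_delta_score)


# The two FCGBP records for 19-39885875-G-C, in file order.
fcgbp = [
    {"DS_AG": "0.18", "DS_AL": "0.94", "DS_DG": "0.00", "DS_DL": "0.00"},
    {"DS_AG": "0.25", "DS_AL": "0.00", "DS_DG": "0.00", "DS_DL": "0.00"},
]

print("last wins:  ", max_delta_score(last_wins(fcgbp)))
print("most severe:", max_delta_score(most_severe(fcgbp)))
```

With this rule the result no longer depends on the order of the records in the file. When two records tie on their maximum score, `max` keeps the first one.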

dglemos commented 1 year ago

Hi @lacek, thank you for reporting this issue. The file is not supposed to have more than one score for each variant, which is why the plugin does not handle these records well. I think it is better to contact the SpliceAI authors to understand why there are multiple scores.

dglemos commented 11 months ago

As an alternative, you could download the Ensembl scores calculated for the MANE Select transcripts. The files are available here: http://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh38/variation_plugins/

lacek commented 11 months ago

@dglemos In the file http://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh38/variation_plugins/spliceai_scores.masked.snv.ensembl_mane.grch38.110.vcf.gz, there are also duplicate lines, e.g.

16  1534632 .   T   A   .   .   SpliceAI=A|TMEM204|0.00|0.00|0.34|0.00|-45|7|1|9,A|IFT140|0.00|0.00|0.00|0.00|-35|-2|-35|-1
16  1534632 .   T   A   .   .   SpliceAI=A|TMEM204|0.00|0.00|0.33|0.00|-45|7|1|9,A|IFT140|0.00|0.00|0.00|0.00|-35|-2|-35|-1
16  1534632 .   T   C   .   .   SpliceAI=C|TMEM204|0.00|0.00|0.00|0.00|-45|13|1|9,C|IFT140|0.00|0.00|0.00|0.00|49|-2|-35|-1
16  1534632 .   T   C   .   .   SpliceAI=C|TMEM204|0.00|0.00|0.00|0.00|-45|13|1|9,C|IFT140|0.00|0.00|0.00|0.00|49|-2|-35|-1
16  1534632 .   T   G   .   .   SpliceAI=G|TMEM204|0.00|0.00|0.00|0.00|7|-45|-15|1,G|IFT140|0.00|0.00|0.00|0.00|-35|-2|-35|-1
16  1534632 .   T   G   .   .   SpliceAI=G|TMEM204|0.00|0.00|0.00|0.00|7|-45|-15|1,G|IFT140|0.00|0.00|0.00|0.00|-35|-2|-35|-1

However, the score differences among the duplicates appear to be within 0.01, so they shouldn't affect interpretation.
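A quick way to quantify this is to compare the four delta scores of two duplicate annotations field by field. The Python sketch below assumes the delta scores are the third to sixth pipe-separated fields; the helper names are made up for this example. It uses `Decimal` so that 0.34 minus 0.33 comes out as exactly 0.01.

```python
from decimal import Decimal


def delta_scores(annotation):
    """Return the four delta scores (DS_AG, DS_AL, DS_DG, DS_DL) as Decimals."""
    parts = annotation.split("|")
    return [Decimal(value) for value in parts[2:6]]


def max_spread(first, second):
    """Largest absolute per-field difference between two duplicate annotations."""
    return max(abs(a - b)
               for a, b in zip(delta_scores(first), delta_scores(second)))


# TMEM204 duplicates from the Ensembl MANE file.
tmem204_a = "A|TMEM204|0.00|0.00|0.34|0.00|-45|7|1|9"
tmem204_b = "A|TMEM204|0.00|0.00|0.33|0.00|-45|7|1|9"

# NEU4 duplicates from the original SpliceAI v1.3 file, for comparison.
neu4_a = "T|NEU4|0.00|0.00|0.11|0.00|31|0|-2|33"
neu4_b = "T|NEU4|0.00|0.00|0.70|0.00|-28|0|-2|33"

print("MANE file spread:    ", max_spread(tmem204_a, tmem204_b))
print("Original file spread:", max_spread(neu4_a, neu4_b))
```

Running the same comparison over every duplicate group in the file would confirm whether the 0.01 bound holds everywhere, not only for the lines shown here.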

I will check with my team to see if this file fits our use case. Thank you for the advice.

dglemos commented 11 months ago

Thanks for letting us know! There was a problem with the masked file. We will publish a fixed version of the file in the next release.

dglemos commented 9 months ago

I'm going to close this ticket but feel free to open a new one if you have any other questions.

Best wishes, Diana