Closed: lacek closed this issue 9 months ago
Hi @lacek, thank you for reporting this issue. The file is not supposed to have more than one score for each variant, which is why the plugin does not handle these scores well. I think it would be best to contact SpliceAI's author to understand why there are multiple scores.
As an alternative, you could download the Ensembl scores calculated for the MANE Select transcripts. The files are available here: http://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh38/variation_plugins/
@dglemos In the file http://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh38/variation_plugins/spliceai_scores.masked.snv.ensembl_mane.grch38.110.vcf.gz, there are also duplicate lines, e.g.
16 1534632 . T A . . SpliceAI=A|TMEM204|0.00|0.00|0.34|0.00|-45|7|1|9,A|IFT140|0.00|0.00|0.00|0.00|-35|-2|-35|-1
16 1534632 . T A . . SpliceAI=A|TMEM204|0.00|0.00|0.33|0.00|-45|7|1|9,A|IFT140|0.00|0.00|0.00|0.00|-35|-2|-35|-1
16 1534632 . T C . . SpliceAI=C|TMEM204|0.00|0.00|0.00|0.00|-45|13|1|9,C|IFT140|0.00|0.00|0.00|0.00|49|-2|-35|-1
16 1534632 . T C . . SpliceAI=C|TMEM204|0.00|0.00|0.00|0.00|-45|13|1|9,C|IFT140|0.00|0.00|0.00|0.00|49|-2|-35|-1
16 1534632 . T G . . SpliceAI=G|TMEM204|0.00|0.00|0.00|0.00|7|-45|-15|1,G|IFT140|0.00|0.00|0.00|0.00|-35|-2|-35|-1
16 1534632 . T G . . SpliceAI=G|TMEM204|0.00|0.00|0.00|0.00|7|-45|-15|1,G|IFT140|0.00|0.00|0.00|0.00|-35|-2|-35|-1
However, the scores of the duplicates appear to differ by at most 0.01, so this shouldn't affect interpretation.
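For anyone who wants to check that claim on a larger region, here is a minimal Python sketch. It assumes the usual SpliceAI INFO layout (ALLELE|SYMBOL|DS_AG|DS_AL|DS_DG|DS_DL followed by four positions) and whitespace-separated columns as pasted above. The function names `parse_record` and `duplicate_spread` are made up for illustration and are not part of VEP or SpliceAI.

```python
from collections import defaultdict

def parse_record(line):
    """Split one VCF data line into a variant key and per-gene delta scores."""
    chrom, pos, _id, ref, alt, _qual, _filter, info = line.split()[:8]
    scores = {}
    for field in info.split(";"):
        if field.startswith("SpliceAI="):
            for entry in field[len("SpliceAI="):].split(","):
                parts = entry.split("|")
                # Assumed layout: ALLELE, SYMBOL, DS_AG, DS_AL, DS_DG, DS_DL, then 4 positions
                scores[parts[1]] = tuple(float(x) for x in parts[2:6])
    return (chrom, int(pos), ref, alt), scores

def duplicate_spread(lines):
    """Map (variant, gene) to the largest delta-score difference among duplicate lines."""
    seen = defaultdict(list)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip VCF header lines and blank lines
        key, scores = parse_record(line)
        for gene, ds in scores.items():
            seen[(key, gene)].append(ds)
    spread = {}
    for key_gene, rows in seen.items():
        if len(rows) > 1:  # only variants that occur on more than one line
            spread[key_gene] = max(max(col) - min(col) for col in zip(*rows))
    return spread

# Four of the lines quoted above
sample = [
    "16 1534632 . T A . . SpliceAI=A|TMEM204|0.00|0.00|0.34|0.00|-45|7|1|9,A|IFT140|0.00|0.00|0.00|0.00|-35|-2|-35|-1",
    "16 1534632 . T A . . SpliceAI=A|TMEM204|0.00|0.00|0.33|0.00|-45|7|1|9,A|IFT140|0.00|0.00|0.00|0.00|-35|-2|-35|-1",
    "16 1534632 . T C . . SpliceAI=C|TMEM204|0.00|0.00|0.00|0.00|-45|13|1|9,C|IFT140|0.00|0.00|0.00|0.00|49|-2|-35|-1",
    "16 1534632 . T C . . SpliceAI=C|TMEM204|0.00|0.00|0.00|0.00|-45|13|1|9,C|IFT140|0.00|0.00|0.00|0.00|49|-2|-35|-1",
]

for (variant, gene), diff in duplicate_spread(sample).items():
    print(variant, gene, round(diff, 2))
```

To run it on real data, the output of a `tabix` query could be passed in as `lines`. Any value well above 0.01 would point to duplicates that do matter for interpretation.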
I will check with my team to see if this file fits our use case. Thank you for the advice.
Thanks for letting us know! There was a problem with the masked file. We will publish a fixed version of the file in the next release.
I'm going to close this ticket but feel free to open a new one if you have any other questions.
Best wishes, Diana
There are duplicate records in SpliceAI v1.3 (same variant and gene symbol, but different scores), e.g. from
tabix spliceai_scores.masked.snv.hg38.vcf.gz 2:241813895-241813895 19:39885875-39885875
we have:
The following are the results from VEP web for the above 2 variants:
In short, VEP gives
For 19-39885875-G-C, I believe this is because the record with 0.25 comes after the one with 0.94, and the current implementation of the plugin loops over all records matching the variant and gene symbol, so the last matched record wins.
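To make the suspected behaviour concrete, here is a small Python sketch. It is not the plugin's code (the plugin is written in Perl) and it only mimics the "last match wins" loop described above, alongside one possible alternative that keeps the highest-scoring record. The record layout and the names `pick_last`, `pick_max` and `max_delta` are assumptions for illustration. The sample values reuse the duplicated TMEM204 entries for 16:1534632 T>A quoted elsewhere in this thread.

```python
def max_delta(record):
    """Highest of the four delta scores (DS_AG, DS_AL, DS_DG, DS_DL)."""
    return max(record["ds"])

def pick_last(records, allele, symbol):
    """Mimic 'last match wins': each later duplicate overwrites the earlier one."""
    chosen = None
    for rec in records:
        if rec["allele"] == allele and rec["symbol"] == symbol:
            chosen = rec
    return chosen

def pick_max(records, allele, symbol):
    """Alternative: among all matching records, keep the one with the highest delta score."""
    matches = [r for r in records
               if r["allele"] == allele and r["symbol"] == symbol]
    return max(matches, key=max_delta) if matches else None

# Duplicate records in file order (hypothetical in-memory representation)
records = [
    {"allele": "A", "symbol": "TMEM204", "ds": (0.00, 0.00, 0.34, 0.00)},
    {"allele": "A", "symbol": "TMEM204", "ds": (0.00, 0.00, 0.33, 0.00)},
]

print("last wins:", max_delta(pick_last(records, "A", "TMEM204")))
print("max wins: ", max_delta(pick_max(records, "A", "TMEM204")))
```

With the two rules side by side, the result of the first depends on the order of lines in the file, while the second does not.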
In terms of sensitivity, one would probably want the one with max SpliceAI score (more pathogenic prediction) instead, i.e.