Open lacek opened 1 year ago
This is because we did all the scoring in hg19, and the hg38 scores were provided via liftover. When two different positions in hg19 map to the same position in hg38, you see a duplicated record. If you stick to the list of genes here (https://github.com/Illumina/SpliceAI/blob/master/spliceai/annotations/grch38.txt), you will not run into this issue. My recommendation would be to rerun the tool on such examples to get the correct hg38 scores and bypass liftover-related issues.
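The gene-list filter suggested above could be sketched roughly as follows. This is a minimal sketch, not part of SpliceAI: the helper names (`load_gene_names`, `spliceai_gene`, `filter_by_genes`) are hypothetical, and it assumes that the annotation file is tab-separated with the gene symbol in its first column, and that each record carries a single `SpliceAI=ALLELE|SYMBOL|...` entry in the INFO column, as in the records quoted in this thread.

```python
def load_gene_names(annotation_lines):
    """Collect gene symbols from the first column of an annotation table.

    Assumption: tab-separated lines, header lines start with '#'.
    """
    genes = set()
    for line in annotation_lines:
        if line.startswith("#") or not line.strip():
            continue
        genes.add(line.split("\t", 1)[0].strip())
    return genes


def spliceai_gene(vcf_line):
    """Return the gene symbol of the first SpliceAI annotation, or None."""
    fields = vcf_line.split()
    if len(fields) < 8:
        return None
    for entry in fields[7].split(";"):
        if entry.startswith("SpliceAI="):
            parts = entry[len("SpliceAI="):].split("|")
            return parts[1] if len(parts) > 1 else None
    return None


def filter_by_genes(vcf_lines, genes):
    """Yield header lines plus records whose SpliceAI gene is in `genes`."""
    for line in vcf_lines:
        if line.startswith("#") or spliceai_gene(line) in genes:
            yield line


# Toy input; real use would read the annotation file and the VCF instead.
annotation = ["#NAME\tCHROM\tSTRAND", "SNTG2\t2\t+"]
records = [
    "##fileformat=VCFv4.2",
    "2 1223255 . A C . . SpliceAI=C|SNTG2|0.00|0.00|0.01|0.00|-3|-20|-2|49",
    "17 1153657 . A C . . SpliceAI=C|ABR|0.00|0.00|0.00|0.00|4|14|4|-25",
]
kept = list(filter_by_genes(records, load_gene_names(annotation)))
```

For the real precomputed file, the lines would come from something like `gzip.open(path, "rt")` rather than an in-memory list.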
@kishorejaganathan When filtering by the list of grch38 genes, there are still records of the same variant and gene with different scores, e.g.:
2 1223255 . A C . . SpliceAI=C|SNTG2|0.00|0.00|0.01|0.00|-3|-20|-2|49
2 1223255 . A C . . SpliceAI=C|SNTG2|0.00|0.00|0.01|0.00|-3|-20|-2|-21
2 1223255 . A G . . SpliceAI=G|SNTG2|0.00|0.00|0.00|0.00|50|-20|-21|49
2 1223255 . A G . . SpliceAI=G|SNTG2|0.00|0.00|0.00|0.00|-3|-20|-21|5
2 1223255 . A G . . SpliceAI=G|SNTG2|0.00|0.00|0.00|0.00|-3|-20|-21|18
2 1223255 . A T . . SpliceAI=T|SNTG2|0.00|0.00|0.09|0.00|12|-20|-2|49
2 1223255 . A T . . SpliceAI=T|SNTG2|0.00|0.00|0.62|0.00|12|50|-2|-21
2 1223255 . A T . . SpliceAI=T|SNTG2|0.00|0.00|0.66|0.00|12|-20|-2|-21
17 1153657 . A C . . SpliceAI=C|ABR|0.00|0.00|0.00|0.00|4|14|4|-25
17 1153657 . A C . . SpliceAI=C|ABR|0.00|0.00|0.01|0.00|-44|3|4|-21
17 1153657 . A G . . SpliceAI=G|ABR|0.00|0.00|0.00|0.00|43|-15|4|33
17 1153657 . A G . . SpliceAI=G|ABR|0.00|0.00|0.01|0.00|3|-44|4|33
17 1153657 . A T . . SpliceAI=T|ABR|0.00|0.00|0.04|0.00|43|-44|4|33
17 1153657 . A T . . SpliceAI=T|ABR|0.00|0.00|0.19|0.00|43|-44|4|33
Ah, thanks for bringing this to my attention. I accepted all genes which had the same number of exons and matching exon lengths between the two annotations. These genes meet that criterion but have some liftover issues in introns. You can ignore such genes, or run SpliceAI with hg38 annotations from scratch instead to avoid liftover issues.
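To decide which records to ignore or re-score, one option is to scan the file for variant/gene pairs that appear with more than one distinct annotation. The following is a minimal sketch under stated assumptions: `find_conflicting_records` is a hypothetical helper, not part of SpliceAI, and it assumes whitespace- or tab-separated VCF records with the `SpliceAI=ALLELE|SYMBOL|scores|positions` layout shown in the records above.

```python
from collections import defaultdict


def find_conflicting_records(vcf_lines):
    """Map (chrom, pos, ref, alt, gene) to its distinct annotations.

    Only keys seen with more than one distinct score/position tuple
    are returned.
    """
    seen = defaultdict(set)
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split()
        chrom, pos, ref, alt, info = (
            fields[0], fields[1], fields[3], fields[4], fields[7]
        )
        for entry in info.split(";"):
            if not entry.startswith("SpliceAI="):
                continue
            # Several annotations may be comma-separated in one entry.
            for annotation in entry[len("SpliceAI="):].split(","):
                parts = annotation.split("|")
                if len(parts) < 3:
                    continue
                key = (chrom, pos, ref, alt, parts[1])
                seen[key].add(tuple(parts[2:]))
    return {key: sorted(values) for key, values in seen.items() if len(values) > 1}


# Three of the records quoted earlier in this thread.
records = [
    "2 1223255 . A C . . SpliceAI=C|SNTG2|0.00|0.00|0.01|0.00|-3|-20|-2|49",
    "17 1153657 . A T . . SpliceAI=T|ABR|0.00|0.00|0.04|0.00|43|-44|4|33",
    "17 1153657 . A T . . SpliceAI=T|ABR|0.00|0.00|0.19|0.00|43|-44|4|33",
]
conflicts = find_conflicting_records(records)
for key, values in conflicts.items():
    print(key, len(values), "distinct annotations")
```

Variants flagged this way are candidates for re-scoring with SpliceAI against the hg38 annotation rather than for picking one of the lifted-over values.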
Referring to Ensembl/VEP_plugins#638, we found that in
spliceai_scores.masked.snv.hg38.vcf.gz
some variants have several records with different scores, even though the gene symbol is the same, e.g.:
Are these cases expected? If so, how should we interpret such records?