Illumina / SpliceAI

A deep learning-based tool to identify splice variants

Duplicate records in the released VCF file #139

Open lacek opened 1 year ago

lacek commented 1 year ago

As reported in Ensembl/VEP_plugins#638, spliceai_scores.masked.snv.hg38.vcf.gz contains variants that appear more than once with different scores, even though the gene symbol is the same, e.g.:

...
2   241813895   .   A   T   .   .   SpliceAI=T|NEU4|0.00|0.00|0.11|0.00|31|0|-2|33
2   241813895   .   A   T   .   .   SpliceAI=T|NEU4|0.00|0.00|0.70|0.00|-28|0|-2|33
...
19  39885875    .   G   C   .   .   SpliceAI=C|FCGBP|0.18|0.94|0.00|0.00|25|-3|25|-23
19  39885875    .   G   C   .   .   SpliceAI=C|FCGBP|0.25|0.00|0.00|0.00|25|-3|25|-23
...

Are these cases expected? If so, how should we interpret such records?
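For readers comparing the records above, here is a minimal sketch of how one annotation could be split into named fields. The field order (ALLELE|SYMBOL|DS_AG|DS_AL|DS_DG|DS_DL|DP_AG|DP_AL|DP_DG|DP_DL) follows the SpliceAI README; the `SpliceAIRecord` type and the `parse_spliceai` helper are hypothetical names used only for illustration, and the sketch assumes a single annotation per INFO field.

```python
from typing import NamedTuple


class SpliceAIRecord(NamedTuple):
    """One SpliceAI annotation (hypothetical helper type)."""
    allele: str
    symbol: str
    ds_ag: float  # delta score, acceptor gain
    ds_al: float  # delta score, acceptor loss
    ds_dg: float  # delta score, donor gain
    ds_dl: float  # delta score, donor loss
    dp_ag: int    # delta position, acceptor gain
    dp_al: int    # delta position, acceptor loss
    dp_dg: int    # delta position, donor gain
    dp_dl: int    # delta position, donor loss


def parse_spliceai(info: str) -> SpliceAIRecord:
    """Parse the first SpliceAI annotation found in a VCF INFO string."""
    value = info.split("SpliceAI=", 1)[1]
    # Assumption: keep only the first annotation if several are present.
    value = value.split(";", 1)[0].split(",", 1)[0]
    f = value.split("|")
    return SpliceAIRecord(f[0], f[1], *map(float, f[2:6]), *map(int, f[6:10]))


a = parse_spliceai("SpliceAI=T|NEU4|0.00|0.00|0.11|0.00|31|0|-2|33")
b = parse_spliceai("SpliceAI=T|NEU4|0.00|0.00|0.70|0.00|-28|0|-2|33")
# The two NEU4 records differ in donor-gain score and acceptor-gain position.
print(a.ds_dg, b.ds_dg, a.dp_ag, b.dp_ag)
```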

kishorejaganathan commented 1 year ago

This is because we did all the scoring in hg19, and the hg38 scores were provided via liftover. When two different positions in hg19 map to the same position in hg38, you see a duplicated record. If you stick to the list of genes here (https://github.com/Illumina/SpliceAI/blob/master/spliceai/annotations/grch38.txt), you will not run into this issue. For such examples, my recommendation would be to rerun the tool to get the correct hg38 score and bypass liftover-related issues.
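A minimal sketch of how such collisions could be located in the lifted-over file, assuming records for the same position are adjacent (as in the excerpts above) and that each line carries one SpliceAI annotation in the eighth column. The helper names `record_key` and `conflicting_groups` are hypothetical; this is not part of the SpliceAI package.

```python
from itertools import groupby


def record_key(line):
    """Return (chrom, pos, ref, alt, gene) for one VCF data line."""
    cols = line.split()
    gene = cols[7].split("SpliceAI=", 1)[1].split("|")[1]
    return (cols[0], cols[1], cols[3], cols[4], gene)


def conflicting_groups(lines):
    """Yield (key, annotations) for keys that carry differing annotations.

    Streams the input, so it relies on duplicate records being adjacent.
    """
    data = (l for l in lines if l.strip() and not l.startswith("#"))
    for key, group in groupby(data, key=record_key):
        annotations = {l.split()[7] for l in group}
        if len(annotations) > 1:
            yield key, sorted(annotations)


sample = [
    "2\t241813895\t.\tA\tT\t.\t.\tSpliceAI=T|NEU4|0.00|0.00|0.11|0.00|31|0|-2|33",
    "2\t241813895\t.\tA\tT\t.\t.\tSpliceAI=T|NEU4|0.00|0.00|0.70|0.00|-28|0|-2|33",
    "19\t39885875\t.\tG\tC\t.\t.\tSpliceAI=C|FCGBP|0.18|0.94|0.00|0.00|25|-3|25|-23",
]
for key, annotations in conflicting_groups(sample):
    print(key, annotations)
```

For the real file, the same generator could be fed from `gzip.open(path, "rt")`; exact duplicates collapse in the set, so only records with differing scores or positions are reported.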

lacek commented 1 year ago

@kishorejaganathan Even after filtering by the list of GRCh38 genes, there are still records for the same variant and gene with different scores, e.g.:

2   1223255 .   A   C   .   .   SpliceAI=C|SNTG2|0.00|0.00|0.01|0.00|-3|-20|-2|49
2   1223255 .   A   C   .   .   SpliceAI=C|SNTG2|0.00|0.00|0.01|0.00|-3|-20|-2|-21
2   1223255 .   A   G   .   .   SpliceAI=G|SNTG2|0.00|0.00|0.00|0.00|50|-20|-21|49
2   1223255 .   A   G   .   .   SpliceAI=G|SNTG2|0.00|0.00|0.00|0.00|-3|-20|-21|5
2   1223255 .   A   G   .   .   SpliceAI=G|SNTG2|0.00|0.00|0.00|0.00|-3|-20|-21|18
2   1223255 .   A   T   .   .   SpliceAI=T|SNTG2|0.00|0.00|0.09|0.00|12|-20|-2|49
2   1223255 .   A   T   .   .   SpliceAI=T|SNTG2|0.00|0.00|0.62|0.00|12|50|-2|-21
2   1223255 .   A   T   .   .   SpliceAI=T|SNTG2|0.00|0.00|0.66|0.00|12|-20|-2|-21
17  1153657 .   A   C   .   .   SpliceAI=C|ABR|0.00|0.00|0.00|0.00|4|14|4|-25
17  1153657 .   A   C   .   .   SpliceAI=C|ABR|0.00|0.00|0.01|0.00|-44|3|4|-21
17  1153657 .   A   G   .   .   SpliceAI=G|ABR|0.00|0.00|0.00|0.00|43|-15|4|33
17  1153657 .   A   G   .   .   SpliceAI=G|ABR|0.00|0.00|0.01|0.00|3|-44|4|33
17  1153657 .   A   T   .   .   SpliceAI=T|ABR|0.00|0.00|0.04|0.00|43|-44|4|33
17  1153657 .   A   T   .   .   SpliceAI=T|ABR|0.00|0.00|0.19|0.00|43|-44|4|33

kishorejaganathan commented 1 year ago

Ah, thanks for bringing this to my attention. I accepted all genes that had the same number of exons and matching exon lengths between the two annotations. These genes meet those criteria but have some liftover issues in introns. You can ignore such genes, or run SpliceAI from scratch with hg38 annotations to avoid liftover issues.
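One way to "ignore such genes" could look like the sketch below: a first pass collects every gene that has at least one variant with conflicting annotations, and a second pass drops all records for those genes. The helper names are hypothetical, the record for the gene labeled HYPOTHETICAL is made up for illustration, and the in-memory dictionary is an assumption that would need replacing with a streaming approach for the full file.

```python
from collections import defaultdict


def split_record(line):
    """Return ((chrom, pos, ref, alt), gene, annotation) for one VCF line."""
    cols = line.split()
    annotation = cols[7].split("SpliceAI=", 1)[1]
    gene = annotation.split("|")[1]
    return (cols[0], cols[1], cols[3], cols[4]), gene, annotation


def genes_with_conflicts(lines):
    """Collect genes where one variant carries more than one annotation."""
    seen = defaultdict(set)
    for line in lines:
        variant, gene, annotation = split_record(line)
        seen[(variant, gene)].add(annotation)
    return {gene for (_, gene), anns in seen.items() if len(anns) > 1}


def drop_genes(lines, genes):
    """Keep only records whose gene is not in the excluded set."""
    return [l for l in lines if split_record(l)[1] not in genes]


records = [
    "17\t1153657\t.\tA\tT\t.\t.\tSpliceAI=T|ABR|0.00|0.00|0.04|0.00|43|-44|4|33",
    "17\t1153657\t.\tA\tT\t.\t.\tSpliceAI=T|ABR|0.00|0.00|0.19|0.00|43|-44|4|33",
    # Made-up record, not taken from the released file.
    "1\t100\t.\tA\tC\t.\t.\tSpliceAI=C|HYPOTHETICAL|0.00|0.00|0.00|0.00|1|2|3|4",
]
excluded = genes_with_conflicts(records)
kept = drop_genes(records, excluded)
print(sorted(excluded), len(kept))
```

Dropping the whole gene rather than picking one of the conflicting scores follows the suggestion above; there is no indication in this thread of which lifted-over record is the correct one.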