Open carolinehey opened 1 year ago
Could you give me the variant in VCF format?
Dear SpliceAI Staff,
I am currently facing an issue with a variant in my database. I ran SpliceAI locally for this variant with several values of the -D parameter, but it returned no annotation. Strangely, the same variant does produce a result on the SpliceAI website.
I am working on a script to create a database that annotates only new variants using HG38. To achieve this, I created a VCF file with an HG38 header using UCSC's chromosome length information for GRCh38/HG38 (https://genome.ucsc.edu/cgi-bin/hgTracks?chromInfoPage=).
My VCF file looks like this:
##fileformat=VCFv4.2
##fileDate=20231010
##reference=GRCh38/hg38
##contig=<ID=chr1,length=248956422>
##contig=<ID=chr2,length=242193529>
##contig=<ID=chr3,length=198295559>
##contig=<ID=chr4,length=190214555>
##contig=<ID=chr5,length=181538259>
##contig=<ID=chr6,length=170805979>
##contig=<ID=chr7,length=159345973>
##contig=<ID=chr8,length=145138636>
##contig=<ID=chr9,length=138394717>
##contig=<ID=chr10,length=133797422>
##contig=<ID=chr11,length=135086622>
##contig=<ID=chr12,length=133275309>
##contig=<ID=chr13,length=114364328>
##contig=<ID=chr14,length=107043718>
##contig=<ID=chr15,length=101991189>
##contig=<ID=chr16,length=90338345>
##contig=<ID=chr17,length=83257441>
##contig=<ID=chr18,length=80373285>
##contig=<ID=chr19,length=58617616>
##contig=<ID=chr20,length=64444167>
##contig=<ID=chr21,length=46709983>
##contig=<ID=chr22,length=50818468>
##contig=<ID=chrX,length=156040895>
##contig=<ID=chrY,length=57227415>
#CHROM POS ID REF ALT QUAL FILTER INFO
chr17 43125251 . C A . . .
chr17 43125257 . C A . . .
chr17 43125749 . C A . . .
The original file contains more entries, but the variant that did not yield a result is the third one listed above (chr17:43125749). I used the following command to run SpliceAI: spliceai -I new_calls.vcf -O teste.vcf -R /mnt/d/1-bioinfotools/HG38/hg38.fa -A grch38 -D 1000. In this command, new_calls.vcf is the VCF file shown above, and my HG38 fasta file was downloaded from UCSC. I tried different -D values and also ran it without the -D option.
The resulting output in teste.vcf was as follows:
##INFO=<ID=SpliceAI,Number=.,Type=String,Description="SpliceAIv1.3.1 variant annotation. These include delta scores (DS) and delta positions (DP) for acceptor gain (AG), acceptor loss (AL), donor gain (DG), and donor loss (DL). Format: ALLELE|SYMBOL|DS_AG|DS_AL|DS_DG|DS_DL|DP_AG|DP_AL|DP_DG|DP_DL">
#CHROM POS ID REF ALT QUAL FILTER INFO
chr17 43125251 . C A . . SpliceAI=A|BRCA1|0.00|0.00|0.11|0.10|-801|-343|-69|26
chr17 43125257 . C A . . SpliceAI=A|BRCA1|0.00|0.00|0.04|0.08|-813|-349|-75|20
chr17 43125749 . C A . . .
I also tried analyzing this variant in isolation and changing the region, but I consistently received similar results. Could this issue be related to a specific transcript? I noticed that the gene NBR2 is used for annotation on the SpliceAI website, but I couldn't find it in spliceai/annotations/grch38.txt. However, I think it should use the BRCA1 gene, since the variant is less than 400 bp from the first non-coding exon. Any insights or suggestions on resolving this issue would be greatly appreciated.
Thank you for your assistance.
The issue is due to the transcript annotations. SpliceAI uses RNA context and not DNA context, and it uses the annotation file to determine which parts of the DNA are transcribed (it does not assign scores for variants outside this region). You just need to add the transcript to annotations/grch38.txt or provide a custom annotation file via the -A parameter (in the same format as the existing annotation files).
Hi there @kishorejaganathan, yes, I am aware of that. The transcript for the BRCA1 gene is located in the file SpliceAI/spliceai/annotations/grch38.txt:
#NAME CHROM STRAND TX_START TX_END EXON_START EXON_END
BRCA1 17 - 43045628 43125483 43045628,43047642,43049120,43051062,43057051,43063332,43063873,43067607,43070927,43074330,43076487,43079333,43082403,43090943,43091434,43095845,43097243,43099774,43104121,43104867,43106455,43115725,43124016,43125270, 43045802,43047703,43049194,43051117,43057135,43063373,43063951,43067695,43071238,43074521,43076611,43079399,43082575,43091032,43094860,43095922,43097289,43099880,43104261,43104956,43106533,43115779,43124115,43125483,
The first exon ends at position chr17:43125483 and my variant is located at chr17:43125749, which is 266 bp away from that exon boundary, within the 'promoter' region. Shouldn't the argument -D 1000 have picked up the nearest gene within the -D range, and therefore the BRCA1 gene?
The annotation file acts like a filter first, so all variants outside TX_START-TX_END will not get annotated regardless of the choice of D (which comes into play much later).
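To make the filtering step concrete, here is a minimal sketch of the check described above. It is not SpliceAI's own code: the `Transcript` type and the `parse_annotation_line` and `overlapping_transcripts` helpers are hypothetical names, and it assumes a simple inclusive TX_START..TX_END comparison (the real tool's boundary handling may differ by one base). Only the BRCA1 coordinates quoted in this thread are used.

```python
from typing import List, NamedTuple


class Transcript(NamedTuple):
    name: str
    chrom: str
    strand: str
    tx_start: int
    tx_end: int


def parse_annotation_line(line: str) -> Transcript:
    # Columns follow the header: #NAME CHROM STRAND TX_START TX_END ...
    # Exon columns, if present, are ignored in this sketch.
    fields = line.split()
    return Transcript(fields[0], fields[1], fields[2], int(fields[3]), int(fields[4]))


def overlapping_transcripts(chrom: str, pos: int, transcripts: List[Transcript]) -> List[str]:
    # The annotation rows quoted above use "17", while the VCF uses "chr17".
    chrom = chrom[3:] if chrom.startswith("chr") else chrom
    # Assumption: inclusive bounds; the -D value plays no role at this stage.
    return [t.name for t in transcripts
            if t.chrom == chrom and t.tx_start <= pos <= t.tx_end]


brca1 = parse_annotation_line("BRCA1\t17\t-\t43045628\t43125483")

for pos in (43125251, 43125257, 43125749):
    hits = overlapping_transcripts("chr17", pos, [brca1])
    print(pos, hits if hits else "no overlapping transcript, left unannotated")
```

Under these assumptions the first two positions fall inside the BRCA1 transcript, while 43125749 lies 266 bp past TX_END and therefore matches nothing, whatever value of -D is given.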
Hello, I understand your explanation regarding why some variants are not scored, but none of the possibilities seem to explain why my variant is not scored. Do you have any suggestions? NM_000455.5:c.597+14delA