Illumina / SpliceAI

A deep learning-based tool to identify splice variants

variant not scored #137

Open carolinehey opened 1 year ago

carolinehey commented 1 year ago

Hello, I understand your explanation regarding why some variants are not scored, but none of the possibilities seem to explain why my variant is not scored. Do you have any suggestions? NM_000455.5:c.597+14delA

kishorejaganathan commented 1 year ago

Could you give me the variant in VCF format?

rodrigodealexandre commented 1 year ago

Dear SpliceAI Staff,

I am currently facing an issue with a variant in my database. I attempted to run SpliceAI locally for this variant with several values of the -D parameter, but it returned no score. Strangely, the same variant does produce a result on the SpliceAI website.

I am working on a script to create a database that annotates only new variants using HG38. To achieve this, I created a VCF file with an HG38 header using UCSC's chromosome length size information for GRCh38/HG38 from https://genome.ucsc.edu/cgi-bin/hgTracks?chromInfoPage=.

My VCF file looks like this:

##fileformat=VCFv4.2
##fileDate=20231010
##reference=GRCh38/hg38
##contig=<ID=chr1,length=248956422>
##contig=<ID=chr2,length=242193529>
##contig=<ID=chr3,length=198295559>
##contig=<ID=chr4,length=190214555>
##contig=<ID=chr5,length=181538259>
##contig=<ID=chr6,length=170805979>
##contig=<ID=chr7,length=159345973>
##contig=<ID=chr8,length=145138636>
##contig=<ID=chr9,length=138394717>
##contig=<ID=chr10,length=133797422>
##contig=<ID=chr11,length=135086622>
##contig=<ID=chr12,length=133275309>
##contig=<ID=chr13,length=114364328>
##contig=<ID=chr14,length=107043718>
##contig=<ID=chr15,length=101991189>
##contig=<ID=chr16,length=90338345>
##contig=<ID=chr17,length=83257441>
##contig=<ID=chr18,length=80373285>
##contig=<ID=chr19,length=58617616>
##contig=<ID=chr20,length=64444167>
##contig=<ID=chr21,length=46709983>
##contig=<ID=chr22,length=50818468>
##contig=<ID=chrX,length=156040895>
##contig=<ID=chrY,length=57227415>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
chr17   43125251    .   C   A   .   .   .
chr17   43125257    .   C   A   .   .   .
chr17   43125749    .   C   A   .   .   .

The original file contains more entries, but the variant that did not yield a result is the third record. I used the following command to run SpliceAI: spliceai -I new_calls.vcf -O teste.vcf -R /mnt/d/1-bioinfotools/HG38/hg38.fa -A grch38 -D 1000. In this command, new_calls.vcf is the VCF file shown above, and my HG38 FASTA file was downloaded from UCSC. I also tried different -D values and running it without the -D option.

The resulting output in teste.vcf was as follows:

##INFO=<ID=SpliceAI,Number=.,Type=String,Description="SpliceAIv1.3.1 variant annotation. These include delta scores (DS) and delta positions (DP) for acceptor gain (AG), acceptor loss (AL), donor gain (DG), and donor loss (DL). Format: ALLELE|SYMBOL|DS_AG|DS_AL|DS_DG|DS_DL|DP_AG|DP_AL|DP_DG|DP_DL">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
chr17   43125251    .   C   A   .   .   SpliceAI=A|BRCA1|0.00|0.00|0.11|0.10|-801|-343|-69|26
chr17   43125257    .   C   A   .   .   SpliceAI=A|BRCA1|0.00|0.00|0.04|0.08|-813|-349|-75|20
chr17   43125749    .   C   A   .   .   .

I also tried analyzing this variant in isolation and changing the region, but I consistently received similar results. Could this issue be related to a specific transcript? I noticed that the gene NBR2 is used for annotation on the SpliceAI website, but I couldn't find it in spliceai/annotations/grch38.txt. However, I think it should use the BRCA1 gene, since the variant is less than 400 bp from the first non-coding exon. Any insights or suggestions on resolving this issue would be greatly appreciated.

Thank you for your assistance.

kishorejaganathan commented 1 year ago

The issue is due to the transcript annotations. SpliceAI uses RNA context and not DNA context, and it uses the annotation file to determine which parts of the DNA are transcribed (it does not assign scores for variants outside this region). You just need to add the transcript to annotations/grch38.txt or provide a custom annotation file via the -A parameter (in the same format as the existing annotation files).
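
As a rough illustration of what "the same format" means, here is a minimal Python sketch that builds one tab-separated annotation row. The column layout is taken from the BRCA1 row quoted later in this thread (#NAME, CHROM, STRAND, TX_START, TX_END, EXON_START, EXON_END, with comma-terminated exon lists). The helper name format_annotation_row and the EXAMPLE_GENE coordinates are hypothetical placeholders, not real annotation data.

```python
from typing import Sequence


def format_annotation_row(
    name: str,
    chrom: str,
    strand: str,
    tx_start: int,
    tx_end: int,
    exon_starts: Sequence[int],
    exon_ends: Sequence[int],
) -> str:
    """Build one tab-separated row in the layout of the bundled annotation files.

    Hypothetical helper: the layout is inferred from the BRCA1 row in this
    thread, where each exon list ends with a trailing comma.
    """
    if len(exon_starts) != len(exon_ends):
        raise ValueError("exon_starts and exon_ends must have the same length")
    starts = "".join(f"{s}," for s in exon_starts)
    ends = "".join(f"{e}," for e in exon_ends)
    return "\t".join([name, chrom, strand, str(tx_start), str(tx_end), starts, ends])


# Placeholder values only; replace with the real transcript coordinates.
row = format_annotation_row(
    "EXAMPLE_GENE", "17", "+", 1000, 2000, [1000, 1500], [1200, 2000]
)
print(row)
```

A row like this could be appended to a copy of grch38.txt, and the path to that copy passed to -A in place of grch38.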

rodrigodealexandre commented 1 year ago

Hi there @kishorejaganathan, yes, I am aware of that. The transcript for the BRCA1 gene is present in the file SpliceAI/spliceai/annotations/grch38.txt:

#NAME   CHROM   STRAND  TX_START    TX_END  EXON_START  EXON_END
BRCA1   17  -   43045628    43125483    43045628,43047642,43049120,43051062,43057051,43063332,43063873,43067607,43070927,43074330,43076487,43079333,43082403,43090943,43091434,43095845,43097243,43099774,43104121,43104867,43106455,43115725,43124016,43125270,    43045802,43047703,43049194,43051117,43057135,43063373,43063951,43067695,43071238,43074521,43076611,43079399,43082575,43091032,43094860,43095922,43097289,43099880,43104261,43104956,43106533,43115779,43124115,43125483,

The first exon ends at position chr17:43125483, and my variant is located at chr17:43125749, which is 266 bp beyond that exon boundary, within the 'promoter' region. Shouldn't the argument -D 1000 have picked up the nearest gene within the -D range, and therefore the BRCA1 gene?

kishorejaganathan commented 1 year ago

The annotation file acts like a filter first, so all variants outside TX_START-TX_END will not get annotated regardless of the choice of D (which comes into play much later).
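
To make the filtering step concrete, here is a minimal Python sketch of the check described above, applied to the three variants from this thread. It is an illustration of the idea, not SpliceAI's actual code: the names Transcript, parse_annotation_row, and in_transcript are hypothetical, and the exact boundary convention (inclusive or exclusive, 0- or 1-based) is an assumption that does not affect these three positions.

```python
from typing import NamedTuple


class Transcript(NamedTuple):
    name: str
    chrom: str
    strand: str
    tx_start: int
    tx_end: int


def strip_chr(chrom: str) -> str:
    """Normalise 'chr17' and '17' to the same contig name."""
    return chrom[3:] if chrom.startswith("chr") else chrom


def parse_annotation_row(row: str) -> Transcript:
    """Read the first five columns of an annotation row (exon columns ignored)."""
    fields = row.split()
    return Transcript(fields[0], fields[1], fields[2], int(fields[3]), int(fields[4]))


def in_transcript(tx: Transcript, chrom: str, pos: int) -> bool:
    """True if the position lies inside TX_START-TX_END on the same contig.

    Assumption: both ends are treated as inclusive here.
    """
    return strip_chr(chrom) == strip_chr(tx.chrom) and tx.tx_start <= pos <= tx.tx_end


# First five columns of the BRCA1 row quoted earlier in this thread.
brca1 = parse_annotation_row("BRCA1\t17\t-\t43045628\t43125483")

for pos in (43125251, 43125257, 43125749):
    status = "inside" if in_transcript(brca1, "chr17", pos) else "outside"
    print(f"chr17:{pos} is {status} the BRCA1 transcript range")
```

Under these assumptions, chr17:43125749 lies past TX_END (43125483), so it is dropped before -D is ever considered, which matches the unscored third record in the output above.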