Illumina / SpliceAI

A deep learning-based tool to identify splice variants
Other
409 stars 161 forks source link

SpliceAI output.vcf equals input.vcf #29

Closed ghost closed 4 years ago

ghost commented 4 years ago

Hi all,

I ran spliceAI with the following parameters: spliceai -I 0000.vcf -O out.vcf -R hg19.fa -A grch37. The output vcf is exactly identical to the input vcf. Also the column with the delta scores is completely missing.

What my actual plan was: Finding variants in my sample.vcf that are NOT in the spliceai_scores.raw.snv.hg19.vcf.gz by using bcftools isec, resulting in 0000.vcf.

Then I wanted to let spliceAI predict the splicing sites of these "unknown" variants in 0000.vcf and write them to out.vcf

Am I missing something or do I follow a wrong idea? It would be nice to hear your suggestions on this issue. Best regards!

kishorejaganathan commented 4 years ago

SpliceAI only scores variants within genes and completely ignores intergenic variants (genes are defined in the annotations/grch37.txt file). My guess is that the variants you are providing are either intergenic or belong to genes that don't exist in the file. Could you share with me a few more details about the variants so that I can better advise you?

ghost commented 4 years ago

Thanks for the quick reply!

Actually, I thought that the variants in intergenic regions might be also of great interest since it's known that many SNVs associated with traits are harbored by these regions.

The sample.vcf in this case is the result of a variant calling on an exome BAM-file. So it should carry genic variants only but maybe the 'unknown' variants in 0000.vcf might be due to noise, e.g. unspecific exome capture? Alternatively, the genes harboring the variants don't exist in the annotation file?

Would you suggest modifying the annotation file? Actually, I was planning to repeat the whole procedure for a genomic vcf since I was especially interested in 'unknown' variants. However, this might not be useful if I understood you correctly.

kishorejaganathan commented 4 years ago

Yes, modifying the annotation file is the easiest way forward. More precisely, you can add lines to the annotation file with the genes/isoforms that you are interested in that are missing currently and you should get scores for variants in those genes/isoforms in the output. The format is fairly intuitive, you can check one or two lines in the existing annotation file just to make sure you understood it right.

SpliceAI uses context on the RNA and not the DNA, so I'm not sure if using it to evaluate intergenic variants is worthwhile. While intergenic variants are known to affect gene expression a lot, I am not sure if they have a serious role in affecting splicing.

ghost commented 4 years ago

Thanks for your help! I followed your advice. The output file is not empty anymore. Some of the lines contain the required delta scores while others still contain informations of the input file. More severly, spliceAI stops writing to the output-VCF after some minutes (namely: at the gene "SPCS2P4") with the following error message:

Traceback (most recent call last): File "/home/nailufra/.local/bin/spliceai", line 10, in sys.exit(main()) File "/home/nailufra/.local/lib/python3.6/site-packages/spliceai/main.py", line 72, in main scores = get_delta_scores(record, ann, args.D, args.M) File "/home/nailufra/.local/lib/python3.6/site-packages/spliceai/utils.py", line 132, in get_delta_scores dist_ann = ann.get_pos_data(idxs[i], record.pos) File "/home/nailufra/.local/lib/python3.6/site-packages/spliceai/utils.py", line 62, in get_pos_data dist_exon_bdry = min(np.union1d(self.exon_starts[idx], self.exon_ends[idx])-pos, key=abs) ValueError: min() arg is an empty sequence

What I did so far: 1) Downloaded GTF: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_24/GRCh37_mapping/gencode.v24lift37.basic.annotation.gtf.gz 2) extracting the start and end information of each gene entry in the GTF (filtered all lines where third columns is "gene") 3) extracting the information on start and end position of "exon" entries (filtered all lines where third columns is "exon"), grouped by gene and put them comma-separated into a EXON_START and EXON_END column, respectively 4) merged the gene and exon information into one data frame 5) ´bedtools intersect´ to filter the data frame (see 4)) with the coordinates of the vcf2bed-transformed 0000.vcf file (see post above). 6) rearrange columns to get the genePred format 7) append file to grch37.txt (I also tried to only save the filtered file to grch37.txt and deleted the original entries - same error!)

To me the error message is not completely clear. It seems that the arrangement of the exon start and end positions might be wrong for this specific gene? Would be nice to hear your suggestions on this!

kishorejaganathan commented 4 years ago

It looks like you haven't added the extra lines to the annotation files in the right format. Could you copy/paste a few lines of the new genes that you have added to grch37.txt so that I can better advise you?

ghost commented 4 years ago

The first three lines/genes of the resulting genePred (see bullet point 6 in prev. post) appended to grch37.txt (they're tab-separated in the file but appear to be only space-separated in this post):

RP4-669L17.10 1 + 317720 453948 324756, 327552 326514, 328453 MXRA8 1 - 1288069 1297157 1288069, 1288071, 1288072, 1288072, 1289228, 1289228, 1289228, 1289228, 1289410, 1289410, 1289410, 1289410, 1289573, 1289573, 1289573, 1289573, 1289734, 1289734, 1289734, 1289734, 1290062, 1290062, 1290062, 1290062, 1290624, 1290624, 1290624, 1290624, 1290830, 1290830, 1290830, 1292061, 1292061, 1292061, 1292061, 1293836, 1293836, 1293836, 1296622 1288712, 1289009, 1289009, 1289009, 1289308, 1289308, 1289308, 1289326, 1289486, 1289486, 1289486, 1289486, 1289612, 1289612, 1289612, 1289612, 1289889, 1289889, 1289889, 1289889, 1290532, 1290532, 1290532, 1290532, 1290725, 1290725, 1290725, 1290725, 1291132, 1291132, 1291132, 1292084, 1292084, 1292084, 1292084, 1293915, 1293923, 1294174, 1297157 RP1-140A9.1 1 + 1822910 1824097 1822910, 1823573 1823290, 1824097

SpliceAI throws the error when processing a variant at chr: 1 and pos: 28422847. The variant is harbored by the gene "SPCS2P4". The appended line in grch37.txt corresponding to the variant causing the error is:

SPCS2P4 1 - 28422253 28422933 28422253 28422933

EDIT: Assuming that the error might be caused by 'single-exon genes' (SEGs) such as SPCS2P4, I removed all lines corresponding to SEGs and the spliceAI run finished successfully!

There are still two questions remaining with this issue: 1) What might be the problem with these entries? 2) Assuming that the GENCODE V24 annotation was used as described on the main page, why are some of the entries not present in the provided grch37.txt but in gencode.v24lift37.basic.annotation.gtf.gz?

kishorejaganathan commented 4 years ago

Here are the answers to your questions:

  1. SpliceAI also works for single exon genes. We followed the UCSC table browser annotation, which also adds a comma after the last exon. For example:

OR4F5 1 + 69090 70008 69090, 70008,

The parser inside SpliceAI implicitly uses this, and is the reason behind the error that you are facing. Please add a , at the end of each EXON_START and EXON_END entry (this statement applies for multiple exon genes as well). For example,

SPCS2P4 1 - 28422253 28422933 28422253 28422933

should be changed to

SPCS2P4 1 - 28422253 28422933 28422253, 28422933,

and everything should work.

  1. We further filtered the gene list to include only canonical transcripts (when multiple transcripts are available for the same gene, we picked the one with the largest coding sequence).
ghost commented 4 years ago

Now everything works fine - thank you very much for your generous help!