
Run Splice AI as command to catch indels #720

Open davmlaw opened 2 years ago

davmlaw commented 2 years ago

At the moment we use pre-calculated values for SpliceAI, which only cover "all possible substitutions (snv), 1 base insertions and 1-4 base deletions (indel) within genes"

We had a potential splice variant come through:

NC_000019.10:g.11113506_11113523del 19:11113504 GCGCTGATGCCCTTCTCTC>G

https://github.com/illumina/SpliceAI

We'll run some benchmarks and see how long it takes to run, and can possibly modify the pipeline.

davmlaw commented 7 months ago

Have been able to get it to run; the values returned for the above variant were:

{'ALLELE': 'G',
 'SYMBOL': 'LDLR',
 'DS_AG': '0.02',
 'DS_AL': '0.98',
 'DS_DG': '0.01',
 'DS_DL': '0.00',
 'DP_AG': '46',
 'DP_AL': '31',
 'DP_DG': '-19',
 'DP_DL': '-49'}
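
These are just the pipe-delimited fields of SpliceAI's INFO annotation split out by name. A minimal parsing sketch (the field order is from the SpliceAI README; the helper name is my own):

# Sketch: split one SpliceAI INFO entry into named fields
SPLICEAI_KEYS = ["ALLELE", "SYMBOL", "DS_AG", "DS_AL", "DS_DG", "DS_DL",
                 "DP_AG", "DP_AL", "DP_DG", "DP_DL"]

def parse_spliceai_entry(entry: str) -> dict:
    """e.g. parse_spliceai_entry("G|LDLR|0.02|0.98|0.01|0.00|46|31|-19|-49")"""
    return dict(zip(SPLICEAI_KEYS, entry.split("|")))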

Installation

pip install tensorflow
pip install spliceai 
cd /data/annotation/fasta
wget --quiet -O - http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz | gzip -d | bgzip > hg19.fa.gz
samtools faidx hg19.fa.gz
wget --quiet -O - http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz | gzip -d | bgzip > hg38.fa.gz
samtools faidx hg38.fa.gz

Then running via:

spliceai -I spliceai_test.grch38.vcf -O spliceai_test.annotated.grch38.vcf -R /data/annotation/fasta/hg38.fa.gz -A grch38
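
There is also a Python API for ad-hoc single-variant scoring - a sketch adapted from the "custom usage" example in the SpliceAI README (the Record shim comes from that example; the paths and contig naming here are assumptions):

from spliceai.utils import Annotator, get_delta_scores

# Minimal record shim as in the SpliceAI README example
class Record:
    def __init__(self, chrom, pos, ref, alts):
        self.chrom, self.pos, self.ref, self.alts = chrom, pos, ref, alts

ann = Annotator("/data/annotation/fasta/hg38.fa.gz", "grch38")
# contig naming may need to match the reference FASTA (e.g. "chr19" for the UCSC download)
record = Record("19", 11113504, "GCGCTGATGCCCTTCTCTC", ["G"])
# args per the README example: record, annotator, max distance, masking flag
print(get_delta_scores(record, ann, 50, 0))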

TODO:

davmlaw commented 7 months ago

There is a cadd_raw_rankscore for every record in dbNSFP - so we can just check whether that exists or not, to determine whether dbNSFP was applied to that variant.

Going forward - we should perhaps look for annotated VCFs missing that field (but perhaps also matching what SpliceAI will annotate), run them through the SpliceAI pipeline, then send the results to the bulk VCF importer as per normal

For historical data - we could also perhaps dump the variants that are missing dbNSFP?

davmlaw commented 7 months ago

I wrote out the records skipped by dbNSFP on vg test:

from snpdb.models import Variant, GenomeBuild
from annotation.models import AnnotationVersion, VariantAnnotationPipelineType
from annotation.annotation_version_querysets import get_variants_qs_for_annotation
from annotation.tasks.annotate_variants import write_qs_to_vcf

# Latest annotation version for GRCh38
av = AnnotationVersion.latest(GenomeBuild.grch38())
# Variants already annotated by the standard pipeline
qs = get_variants_qs_for_annotation(av, VariantAnnotationPipelineType.STANDARD, annotated=True)
# No cadd_raw_rankscore means dbNSFP skipped the record
qs_no_cadd = qs.filter(variantannotation__cadd_raw_rankscore__isnull=True)
write_qs_to_vcf("/tmp/splice_ai_variants.vcf", GenomeBuild.grch38(), qs_no_cadd)

Will run SpliceAI on it and see how it goes, how many records we get, etc.

# Running in screen on vg test
spliceai -I /tmp/splice_ai_variants.vcf -O /data/annotation/scratch/spliceai.grch38.vcf -R /data/annotation/fasta/hg38.fa.gz -A grch38

It seems to be extremely slow... it took a few minutes to do 100 records

davmlaw commented 7 months ago

Well, it runs, but it's slow...

VG test has been running all weekend and has used 6,162 minutes of CPU time to process 38,889 records.

That works out to roughly 10s of CPU time per variant (6,162 min × 60 / 38,889 ≈ 9.5s) - way too slow for running over millions of variants.

I will try to get it running on a GPU and see how that goes. We might be able to get one in a server.

There is a SpliceAI lookup REST service:

https://spliceai-38-xwkwwwxdwq-uc.a.run.app/spliceai/?hg=38&variant=chr8-140300616-T-G

Or for the variant that caused the initial report:

https://spliceai-38-xwkwwwxdwq-uc.a.run.app/spliceai/?hg=38&variant=chr19-11113504-GCGCTGATGCCCTTCTCTC-G - this took 15.3s to return

But they probably wouldn't like us hammering it too much. Maybe we could just do it for certain variants - i.e. when they are looked up?
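
If we went the on-demand route, a lookup against that service could be as simple as the sketch below (URL and query parameters as in the links above; the JSON response shape, timeout and error handling are assumptions):

import requests

SPLICEAI_LOOKUP_URL = "https://spliceai-38-xwkwwwxdwq-uc.a.run.app/spliceai/"

def spliceai_lookup(chrom: str, pos: int, ref: str, alt: str, hg: int = 38, timeout: float = 60) -> dict:
    """Query the SpliceAI lookup REST service for a single variant."""
    variant = f"chr{chrom}-{pos}-{ref}-{alt}"
    response = requests.get(SPLICEAI_LOOKUP_URL, params={"hg": hg, "variant": variant}, timeout=timeout)
    response.raise_for_status()
    return response.json()  # assuming the service returns JSON

# e.g. spliceai_lookup("19", 11113504, "GCGCTGATGCCCTTCTCTC", "G")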

davmlaw commented 7 months ago

Benchmark on my desktop, 355 records:

CPU - 4 cores (8 threads) @ 3.6 GHz

real    6m31.575s
user    21m57.172s
sys 0m54.215s

GPU - Quadro P600

384 CUDA cores, 2 GB RAM

real    6m50.851s
user    7m1.310s
sys 0m14.843s

CUDA install

Had some issues - purged the nvidia drivers then installed them again:

Then followed these instructions:

Had to do the following 1st:

CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))
export LD_LIBRARY_PATH=$CUDNN_PATH/lib:$CONDA_PREFIX/lib/:$LD_LIBRARY_PATH

That was needed for this check to work:

python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

davmlaw commented 7 months ago

splice_ai_test.vcf.gz

davmlaw commented 5 months ago

To see whether the GPU is being used:

nvidia-smi - Nvidia GPU
rocm-smi - AMD GPU
nvtop - shows GPU usage