davmlaw opened this issue 2 years ago
I have been able to get it to run; the values returned for the above variant were:
{'ALLELE': 'G',
'SYMBOL': 'LDLR',
'DS_AG': '0.02',
'DS_AL': '0.98',
'DS_DG': '0.01',
'DS_DL': '0.00',
'DP_AG': '46',
'DP_AL': '31',
'DP_DG': '-19',
'DP_DL': '-49'}
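For reference, SpliceAI emits these values as a single pipe-delimited VCF INFO annotation (ALLELE|SYMBOL|DS_AG|DS_AL|DS_DG|DS_DL|DP_AG|DP_AL|DP_DG|DP_DL, per the SpliceAI README). A minimal sketch of unpacking one annotation into a dict like the one above:

```python
# Field order as documented in the SpliceAI README for the INFO annotation.
SPLICEAI_KEYS = ["ALLELE", "SYMBOL", "DS_AG", "DS_AL", "DS_DG", "DS_DL",
                 "DP_AG", "DP_AL", "DP_DG", "DP_DL"]

def parse_spliceai(info_value: str) -> dict:
    """Split one SpliceAI annotation (one ALT allele) into named fields."""
    return dict(zip(SPLICEAI_KEYS, info_value.split("|")))

record = parse_spliceai("G|LDLR|0.02|0.98|0.01|0.00|46|31|-19|-49")
print(record["DS_AL"])  # delta score for acceptor loss
```

Note that a multi-allelic site can carry several comma-separated annotations in one INFO field; this sketch handles a single allele's annotation only.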
Installation
pip install tensorflow
pip install spliceai
cd /data/annotation/fasta
wget --quiet -O - http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz | gzip -d | bgzip > hg19.fa.gz
samtools faidx hg19.fa.gz
wget --quiet -O - http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz | gzip -d | bgzip > hg38.fa.gz
samtools faidx hg38.fa.gz
Then running via:
spliceai -I spliceai_test.grch38.vcf -O spliceai_test.annotated.grch38.vcf -R /data/annotation/fasta/hg38.fa.gz -A grch38
TODO:
There is a cadd_raw_rankscore for every record in dbNSFP, so we can check whether that field exists to determine whether dbNSFP was applied to a variant.
Going forward, we should perhaps look for annotated VCFs missing that field (but also matching what SpliceAI will annotate), run them through the SpliceAI pipeline, then send the results to the bulk VCF importer as per normal.
For historical data, we could also perhaps dump variants that are missing dbNSFP.
I wrote out the records skipped by dbNSFP on VG test:
from snpdb.models import Variant, GenomeBuild
from annotation.models import AnnotationVersion, VariantAnnotationPipelineType
from annotation.annotation_version_querysets import get_variants_qs_for_annotation
from annotation.tasks.annotate_variants import write_qs_to_vcf
av = AnnotationVersion.latest(GenomeBuild.grch38())
qs = get_variants_qs_for_annotation(av, VariantAnnotationPipelineType.STANDARD, annotated=True)
qs_no_cadd = qs.filter(variantannotation__cadd_raw_rankscore__isnull=True)
write_qs_to_vcf("/tmp/splice_ai_variants.vcf", GenomeBuild.grch38(), qs_no_cadd)
I will run SpliceAI on it and see how it goes, how many records we get, etc.
# Running in screen on vg test
spliceai -I /tmp/splice_ai_variants.vcf -O /data/annotation/scratch/spliceai.grch38.vcf -R /data/annotation/fasta/hg38.fa.gz -A grch38
It seems to be extremely slow... it took a few minutes to process 100 records.
Well, it runs, but it's slow...
VG test has been running all weekend and has used 6,162 minutes of CPU time to process 38,889 records.
That works out to roughly 9.5 s of CPU time per variant, which is far too slow for running over millions of variants.
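A quick back-of-envelope check of those throughput numbers, extrapolated to a hypothetical few million variants on a single core:

```python
# Numbers copied from the run above.
cpu_minutes = 6162
records = 38_889

seconds_per_variant = cpu_minutes * 60 / records
print(f"{seconds_per_variant:.1f} s CPU per variant")  # ~9.5 s

# Extrapolate to 3 million variants (an assumed round figure, single core).
variants = 3_000_000
cpu_days = variants * seconds_per_variant / 3600 / 24
print(f"~{cpu_days:.0f} CPU-days")
```

Even divided across 8 cores that is still weeks of compute, which is why the GPU route below looked worth trying.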
I will try to get it running on a GPU and see how that goes. We might be able to get one in a server.
There is a SpliceAI lookup REST service:
https://spliceai-38-xwkwwwxdwq-uc.a.run.app/spliceai/?hg=38&variant=chr8-140300616-T-G
Or for the variant that caused the initial report:
https://spliceai-38-xwkwwwxdwq-uc.a.run.app/spliceai/?hg=38&variant=chr19-11113504-GCGCTGATGCCCTTCTCTC-G - this took 15.3s to return
But they probably wouldn't like us hammering it too much. Maybe we could only do it for certain variants, i.e. when they are looked up?
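If we did go the on-demand route, a lookup sketch might look like the following. The base URL and query parameters are copied from the examples above; treat them as an assumption rather than a stable public API:

```python
from urllib.parse import urlencode

# Base URL taken from the example lookups in this thread (assumed, not a
# documented stable endpoint).
BASE_URL = "https://spliceai-38-xwkwwwxdwq-uc.a.run.app/spliceai/"

def spliceai_lookup_url(hg: int, chrom: str, pos: int, ref: str, alt: str) -> str:
    """Build the REST lookup URL for one variant."""
    variant = f"{chrom}-{pos}-{ref}-{alt}"
    return BASE_URL + "?" + urlencode({"hg": hg, "variant": variant})

url = spliceai_lookup_url(38, "chr8", 140300616, "T", "G")
print(url)
# To actually fetch (slow for long indels, ~15 s observed above):
# import json; from urllib.request import urlopen
# result = json.load(urlopen(url, timeout=60))
```

We would want caching and a generous timeout around any real use of this.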
Benchmark on my desktop, 355 records:

CPU: 4 cores (8 threads) @ 3.6 GHz
real 6m31.575s
user 21m57.172s
sys 0m54.215s

GPU: Quadro P600 (384 CUDA cores, 2 GB RAM)
real 6m50.851s
user 7m1.310s
sys 0m14.843s
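Reading the two timings together: wall-clock time is nearly identical, but user (CPU) time differs sharply, which suggests the CPU run was saturating several cores while the GPU run barely used the host CPU and was likely bottlenecked by the small GPU. A quick computation with the numbers above:

```python
def seconds(minutes: float, secs: float) -> float:
    """Convert a time(1)-style 'XmY.Zs' reading to seconds."""
    return minutes * 60 + secs

cpu = {"real": seconds(6, 31.575), "user": seconds(21, 57.172)}
gpu = {"real": seconds(6, 50.851), "user": seconds(7, 1.310)}

# user/real approximates how many cores were busy on average.
print(f"CPU run parallelism ~{cpu['user'] / cpu['real']:.1f}x")  # ~3.4x
print(f"GPU run parallelism ~{gpu['user'] / gpu['real']:.1f}x")  # ~1.0x
```

So a Quadro P600 is no faster end-to-end than 4 CPU cores here; a larger GPU might change that picture.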
CUDA install
Had some issues: I purged the NVIDIA drivers, then installed them again.
Then followed these instructions:
Had to do the following first:
CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))
export LD_LIBRARY_PATH=$CUDNN_PATH/lib:$CONDA_PREFIX/lib/:$LD_LIBRARY_PATH
For this to work:
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
To see whether GPU is being used:
nvidia-smi - NVIDIA GPUs
rocm-smi - AMD GPUs
nvtop - shows live GPU usage
At the moment we use pre-calculated values for SpliceAI, which only cover "all possible substitutions (snv), 1 base insertions and 1-4 base deletions (indel) within genes".
We had a potential splice variant come through
NC_000019.10:g.11113506_11113523del 19:11113504 GCGCTGATGCCCTTCTCTC>G
https://github.com/illumina/SpliceAI
We'll run some benchmarks and see how long it takes to run, and can possibly modify the pipeline.