Illumina / SpliceAI

A deep learning-based tool to identify splice variants

optimisation, run recommendations #18

Closed thedam closed 5 years ago

thedam commented 5 years ago

Hey, I was finally able to run SpliceAI on my server (after reinstalling conda). It uses 64 CPUs but it's running really slowly. Is there anything I can do to speed it up? My VCFs have ~120,000 variants. Maybe I should remove the variants that fall in the middle of exons? Does it cache variants it has already encountered, so that the same variants in other samples won't be processed again?

Is this warning important?: UserWarning: No training configuration found in save file: the model was *not* compiled. Compile it manually.

Here is an example of my ongoing output:

spliceai -I xxxx_final.vcf -O output.vcf -R /mnt/ssd_01/refs/hs37d5_noHap.fa -A grch37
Using TensorFlow backend.
WARNING: Logging before flag parsing goes to stderr.
W0726 10:16:57.406675 140344263636800 deprecation_wrapper.py:119] From /home/damian/anaconda3/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

W0726 10:16:57.473724 140344263636800 deprecation_wrapper.py:119] From /home/damian/anaconda3/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

W0726 10:16:57.605217 140344263636800 deprecation_wrapper.py:119] From /home/damian/anaconda3/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:131: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

W0726 10:16:57.605415 140344263636800 deprecation_wrapper.py:119] From /home/damian/anaconda3/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:133: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.

W0726 10:17:05.270048 140344263636800 deprecation_wrapper.py:119] From /home/damian/anaconda3/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

2019-07-26 10:17:05.270867: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  SSE4.1 SSE4.2
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2019-07-26 10:17:05.323606: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 1995195000 Hz
2019-07-26 10:17:05.335810: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5632022d3230 executing computations on platform Host. Devices:
2019-07-26 10:17:05.335847: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
OMP: Info #212: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #210: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: 0-63
OMP: Info #156: KMP_AFFINITY: 64 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 4 packages x 8 cores/pkg x 2 threads/core (32 total cores)
OMP: Info #214: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 32 maps to package 0 core 0 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 1 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 36 maps to package 0 core 1 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 8 maps to package 0 core 2 thread 0 
(...)
OMP: Info #171: KMP_AFFINITY: OS proc 15 maps to package 3 core 8 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 47 maps to package 3 core 8 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 19 maps to package 3 core 17 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 51 maps to package 3 core 17 thread 1 
(...)
OMP: Info #171: KMP_AFFINITY: OS proc 31 maps to package 3 core 25 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 63 maps to package 3 core 25 thread 1 
OMP: Info #250: KMP_AFFINITY: pid 56609 tid 56609 thread 0 bound to OS proc set 0
2019-07-26 10:17:05.346735: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
2019-07-26 10:17:07.518215: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
/home/damian/anaconda3/lib/python3.7/site-packages/keras/engine/saving.py:292: UserWarning: No training configuration found in save file: the model was *not* compiled. Compile it manually.
  warnings.warn('No training configuration found in save file: '
OMP: Info #250: KMP_AFFINITY: pid 56609 tid 56790 thread 1 bound to OS proc set 4
OMP: Info #250: KMP_AFFINITY: pid 56609 tid 56795 thread 4 bound to OS proc set 16
OMP: Info #250: KMP_AFFINITY: pid 56609 tid 56793 thread 2 bound to OS proc set 8
(...)
thedam commented 5 years ago

Ah, OK, I've found the answer. At first it's not really clear what is what and where to find it... It's behind a link in the README:

Note: The annotations for all possible SNVs within genes are available here for download.

With some help from the CLI instructions, I figured out to run this command: ./bs project download --id 66029966 -o down/

There are VCFs with scores for hg19:

down/SpliceAI_supplement_ds.79b22cc932df4db8848c87afd19d78d3$ ls
exome_spliceai_scores.vcf.gz  
gencode_gtex_train.tsv 
gencode_test.tsv  
gencode_train.tsv  
gtex_junctions 
lincrna.tsv  
README  
whole_genome_filtered_spliceai_scores.vcf.gz

Can SpliceAI use this data directly, or should I write my own scripts?

kishorejaganathan commented 5 years ago

(regarding 64 CPUs) I'm not sure SpliceAI is capable of using multiprocessing to speed things up, unless you've made code changes. On a single CPU, it scores around 4K variants per hour; on a single GPU, the number is around 25K.
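A back-of-the-envelope estimate for the ~120,000-variant VCF mentioned above, using the rough throughput figures quoted here (these rates are approximations from this thread, not benchmarks):

```python
def estimated_hours(n_variants: int, variants_per_hour: float) -> float:
    """Estimated wall-clock hours to score n_variants at a given rate."""
    return n_variants / variants_per_hour

n = 120_000  # size of the VCF discussed in this issue
print(f"CPU: ~{estimated_hours(n, 4_000):.0f} h")   # prints "CPU: ~30 h"
print(f"GPU: ~{estimated_hours(n, 25_000):.1f} h")  # prints "GPU: ~4.8 h"
```

So a full CPU run of this VCF is on the order of days unless the SNVs are handled via the prescored files instead.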

(caching) No, SpliceAI does not cache any variants.

(user warning) No, it is not important - you can ignore it.

(regarding the prescored variants) SpliceAI cannot use this data directly at the moment. That is a good suggestion though, and I will consider adding that functionality in the next release. Right now, what we recommend is to use the tool to score only INDELs and use the prescored list for all SNV annotations (since we've covered all SNVs). The file you're interested in is whole_genome_filtered_spliceai_scores.vcf.gz . We scored all possible SNVs from TSS to stop of GENCODE canonical genes. To keep the file size small, we've discarded variants with scores less than 0.1.
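A minimal sketch of that workflow in pure Python: split a VCF into SNVs and indels, run SpliceAI on the indel file only, and look up SNVs in the prescored file by (chrom, pos, ref, alt). The function names are illustrative, not part of SpliceAI; it assumes single-allele ALT fields (multi-allelic records would need to be split first), and a real pipeline would more likely use bcftools or pysam:

```python
import gzip

def is_snv(ref: str, alt: str) -> bool:
    """A variant is an SNV when REF and ALT are both single bases."""
    return len(ref) == 1 and len(alt) == 1

def load_prescored(path):
    """Index a prescored SpliceAI VCF by (chrom, pos, ref, alt) -> INFO.
    Assumes one ALT allele per record, as in the released score files."""
    scores = {}
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            chrom, pos, _id, ref, alt, _q, _f, info = line.rstrip("\n").split("\t")[:8]
            scores[(chrom, pos, ref, alt)] = info
    return scores

def split_vcf(path, snv_out, indel_out):
    """Write SNVs and indels to separate VCFs; headers go to both.
    Run SpliceAI on the indel file and annotate SNVs from the lookup."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh, \
         open(snv_out, "w") as snvs, open(indel_out, "w") as indels:
        for line in fh:
            if line.startswith("#"):
                snvs.write(line)
                indels.write(line)
                continue
            fields = line.split("\t")
            ref, alt = fields[3], fields[4]
            (snvs if is_snv(ref, alt) else indels).write(line)
```

In practice something like `bcftools view -v snps` / `bcftools view -v indels` does the splitting more robustly, including normalization of multi-allelic sites.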

GuoFengWang commented 1 year ago

Hi, I see that there are two types of prescored files in the dataset (spliceai_scores.masked.indel.hg19.vcf.gz and spliceai_scores.raw.indel.hg19.vcf.gz). What is the difference between these two files, and can I use these prescored indel files to annotate my own indel variants directly? Many thanks @kishorejaganathan

kishorejaganathan commented 1 year ago

From FAQ #2: The raw files also include splicing changes corresponding to strengthening annotated splice sites and weakening unannotated splice sites, which are typically much less pathogenic than weakening annotated splice sites and strengthening unannotated splice sites. The delta scores of such splicing changes are set to 0 in the masked files. We recommend using raw files for alternative splicing analysis and masked files for variant interpretation.
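The zeroing rule described in that FAQ answer can be sketched as follows. This is a hypothetical illustration of the stated logic, not SpliceAI's actual code; the function and its arguments are invented for clarity:

```python
def mask_delta(delta: float, site_is_annotated: bool, is_gain: bool) -> float:
    """Illustrative masking rule from the FAQ (hypothetical helper):
    zero out splicing changes that strengthen (gain at) an annotated
    splice site or weaken (loss at) an unannotated one; keep the
    remaining, typically more pathogenic, categories unchanged."""
    strengthens_annotated = site_is_annotated and is_gain
    weakens_unannotated = (not site_is_annotated) and (not is_gain)
    if strengthens_annotated or weakens_unannotated:
        return 0.0
    return delta
```

Under this rule, a masked file keeps only loss scores at annotated sites and gain scores at unannotated sites, which is why it is the recommended input for variant interpretation.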