optimisation, run recommendations

thedam commented 5 years ago

Hey, I was finally able to run SpliceAI on my server (after reinstalling conda). It uses 64 CPUs but it's really slow. Is there anything I can do to speed it up? My VCFs have ~120,000 variants. Maybe I should remove the variants that are in the middle of exons? Does it cache already-encountered variants somewhere (so the same variants in other samples won't be processed again)?

Is this warning important?: UserWarning: No training configuration found in save file: the model was *not* compiled. Compile it manually.

here is my example ongoing output:

Ah, OK, I've found the answer. At first it's not really clear what is what and where it is... it's somewhere under a link from the README:

Note: The annotations for all possible SNVs within genes are available here for download.

With some help from the CLI instructions, I figured out how to run this command: ./bs project download --id 66029966 -o down/

There are VCFs with scores for hg19:

down/SpliceAI_supplement_ds.79b22cc932df4db8848c87afd19d78d3$ ls

Can SpliceAI use this data directly, or should I write my own scripts?

(regarding 64 CPUs) I'm not sure SpliceAI is capable of using multiprocessing to speed things up, unless you've made code changes. On a single CPU it scores around 4K variants per hour; on a single GPU the number is around 25K.
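
As a rough back-of-the-envelope check, those throughput figures can be turned into an estimated runtime for the ~120,000-variant VCFs from the question. This is only a sketch: it assumes the quoted rates hold steadily for your data, and the function and constant names are made up for illustration.

```python
def estimated_hours(n_variants: int, variants_per_hour: float) -> float:
    """Rough wall-clock estimate, assuming a constant scoring rate."""
    return n_variants / variants_per_hour


N_VARIANTS = 120_000  # approximate VCF size from the question
CPU_RATE = 4_000      # variants/hour on a single CPU (figure quoted in the reply)
GPU_RATE = 25_000     # variants/hour on a single GPU (figure quoted in the reply)

cpu_hours = estimated_hours(N_VARIANTS, CPU_RATE)
gpu_hours = estimated_hours(N_VARIANTS, GPU_RATE)
print(f"single CPU: ~{cpu_hours:.1f} h, single GPU: ~{gpu_hours:.1f} h")
```

So at these rates a single VCF of that size would take on the order of 30 hours on one CPU and about 5 hours on one GPU.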

(caching) No, SpliceAI does not cache any variants.
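
Since there is no built-in cache, one workaround might be to collapse the variants from all samples into a unique list before scoring, so each variant is scored only once. The sketch below is not part of SpliceAI; the `unique_variants` helper, the sample names, and the variants are all hypothetical.

```python
def unique_variants(per_sample_variants):
    """Collapse per-sample variant lists into one sorted list of unique
    (chrom, pos, ref, alt) keys."""
    seen = set()
    for variants in per_sample_variants.values():
        seen.update(variants)
    return sorted(seen)


# Made-up variants for two hypothetical samples; one variant is shared.
samples = {
    "sampleA": [("1", 1000, "A", "G"), ("2", 3000, "C", "CTT")],
    "sampleB": [("1", 1000, "A", "G"), ("3", 500, "T", "C")],
}

to_score = unique_variants(samples)
print(f"{len(to_score)} unique variants to score instead of 4 records")
```

The scores for the unique list could then be joined back onto each sample by the same (chrom, pos, ref, alt) key.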

(user warning) No, it is not important - you can ignore it.

(regarding the prescored variants) SpliceAI cannot use this data directly at the moment. That is a good suggestion though, and I will consider adding that functionality in the next release. Right now, what we recommend is to use the tool to only score INDELs and use the prescored list for all SNV annotations (since we've covered all SNVs). The file you're interested in is whole_genome_filtered_spliceai_scores.vcf.gz. We scored all possible SNVs from TSS start to stop of GENCODE canonical genes. To keep the file size small, we've discarded variants with scores less than 0.1.
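
The recommended workflow (prescored list for SNVs, the tool for INDELs) implies splitting a VCF by variant type first. Here is a minimal sketch of that split in plain Python, assuming an uncompressed, tab-separated VCF with REF and ALT in columns 4 and 5. The helper names and the example records are made up, and treating a multi-allelic record as an SNV only when every ALT is a single base is my own assumption.

```python
import io

BASES = set("ACGT")


def is_snv(ref: str, alt: str) -> bool:
    """True if REF is one base and every ALT allele is one base."""
    alts = alt.split(",")
    return (
        len(ref) == 1
        and ref.upper() in BASES
        and all(len(a) == 1 and a.upper() in BASES for a in alts)
    )


def split_vcf(lines):
    """Return (header, snv_records, other_records) from VCF lines."""
    header, snvs, others = [], [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            header.append(line)
            continue
        fields = line.split("\t")
        ref, alt = fields[3], fields[4]
        (snvs if is_snv(ref, alt) else others).append(line)
    return header, snvs, others


# Tiny made-up VCF, held in memory for illustration.
example = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t1000\t.\tA\tG\t.\t.\t.\n"
    "1\t2000\t.\tAT\tA\t.\t.\t.\n"
    "2\t3000\t.\tC\tCTT\t.\t.\t.\n"
    "2\t4000\t.\tG\tA,T\t.\t.\t.\n"
)
header, snvs, others = split_vcf(example)
print(len(snvs), "SNV records to look up in the prescored file")
print(len(others), "non-SNV records to score with the tool")
```

Keep in mind that an SNV missing from the filtered prescored file is not necessarily unscored: per the reply above, variants with scores below 0.1 were discarded, and only positions within GENCODE canonical genes were covered.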

Hi, I see that there are two types of prescored files in the dataset (spliceai_scores.masked.indel.hg19.vcf.gz and spliceai_scores.raw.indel.hg19.vcf.gz). What is the difference between these two files, and can I use these prescored indel files to annotate my own indel variants directly? Many thanks @kishorejaganathan

From FAQ #2: The raw files also include splicing changes corresponding to strengthening annotated splice sites and weakening unannotated splice sites, which are typically much less pathogenic than weakening annotated splice sites and strengthening unannotated splice sites. The delta scores of such splicing changes are set to 0 in the masked files. We recommend using raw files for alternative splicing analysis and masked files for variant interpretation.
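
To make the raw vs. masked distinction concrete, here is a toy illustration of the rule as worded in the FAQ quote. It is not SpliceAI's actual code: the `masked_delta` function, its arguments, and the example scores are all hypothetical.

```python
def masked_delta(delta: float, change: str, site_is_annotated: bool) -> float:
    """Apply the masking rule from the FAQ quote to one delta score.

    change: "gain" (splice site strengthened) or "loss" (splice site weakened).
    """
    if change not in ("gain", "loss"):
        raise ValueError("change must be 'gain' or 'loss'")
    strengthens_annotated = change == "gain" and site_is_annotated
    weakens_unannotated = change == "loss" and not site_is_annotated
    if strengthens_annotated or weakens_unannotated:
        return 0.0  # typically less pathogenic, so zeroed in the masked files
    return delta    # kept: annotated-site loss or unannotated-site gain


# Made-up delta scores covering the four combinations.
for delta, change, annotated in [
    (0.45, "loss", True),
    (0.45, "gain", True),
    (0.30, "gain", False),
    (0.30, "loss", False),
]:
    site = "annotated" if annotated else "unannotated"
    print(f"{change} at {site} site: raw={delta}, masked={masked_delta(delta, change, annotated)}")
```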