Whole genome sequencing data

Can you share some of the characteristics of your sample(s)? Normally find_intersecting_snps.py does not take this long. However, it can be slow if (1) you have a very high density of SNPs, (2) you are using many samples, or (3) you have long reads.

The slowness typically happens when long reads overlap a lot of SNPs. The reason is that, for each read that overlaps SNPs that are polymorphic in your dataset, WASP generates all allelic combinations of that read. For example, if a read overlaps 10 biallelic SNPs, then 2^10 = 1024 combinations of alleles must be considered. This can be avoided if you have phased SNPs, in which case you should provide the haplotypes.h5 file as an argument to find_intersecting_snps.py. When haplotypes are provided, WASP only needs to consider the combinations that occur in the existing haplotypes, rather than all possible combinations of alleles (many of which do not actually exist in your dataset).
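To make the difference in scale concrete, here is a minimal sketch in Python. It is not WASP's actual implementation; the function names `allelic_combinations` and `haplotype_combinations` and the toy haplotype strings are made up for illustration. It only counts how many allele combinations would need to be considered for a single read with and without phasing information.

```python
from itertools import product


def allelic_combinations(snp_alleles):
    """Enumerate every possible combination of alleles across the SNPs
    that a read overlaps (hypothetical helper, not part of WASP)."""
    return list(product(*snp_alleles))


def haplotype_combinations(haplotypes):
    """Return the distinct allele combinations that actually occur in a
    set of phased haplotypes (hypothetical helper, not part of WASP)."""
    # dict.fromkeys removes duplicates while preserving first-seen order
    return list(dict.fromkeys(tuple(h) for h in haplotypes))


# A read overlapping 10 biallelic SNPs, each with alleles (ref, alt).
snps = [("A", "G")] * 10
all_combos = allelic_combinations(snps)
print(len(all_combos))  # 1024

# Toy phased data: three samples give six haplotypes over the same
# 10 SNPs, several of which are identical.
haplotypes = [
    "AAAAAAAAAA",
    "GGGGGGGGGG",
    "AAAAAAAAAA",
    "AAAAAGGGGG",
    "GGGGGGGGGG",
    "AAAAAAAAAA",
]
hap_combos = haplotype_combinations(haplotypes)
print(len(hap_combos))  # 3
```

In this toy case the unphased enumeration yields 1024 combinations, while the phased haplotypes yield only 3. The number of distinct haplotype combinations can never exceed the number of haplotypes in the dataset, whereas the unphased count doubles with every additional SNP the read overlaps.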

It can also be faster to run find_intersecting_snps.py one sample at a time. This is a good idea if you plan to focus only on allelic imbalance (e.g. for ASE estimation). Unfortunately, if you plan to run the combined haplotype test, you will need to run all the samples together to avoid potential biases in read depths between samples.
