bmvdgeijn / WASP

WASP: allele-specific pipeline for unbiased read mapping and molecular QTL discovery
Apache License 2.0

Whole genome sequencing data #111

Open maggietsui opened 2 years ago

maggietsui commented 2 years ago

Hello,

Is the WASP mapping pipeline suitable/recommended to be used on WGS data? I have followed some suggestions here to split some of the steps by chromosome since the files are large. However, find_intersecting_snps.py still takes 2+ days to run on a cluster per chromosome for a single sample. For each job I allotted 6 cores, 16G per core. Thanks for your time.

gmcvicker commented 2 years ago

Can you share some of the characteristics of your sample(s)? Normally find_intersecting_snps.py does not take so long; however, it can be slow if you (1) have a very high density of SNPs, (2) are using quite a few samples, or (3) have long reads.

The slowness typically happens when long reads overlap a lot of SNPs. The reason is that WASP generates all allelic combinations of reads that overlap SNPs that are polymorphic in your dataset. For example, if a read overlaps 10 biallelic SNPs, then 2^10 = 1024 combinations of alleles must be considered. This can be avoided if you have phased SNPs, in which case you should provide the haplotypes.h5 file as an argument to find_intersecting_snps.py. When haplotypes are provided, WASP only has to consider the combinations of alleles that exist on haplotypes in your dataset, rather than all possible combinations of alleles (many of which do not actually exist in your dataset).
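To make the difference concrete, here is a small toy sketch in Python. It is not WASP's actual code; the function names and the example haplotypes are made up for illustration, and it assumes biallelic SNPs. It just counts how many allele combinations would need to be considered for one read with and without phased haplotypes.

```python
from itertools import product


def all_allele_combinations(snp_alleles):
    """Enumerate every combination of alleles across the SNPs a read overlaps.

    snp_alleles: one (ref, alt) tuple per overlapped SNP (hypothetical input).
    """
    return [tuple(combo) for combo in product(*snp_alleles)]


def observed_haplotype_combinations(haplotypes):
    """Keep only the distinct allele combinations seen on phased haplotypes."""
    return sorted(set(tuple(h) for h in haplotypes))


# A read overlapping 10 biallelic SNPs (toy alleles, assumed for illustration).
snp_alleles = [("A", "G")] * 10
combos = all_allele_combinations(snp_alleles)
print(len(combos))  # 2**10 = 1024

# Toy phased haplotypes over the same 10 SNPs: two individuals, two each.
haplotypes = [
    "AAAAAAAAAA",
    "GGGGGGGGGG",
    "AAAAAGGGGG",
    "AAAAAAAAAA",  # duplicate of the first haplotype
]
observed = observed_haplotype_combinations(haplotypes)
print(len(observed))  # 3 distinct combinations
```

In this toy case the unphased approach considers 1024 combinations for the read, while the phased approach only considers the 3 that are actually present. The gap grows exponentially with the number of SNPs per read, which is why long reads and dense SNPs hurt the most.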

It can also be faster to run find_intersecting_snps.py one sample at a time. This is a good idea if you plan to focus only on allelic imbalance (e.g. for ASE estimation). Unfortunately, if you plan to run the combined haplotype test, you will need to run all the samples together to avoid potential biases in read depths between samples.