bmvdgeijn / WASP

WASP: allele-specific pipeline for unbiased read mapping and molecular QTL discovery
Apache License 2.0
102 stars 51 forks source link

find_intersecting_snps.py taking longer than expected? #94

Open rwomersley opened 4 years ago

rwomersley commented 4 years ago

Hi there,

I am currently running the find_intersecting_snps.py script from step 3 of the mapping pipeline, but it is taking a very long time to run - currently outputting the bam/fq files only up to chromosome 2 after 3 days of running (and for one of over 60 samples). I was wondering if anybody else was having a similar issue, how long does this step usually take? Perhaps I am not putting the correct inputs into the function.

My script is below;

python ./WASP-master/mapping/find_intersecting_snps.py \
             --is-sorted \
             --output_dir find_intersecting_snps \
             --snp_tab snp_tab.h5 \
             --snp_index snp_index.h5 \
             --haplotype haplotypes.h5\
             ./map1/$file.sort.bam 

I generated my snp tab, index and haplotype files as recommended in earlier steps (snp2h5), using 1000G vcf data.

Grateful for any advice you could give! Thanks in advance.

gmcvicker commented 4 years ago

Hi, the pipeline can be slow, especially if you have a large panel of samples and long reads. The reason is that when you have more samples there are more heterozygous sites that must be considered, and when you have long reads they tend to overlap more SNPs. Many more reads with flipped alleles must then be generated by find_intersect_snps.py for remapping.

In general I don't recommend running WASP for more than about 40-50 samples because it will be slow (although it is possible and people sometimes do it). With a sufficient number of samples (e.g. 50+) a standard QTL mapping approach will typically have sufficient power and will be much faster.

If your interest is only in correcting mapping bias (using the mapping pipeline), then it will be faster to run the mapping pipeline separately for each sample. In this way it only needs to consider the heterozygous sites that are present within that sample. We have also been developing a stand-alone allele-specific analysis pipeline that uses this approach (let me know if you are interested). The downside of using one sample at a time, is that the processed reads are no longer suitable for a read-depth based QTL analysis or the combined haplotype test (because the read filtering can differ between samples depending on whether a heterozygous site is present).

rwomersley commented 4 years ago

Hi Graham,

Many thanks for your reply and your advice. The majority of my issues are because I am a student on a 6 month project with only a 72 hour limit per job, so each time the find_intersecting_snps scripts runs past this it times out and I must begin again. I am already running each sample in an array job under this step, but unsure what else I could do. Alternatively, can you anticipate later issues if I were to process each chromosome separately? I also wanted to ask for your advice on modifying the script to pick-up where it left off each time I must rerun it. Would you recommend this?

My interest is in allele-specific mapping and using the CHT, although I have been running the mapping pipeline on each sample individually. I would be most grateful for details on your stand-alone allele-specific analysis pipeline too, thank you. Thank you again for your time and advice.

Best,

Rebecca

On 22 May 2020, at 23:08, Graham McVicker notifications@github.com wrote:

Hi, the pipeline can be slow, especially if you have a large panel of samples and long reads. The reason is that when you have more samples there are more heterozygous sites that must be considered, and when you have long reads they tend to overlap more SNPs. Many more reads with flipped alleles must then be generated by find_intersect_snps.py for remapping.

In general I don't recommend running WASP for more than about 40-50 samples because it will be slow (although it is possible and people sometimes do it). With a sufficient number of samples (e.g. 50+) a standard QTL mapping approach will typically have sufficient power and will be much faster.

If your interest is only in correcting mapping bias (using the mapping pipeline), then it will be faster to run the mapping pipeline separately for each sample. In this way it only needs to consider the heterozygous sites that are present within that sample. We have also been developing a stand-alone allele-specific analysis pipeline that uses this approach (let me know if you are interested). The downside of using one sample at a time, is that the processed reads are no longer suitable for a read-depth based QTL analysis or the combined haplotype test (because the read filtering can differ between samples depending on whether a heterozygous site is present).

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/bmvdgeijn/WASP/issues/94#issuecomment-632927659, or unsubscribe https://github.com/notifications/unsubscribe-auth/AOVIFLF26RBSZ7RSJBAYL2TRS3ZUVANCNFSM4MS65WKA.

gmcvicker commented 4 years ago

Hi Rebecca, 1) If you are running a single sample at a time, then specifying ONLY that sample using the --samples argument should make things a lot faster (if you are not already doing this). The reason is that it will then use only the sites that are heterozygous in that sample. However if you use this approach then the results from the CHT may be somewhat biased (this is because one part of the CHT uses read depth, and running one samples by having the read filtering be different due to presence/absence of het sites in different samples can change the depth).

2) Running the find_intersecting_snps.py separately for different chromosomes should be fine.

3) Working with phased haplotypes (using the --haplotype argument) can also greatly speed up find_intersecting_snps, as then it does not need to consider all combinations of alleles, just combinations of haplotypes that are actually present in the samples.

4) The simplified allele-specific pipeline (that just gets AS counts and does not include the CHT) is here: https://github.com/aryam7/as_analysis

Hope these suggestions help,

Graham