EichlerLab / smrtsv2

Structural variant caller
MIT License

RepeatMasker call and engine settings #12

Closed ggstatgen closed 5 years ago

ggstatgen commented 5 years ago

Hi Peter

I'm running the 'call' stage of the pipeline and have hit an exception related to RepeatMasker. The exception points to line 456 of call.snakefile:

 """RepeatMasker -species "{params.species}" """
            """-dir `dirname {output[0]}` """
            """-xsmall -no_is -e wublast """
            """-s -pa {params.threads} """
            """{input}"""

and the error output is the following:

RuleException:
CalledProcessError in line 456 of <CODE>/smrtsv2/rules/call.snakefile:
Command ' set -euo pipefail;  RepeatMasker -dir `dirname call/sv_calls/del/rm/del.fasta.out` -xsmall -no_is -e wublast -s -pa 8 call/sv_calls/del/del.fasta ' returned non-zero exit status 1.
  File "<CODE>/smrtsv2/rules/call.snakefile", line 456, in __rule_call_repeatmask_sv_fasta
  File "<CODE>/smrtsv2/dep/conda/build/envs/python3/lib/python3.6/concurrent/futures/thread.py", line 55, in run

I tried running RepeatMasker directly with the arguments from the rule above, and I wondered whether the issue was the engine you request in the call, -e wublast. My RepeatMasker installation does not appear to have this engine. I remember installing the tool with a couple of engines and wasn't sure why this wasn't one of them. I went back to the download page (RMDownload), which states that "Rights to BLAST 2.0 (WU-BLAST) have been acquired by Advanced Biocomputing, LLC."

Indeed, getting rid of the -e parameter and using the default engine (in my case NCBI/RMBLAST) seems to make it work.
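Concretely, the rule works for me rewritten as below: the same shell fragment with only the -e wublast flag removed, so RepeatMasker falls back to whatever default engine is configured locally.

    """RepeatMasker -species "{params.species}" """
    """-dir `dirname {output[0]}` """
    """-xsmall -no_is """
    """-s -pa {params.threads} """
    """{input}"""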

So I guess my question is: should I try to source WU-BLAST and run with that, or would any other engine do? Thanks!

paudano commented 5 years ago

We were using the WU-BLAST engine, and so it got left in the pipeline. Yes, removing -e wublast is a sensible thing to do. It doesn't affect SV calls; it's just an annotation that was in the original pipeline.

In a future version, I am taking out RepeatMasker and TRF anyway. Variant calling doesn't depend on it, and I think it's an annotation that's better done after SV calling.

ggstatgen commented 5 years ago

Thanks Peter. A couple more questions as I delve deeper into your pipeline, if I may.

  1. What is the rationale for the QUAL scores in the vcf.gz produced by the 'call' stage? There seems to be a [0,100] scale: I am seeing loads of calls with scores in [1,5], a few in [6,40), none in [40,100), and then loads with QUAL=100. What filtering strategy would you recommend, based on the calls used in the paper?

  2. The genotype module. I understand what the genotype.json config file should contain, but I'm less clear about the samples.tab file. I have one sample. Which BAM path do you need here? If I look under the SMRT-SV2 pipeline alignment outputs for my PacBio sample (under /align/bam), I see two BAMs: 0.bam and 1.bam. I imagine this is because I ran that step with --batches 2.

Should I go ahead and merge these two BAMs for the genotyper, or do you require a different alignment file?

paudano commented 5 years ago

1. The QUAL score is based on the contig support and total number of contigs for the region. I don't filter on QUAL (see below), but here's how it works...

Code:

import numpy as np

def calculate_variant_quality(variant):
    try:
        # Phred-scale the fraction of contigs that do NOT support the variant
        # (treated as an error rate), then weight by log(contig depth) so regions
        # with more contigs score more confidently. Capped at 100; full support
        # makes the log10 term infinite, which the cap also absorbs.
        return int(min(100, round(-10 * np.log10(1 - (variant.contig_support / float(variant.contig_depth))) * np.log(variant.contig_depth), 0)))
    except ZeroDivisionError:
        # No contigs aligned to the region, so no confidence in a call.
        return 0

In this function, variant.contig_support is the number of local assemblies with the SV (CONTIG_SUPPORT in the VCF INFO field) and variant.contig_depth is the total number of contigs aligned to the region (CONTIG_DEPTH in the INFO field). It is a PHRED-scaled score of the proportion of supporting contigs, which is further scaled by the total contig depth (so more contigs give a more confident score). A maximum value of 100 is imposed, which is why you are seeing so many of them.
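As a worked example with hypothetical numbers, suppose 3 of 4 aligned contigs carry the SV:

    from types import SimpleNamespace

    # Hypothetical example: 3 of 4 aligned contigs support the SV.
    variant = SimpleNamespace(contig_support=3, contig_depth=4)
    print(calculate_variant_quality(variant))
    # -10 * log10(1 - 3/4) ~= 6.02; times ln(4) ~= 1.39; rounds to QUAL 8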

This code has been there since the original pipeline written by John Huddleston, and I have never come up with a better scheme.

What I do is ignore QUAL and filter on the number of supporting contigs (CONTIG_SUPPORT in the VCF INFO field); I require at least two contigs to support the SV. This removes many false positives, but it does drop some true heterozygous calls.
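As a rough sketch, that filter could look like this with pysam (the file paths are placeholders, and it assumes CONTIG_SUPPORT is defined as a scalar integer in the VCF header):

    import pysam

    # Minimal sketch: keep only SVs supported by at least two local assemblies.
    # Input/output paths are hypothetical.
    vcf_in = pysam.VariantFile("sv_calls.vcf.gz")
    vcf_out = pysam.VariantFile("sv_calls.supported.vcf", "w", header=vcf_in.header)

    for record in vcf_in:
        if record.info.get("CONTIG_SUPPORT", 0) >= 2:
            vcf_out.write(record)

    vcf_in.close()
    vcf_out.close()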

A phased SV caller, or one based only on alignments, tends to be better at picking up these heterozygous calls. However, we gain so much from local assemblies, like the ability to run the genotyper on them. We have known for a long time that assembling reads from both haplotypes together drops heterozygous SVs (http://genome.cshlp.org/lookup/doi/10.1101/gr.214007.116). This is hopefully the last tool I will ever support that does not at least attempt to phase reads before assembly. This is more than you asked for, but it's important information.

2. The BAM file it is asking for is a BAM of Illumina reads for the sample you want to genotype. The genotyper takes the SV calls and contigs from SMRT-SV, aligns the Illumina reads to a version of the reference patched with the SV contigs, and makes genotype calls (hom-ref, het, hom-alt) from those alignments. You could, for example, search for support in 1000 Genomes high-coverage samples (24 samples in phase 3, or 150 samples in Polaris; see https://github.com/Illumina/Polaris/wiki/HiSeqX-Diversity-Cohort).

That file may be a BAM, SAM, or CRAM. If it's a CRAM, filling in the optional REF column is recommended. In all cases, the Illumina reads may be aligned to any reference, or they may be in an unaligned BAM.
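For illustration, a samples.tab entry might look like the sketch below. The column layout is an assumption based on this thread (a sample name, the path to the Illumina reads, and the optional REF column for CRAM input); the paths are placeholders, so check the README for the exact schema:

    SAMPLE    DATA                             REF
    HG00096   /path/to/illumina/HG00096.cram   /path/to/reference.fasta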

Does this help clarify things?

paudano commented 5 years ago

I updated the README with better information about quality filtering. Thanks for the question; this information needed to be documented in the tool.