Illumina / strelka

Strelka2 germline and somatic small variant caller
GNU General Public License v3.0
357 stars 103 forks source link

Multi-sample somatic calling #59

Open michaelnoe opened 6 years ago

michaelnoe commented 6 years ago

Hi! Is there a way to do somatic calling on multiple samples at once? I have multiple samples from different areas in a tumor of a patient. As part of a phylogenetic analysis, I would like to call the presence of a mutation over multiple samples, if that mutation passes the filters at least in one sample. I have tried to first do a somatic calling on the different samples and than give the VCF's in the command through --forceGT and do a combined genotype calling on all the samples. Sometimes, however, the mutation is not found during genotyping, while it passed the filters during somatic calling. When this occurs, i get a 'NoPassedVariantGTs' flag in the filter, and it only calls the reference nucleotide (no info on the other nucleotides). Is there a more elegant way to tackle this problem?

Thanks for your help!

ctsa commented 6 years ago

There is not a good solution for this case. It has always been the intention for both manta and strelka that we would eventually support any number of tumor samples against a single control, but I don't know when we'll be able to make this a development priority.

The recommended protocol for now, is to run strelka somatic analysis on each sample individually, merge the somatic variants from all samples to capture the union of all candidate somatic indels, and then rerun each sample through the somatic analysis workflow, this time supplying the candidate somatic indel file using the --forcedGT option. This sounds similar to what you're describing above, except you're mentioning an issue with the germline calling workflow, so I'm a bit confused.

I will leave this ticket as a marker that we should add a paragraph describing this protocol to the user guide.

michaelnoe commented 6 years ago

Yes, I was unaware I could use --forcedGT also in the somatic calling mode (since i thought the term 'genotyping' (from the GT in --forcedGT) was more used for germline DNA. I used it with the germline-calling mode on multiple samples at once. I tried it in the somatic calling mode and this worked! Thank you very much!

sckinta commented 5 years ago

@ctsa I tried to follow your suggestion https://github.com/Illumina/strelka/issues/59 to use --forcedGT to re-genotyping tumor sample on merged calls. I check the log, it reports "Workflow successfully completed all tasks". However, no new vcf files were generated in results/variants.

Here are what I did

  1. merge SNV and indel vcf to all.vcf in each previous pairwise somatic call.

    bcftools concat -O v -a -D $sample/results/variants/somatic.snvs.vcf.gz $sample/results/variants/somatic.indels.vcf.gz > $sample/results/variants/somatic.all.vcf
    bgzip $sample/results/variants/somatic.all.vcf
    tabix -p vcf $sample/results/variants/somatic.all.vcf.gz
  2. merge all.vcf from all tumor samples

    vcf-merge sample1/results/variants/somatic.all.vcf.gz sample2/results/variants/somatic.all.vcf.gz sample3/results/variants/somatic.all.vcf.gz sample4/results/variants/somatic.all.vcf.gz | bgzip -c > allSample.somatic.all.vcf.gz
  3. filter allSample.somatic.all.vcf with PASS

    bcftools view -f "PASS" -O v allSample.somatic.all.vcf.gz > allSample_pass.somatic.all.vcf
    bgzip allSample_pass.somatic.all.vcf
    tabix -p vcf allSample_pass.somatic.all.vcf.gz
  4. get indel only from allSample_pass.somatic.all.vcf

    
    bcftools view -V snps -O v allSample_pass.somatic.all.vcf.gz > allSample_pass.somatic.indel.vcf
    bgzip allSample_pass.somatic.indel.vcf
    tabix -p vcf allSample_pass.somatic.indel.vcf.gz
5. run somatic call again (re-genotype) on allSample_pass.somatic.all.vcf using tumor-norm pairwise samples.

module load python/2.7.14

${STRELKA_INSTALL_PATH}/bin/configureStrelkaSomaticWorkflow.py \ --normalBam $normal_bam \ --tumorBam $tumor_bam \ --referenceFasta referenceSequences/hg19/hg19.fa \ --indelCandidates ${MANTA_ANALYSIS_PATH}/results/variants/candidateSmallIndels.vcf.gz \ --indelCandidates allSample_pass.somatic.indels.vcf.gz \ --forcedGT allSample_pass.somatic.all.vcf.gz \ --callRegions referenceSequences/hg19/hg19.chrom.bed.gz \ --outputCallableRegions \ --reportEVSFeatures \ --runDir ${STRELKA_ANALYSIS_PATH}

python ${STRELKA_ANALYSIS_PATH}/runWorkflow.py -m local -j 8



Could you help?
peter-yufan-zeng commented 4 years ago

@ctsa are there any updates in Strelka to allow multiregion calling?

wangshun1121 commented 3 years ago

@ctsa I tried to follow your suggestion #59 to use --forcedGT to re-genotyping tumor sample on merged calls. I check the log, it reports "Workflow successfully completed all tasks". However, no new vcf files were generated in results/variants.

Here are what I did

  1. merge SNV and indel vcf to all.vcf in each previous pairwise somatic call.
bcftools concat -O v -a -D $sample/results/variants/somatic.snvs.vcf.gz $sample/results/variants/somatic.indels.vcf.gz > $sample/results/variants/somatic.all.vcf
bgzip $sample/results/variants/somatic.all.vcf
tabix -p vcf $sample/results/variants/somatic.all.vcf.gz
  1. merge all.vcf from all tumor samples
vcf-merge sample1/results/variants/somatic.all.vcf.gz sample2/results/variants/somatic.all.vcf.gz sample3/results/variants/somatic.all.vcf.gz sample4/results/variants/somatic.all.vcf.gz | bgzip -c > allSample.somatic.all.vcf.gz
  1. filter allSample.somatic.all.vcf with PASS
bcftools view -f "PASS" -O v allSample.somatic.all.vcf.gz > allSample_pass.somatic.all.vcf
bgzip allSample_pass.somatic.all.vcf
tabix -p vcf allSample_pass.somatic.all.vcf.gz
  1. get indel only from allSample_pass.somatic.all.vcf
bcftools view -V snps -O v allSample_pass.somatic.all.vcf.gz > allSample_pass.somatic.indel.vcf
bgzip allSample_pass.somatic.indel.vcf
tabix -p vcf allSample_pass.somatic.indel.vcf.gz
  1. run somatic call again (re-genotype) on allSample_pass.somatic.all.vcf using tumor-norm pairwise samples.

module load python/2.7.14 

${STRELKA_INSTALL_PATH}/bin/configureStrelkaSomaticWorkflow.py \
        --normalBam $normal_bam \
        --tumorBam $tumor_bam \
        --referenceFasta referenceSequences/hg19/hg19.fa \
        --indelCandidates ${MANTA_ANALYSIS_PATH}/results/variants/candidateSmallIndels.vcf.gz \
        --indelCandidates allSample_pass.somatic.indels.vcf.gz \
        --forcedGT allSample_pass.somatic.all.vcf.gz \
        --callRegions referenceSequences/hg19/hg19.chrom.bed.gz \
        --outputCallableRegions \
        --reportEVSFeatures \
        --runDir ${STRELKA_ANALYSIS_PATH}

python ${STRELKA_ANALYSIS_PATH}/runWorkflow.py -m local -j 8

Could you help?

This is what I need in tumor sub-clone study!