bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
994 stars 353 forks source link

Recommendations for tumor-only variant calling #3504

Closed kokyriakidis closed 3 years ago

kokyriakidis commented 3 years ago

Hi everyone,

I have some Tumor only Illumina data in my possession and I want to call both Somatic and Germline variants. My usecase is to detect ALL Somatic and Germline variants of a Tumor only exome sample.

I have some questions:

  1. Does Mutect2 detect BOTH Somatic and Germline and just tag Somatic variants with "SOMATIC" in the end? Or do I have to run also HaplotypeCaller to detect all Germline Variants from my Tumor-only exome sample?

  2. I want to achieve the highest RECALL. This is my goal and not Precision. Are there any options I have to use to disable any filtering done to the final set of variants?

  3. If I follow the Somatic variant calling user story what should I specify in the yaml file to call both Germline and Somatic variants?

    variantcaller:
    somatic: [vardict, strelka2, mutect2]
    germline: [freebayes, gatk-haplotype, strelka2]
    ensemble:
    numpass: 2

    or

    variantcaller: [mutect2, vardict]

    Does the first configuration (specifying both somatic and germline callers) work fow tumor only? In the user story it says "For tumor/normal somatic samples, bcbio can call both somatic (tumor-specific) and germline (pre-existing) variants." but I want to clarify it.

EDIT: I get the message "ValueError: MuTect, MuTect2 and Strelka2 somatic calling requires normal sample or panel" when I try to run the following config yaml with tumor only bam. So I guess Mutect does not support tumor only without panel.

---
details:
- algorithm:
    aligner: false
    mark_duplicates: false
    maxcov_downsample: false
    min_allele_fraction: 1
    realign: false
    recalibrate: false
    remove_lcr: false
    effects: snpeff
    tools_off: [collectsequencingartifacts, tumoronly-prioritization, vardict_somatic_filter, contamination]
    tools_on: [noalt_calling] 
    background: /media/kokyriakidis/PRODUCTION/Mutect2-exome-panel.vcf
    variantcaller:
    - mutect2
    - vardict 
    - freebayes
    - gatk-haplotype
    use_lowfreq_filter: false
    ensemble:
      numpass: 1
      use_filtered: true
    variant_regions: /media/kokyriakidis/PRODUCTION/B/config/Sureselect.v6.refseq.ensembl.noUTR.notfound.targetbp.hg19.bed 
  analysis: variant2
  files:
    - /media/kokyriakidis/PRODUCTION/B/input/B.bam
  genome_build: hg19
  description: B
  metadata:
    batch: Bs
    phenotype: tumor
upload:
  dir: ../final

Thank you for your time, KK

lbeltrame commented 3 years ago

bcbio has some heuristics to exclude somatic calls from germline in the absence of a non-tumor sample, basing on known population frequencies. You can look at the EPR INFO field that's added to the final VCF and filter everything that does not contain pass (contain, not equal as sometimes there are entries like pass;clinvar when the variant is in some useful database).

MuTect 2 does support not using a panel (not recommended) but bcbio is not allowing it. I have some half-finished code to submit a PR to do so, but I haven't been able to do much testing.

As for the rest:

  1. MuTect tags variants as SOMATIC so you can distinguish likely germline calls. VarDict also has INFO tags specifying the "strength" of a call ,from StrongSomatic to LikelyGermline.
  2. To reduce filtering, set min_allele_fraction to 0 in your configuration, so that no allelic fraction filtering takes place. The unfiltered calls are in the final VCFs and also in the per-sample outputs in the final folder after a bcbio run (these have the -germline suffix to indicate likely germline calls, but manual review is suggested)

I would suggest MuTect2 and VarDict as callers, those have performed reasonably well in our testing.

kokyriakidis commented 3 years ago

These are the preliminary results using MAF 0.01 (1%) and no filters:

HaplotypeCaller: 75,727 variants detected Mutect2: 62,190 variants detected Freebayes: 76,378 variants detected Vardict: 436,506 variants detected ~200.000 where PASS calls

Ensemble calls (each variant in at least 1 caller): 447,550 variants detected [gatk-haplotypecaller 919 unique variants freebayes 3969 unique variants mutect2 141 unique variants]

Ensemble calls (each variant in at least 1 caller) using PASS filter: ~2,200,000 variants detected

Do you have any idea why Vardict called so many variants?

lbeltrame commented 3 years ago

I've noticed in our own analyses too that VarDict calls a lot of variants. However it is not much of a problem for us because we follow up with fairly strict filtering. You might also want to consider the ensemble options of bcbio (see the documentation).

naumenko-sa commented 3 years ago

The whole point of Vardict was to make a very sensitive caller: https://github.com/AstraZeneca-NGS/VarDictJava. The idea was to 'call everything', filter later, for example with https://www.solvebio.com/

halimaach commented 3 years ago

Hi @kokyriakidis,

How did you use Mutect2 with Ensemble? Did you use a panel of normal?

Thanks

naumenko-sa commented 3 years ago

@halimaach He used a PON: background: /media/kokyriakidis/PRODUCTION/Mutect2-exome-panel.vcf

Thanks everyone for the great discussion!

SN