bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
983 stars 355 forks source link

RFE: Use germline resources in MuTect2 to reduce artifacts in tumor-only mode #2873

Open lbeltrame opened 5 years ago

lbeltrame commented 5 years ago

I thought this would be interesting to have, as described in:

https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_mutect_Mutect2.php

However it needs yet another potential duplicate file which is part of the GATK resource bundle (and we already have gnomAD). I think however that the option might be very useful for tumor-only analyses to remove sources from being called (they are prefiltered) also reducing runtime times.

Thoughts?

lbeltrame commented 5 years ago

The file in question (af-only-gnomad.hg38.vcf.gz) and its b37 counterpart are ~3G in size. I can't tell if it's worth downloading them or try somehow to replicate them off the gnomAD data already available.

roryk commented 5 years ago

Thanks! I think that would be a good improvement. Can you point me to where you downloaded that file?

lbeltrame commented 5 years ago

It's available at ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/Mutect2/ (that is, a specific sublocation of the GATK resource bundle). I'd give the exact link but I can't access the URL right now.

The two files are:

lbeltrame commented 4 years ago

I wonder if it'd be feasible to actually process our gnomAD files to produce something that is palatable for the GATK. That would avoid maintaining yet another resource.

roryk commented 4 years ago

Thanks-- if the Broad's files are kind of small, it might be easier to just use their preprocessed files. Our gnomAD prep takes hours-- @naumenko-sa do you see that as well? I think bcftools annotate in the recipe preparation is probably the culprit.

lbeltrame commented 4 years ago

I haven't had yet the opportunity to test them, I think they're fairly minimal. A quick bcftools view showed this:

chr1    10067   .       T       TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC      30.35   PASS    AC=3;AF=7.384e-05
chr1    10108   .       CAACCCT C       46514.3 PASS    AC=6;AF=0.0001525
chr1    10109   .       AACCCT  A       89837.3 PASS    AC=48;AF=0.001223
chr1    10114   .       TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAAACCCTA  CAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAAACCCTA,T      36729   PASS    AC=9,1;AF=0.0002246,2.496e-05
chr1    10119   .       CT      C       251.23  PASS    AC=5;AF=0.0001249
chr1    10120   .       T       C       14928.7 PASS    AC=10;AF=0.00025
chr1    10128   .       ACCCTAACCCTAACCCTAAC    A       285.71  PASS    AC=3;AF=7.58e-05
chr1    10131   .       CT      C       378.93  PASS    AC=7;AF=0.0001765
chr1    10132   .       TAACCC  T       18025.1 PASS    AC=2;AF=5.049e-05

Hence I guess it uses just AC and AF for each locus.

roryk commented 4 years ago

I wonder if we need to do anything at all-- we have those fields in the gnomAD files so we can probably just feed those in directly, if GATK is parsing the file at all reasonably.

lbeltrame commented 4 years ago

Indeed. I'll try to feed one of the gnomAD files I have and see if it's processed correctly.

roryk commented 4 years ago

Thanks Luca!

lbeltrame commented 4 years ago

Well, it didn't blow up at least. ;) It might be worth including it. @roryk @chapmanb As I believe gnomAD is optional, this should be done only if it's there. What's the cleanest way to check whether it is present?

roryk commented 4 years ago

https://github.com/bcbio/bcbio-nextgen/blob/master/bcbio/variation/vcfanno.py#L154 has a function find either exac or gnomAD, so I'd make a version of that that just checked for gnomad_exome.

hliu2016 commented 4 years ago

@roryk
This looks very useful to filter germline variants. Has the --germline-resource option been integrated into bcbio?

roryk commented 4 years ago

Hello, I think we are leaning to not implementing this. I talked with Brad about it and he had a good practical take. The reasoning behind not implementing it is that if we filter them while we are calling them, the variants completely disappear and cannot be recovered. If we annotate them as being in gnomAD, then later on people can decide what they want to do with them-- whenever we add filtering, eventually folks will complain they are missing variants they are expecting to see. If we annotate then at least they can see why we filtered them, if we filter them, since at some point the variants will exist. This will slow down the mutect2 calls, since it will be caling in places it would have skipped, is the downside.

naumenko-sa commented 3 years ago

better later than never :) we need this resource for t-only mutect2 and purecn

waemm commented 3 years ago

@naumenko-sa, just looking at this, are these commits just adding the af_gnomad file or are you considering implementing the germline resources option for Mutect?

naumenko-sa commented 3 years ago

yes, I'm pushing germline resources in mutect as well.