Open lbeltrame opened 5 years ago
The file in question (af-only-gnomad.hg38.vcf.gz
) and its b37 counterpart are ~3G in size. I can't tell if it's worth downloading them or try somehow to replicate them off the gnomAD data already available.
Thanks! I think that would be a good improvement. Can you point me to where you downloaded that file?
It's available at ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/Mutect2/ (that is, a specific sublocation of the GATK resource bundle). I'd give the exact link but I can't access the URL right now.
The two files are:
I wonder if it'd be feasible to actually process our gnomAD files to produce something that is palatable for the GATK. That would avoid maintaining yet another resource.
Thanks-- if the Broad's files are kind of small, it might be easier to just use their preprocessed files. Our gnomAD prep takes hours-- @naumenko-sa do you see that as well? I think bcftools annotate
in the recipe preparation is probably the culprit.
I haven't had yet the opportunity to test them, I think they're fairly minimal. A quick bcftools view
showed this:
chr1 10067 . T TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC 30.35 PASS AC=3;AF=7.384e-05
chr1 10108 . CAACCCT C 46514.3 PASS AC=6;AF=0.0001525
chr1 10109 . AACCCT A 89837.3 PASS AC=48;AF=0.001223
chr1 10114 . TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAAACCCTA CAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAAACCCTA,T 36729 PASS AC=9,1;AF=0.0002246,2.496e-05
chr1 10119 . CT C 251.23 PASS AC=5;AF=0.0001249
chr1 10120 . T C 14928.7 PASS AC=10;AF=0.00025
chr1 10128 . ACCCTAACCCTAACCCTAAC A 285.71 PASS AC=3;AF=7.58e-05
chr1 10131 . CT C 378.93 PASS AC=7;AF=0.0001765
chr1 10132 . TAACCC T 18025.1 PASS AC=2;AF=5.049e-05
Hence I guess it uses just AC
and AF
for each locus.
I wonder if we need to do anything at all-- we have those fields in the gnomAD files so we can probably just feed those in directly, if GATK is parsing the file at all reasonably.
Indeed. I'll try to feed one of the gnomAD files I have and see if it's processed correctly.
Thanks Luca!
Well, it didn't blow up at least. ;) It might be worth including it. @roryk @chapmanb As I believe gnomAD is optional, this should be done only if it's there. What's the cleanest way to check whether it is present?
https://github.com/bcbio/bcbio-nextgen/blob/master/bcbio/variation/vcfanno.py#L154 has a function find either exac or gnomAD, so I'd make a version of that that just checked for gnomad_exome.
@roryk
This looks very useful to filter germline variants. Has the --germline-resource option been integrated into bcbio?
Hello, I think we are leaning to not implementing this. I talked with Brad about it and he had a good practical take. The reasoning behind not implementing it is that if we filter them while we are calling them, the variants completely disappear and cannot be recovered. If we annotate them as being in gnomAD, then later on people can decide what they want to do with them-- whenever we add filtering, eventually folks will complain they are missing variants they are expecting to see. If we annotate then at least they can see why we filtered them, if we filter them, since at some point the variants will exist. This will slow down the mutect2 calls, since it will be caling in places it would have skipped, is the downside.
better later than never :) we need this resource for t-only mutect2 and purecn
@naumenko-sa, just looking at this, are these commits just adding the af_gnomad file or are you considering implementing the germline resources option for Mutect?
yes, I'm pushing germline resources in mutect as well.
I thought this would be interesting to have, as described in:
https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_mutect_Mutect2.php
However it needs yet another potential duplicate file which is part of the GATK resource bundle (and we already have gnomAD). I think however that the option might be very useful for tumor-only analyses to remove sources from being called (they are prefiltered) also reducing runtime times.
Thoughts?