broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

Allow CreateSomaticPanelOfNormals to pass --sites-only-vcf-output=false for mapping bias calculations #5649

Closed andrewrech closed 5 years ago

andrewrech commented 5 years ago

Feature request

Tool(s) or class(es) involved

CreateSomaticPanelOfNormals

Description

Currently, CreateSomaticPanelOfNormals emits sites-only VCFs. Some downstream tools require full VCFs, as could be created previously in the PON CombineVariants workflow.

Perhaps this will be covered when CombineVariants becomes available, but it would still be useful if CreateSomaticPanelOfNormals could pass --sites-only-vcf-output=false so that full VCFs can be returned.

This would permit calculation of mapping bias using allele frequencies of the normal samples.

Thank you for your tremendous service developing this tool.

Sincerely,

Andrew

lima1 commented 5 years ago

To provide some more background: the idea is to produce output like that of CollectAllelicCounts for a pool of normals, so that we can correct allelic biases in tumor-only analyses. Would it be possible to extend CreateSomaticPanelOfNormals to cover the CollectAllelicCounts "special case"?

@samuelklee @davidbenjamin.

davidbenjamin commented 5 years ago

@andrewrech @lima1 Is it absolutely necessary to retain the full per-sample information, or would it be sufficient to add an INFO field (or several) with some sort of summary statistics? For example, I'm working on an improved Mutect2 panel of normals that emits the fraction of samples in which the artifact was called and the estimated beta distribution of the artifact allele frequency among samples containing the artifact. Would this or something related meet your needs?
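As a rough illustration of the kind of per-site summary described here, the sketch below computes the fraction of normal samples showing the alt allele and a method-of-moments beta fit to their allele fractions. This is not the GATK implementation: the function name `summarize_site`, the `min_alt_fraction` rule for deciding that a sample "contains the artifact", and the moment-matching fit are all assumptions made for illustration.

```python
from statistics import mean, variance


def summarize_site(counts, min_alt_fraction=0.01):
    """Summarize one site across a panel of normals.

    counts: list of (alt_reads, ref_reads) tuples, one per normal sample.
    min_alt_fraction: hypothetical rule for "the artifact was called".
    Returns (fraction_of_samples, alpha, beta); alpha and beta are None
    when a moment-matching fit is not possible.
    """
    fractions = []
    for alt, ref in counts:
        depth = alt + ref
        if depth > 0 and alt / depth >= min_alt_fraction:
            fractions.append(alt / depth)

    fraction_of_samples = len(fractions) / len(counts) if counts else 0.0

    # Moment matching needs at least two samples and a usable variance.
    if len(fractions) < 2:
        return fraction_of_samples, None, None
    m, v = mean(fractions), variance(fractions)
    if v <= 0 or v >= m * (1 - m):
        return fraction_of_samples, None, None

    # Beta(alpha, beta) with matching mean m and variance v.
    common = m * (1 - m) / v - 1
    return fraction_of_samples, m * common, (1 - m) * common
```

Calling `summarize_site` with one `(alt, ref)` pair per normal sample returns the sample fraction plus the two fitted shape parameters, which is the sort of compact summary an INFO field could carry in place of full per-sample genotypes.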

lima1 commented 5 years ago

Great, yes, summary would be sufficient. I currently extract the total number of alt and ref reads and the number of samples out of the old Mutect --normal_panel. The beta distribution would be great too.
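For concreteness, the pooled quantities mentioned above (total alt reads, total ref reads, number of samples) could be gathered roughly as follows. This is a minimal sketch with hypothetical names (`pool_counts`, `mapping_bias`); the bias convention used, observed pooled alt fraction divided by the 0.5 expected at a heterozygous germline site, is an assumption and is not stated in this thread.

```python
def pool_counts(counts):
    """Pool read counts over the samples that carry the alt allele.

    counts: list of (alt_reads, ref_reads) tuples, one per normal sample.
    Returns total alt reads, total ref reads, the number of contributing
    samples, and the pooled alt fraction (None if there are no reads).
    """
    carriers = [(alt, ref) for alt, ref in counts if alt > 0]
    total_alt = sum(alt for alt, _ in carriers)
    total_ref = sum(ref for _, ref in carriers)
    depth = total_alt + total_ref
    return {
        "total_alt": total_alt,
        "total_ref": total_ref,
        "n_samples": len(carriers),
        "pooled_fraction": total_alt / depth if depth > 0 else None,
    }


def mapping_bias(pooled_fraction, expected=0.5):
    """Assumed convention: observed alt fraction relative to the expected one."""
    return pooled_fraction / expected
```

A value below 1 from `mapping_bias` would indicate that reads supporting the alt allele are under-represented at that site, under the assumption stated above.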

davidbenjamin commented 5 years ago

This is now in PR: #5675. FilterMutectCalls is not yet hooked up to exploit any of this new information, but we will be testing ideas for that soon.

andrewrech commented 5 years ago

Amazing @davidbenjamin, thanks for this fast work!

davidbenjamin commented 5 years ago

Closed, I think, by #5675, but please let me know if any other outputs would be useful.

andrewrech commented 5 years ago

Will do, thank you again

lima1 commented 5 years ago

@davidbenjamin, thanks again, finally had time to check this out.

The fit for the beta binomial includes homozygous germline variants when present, right?

Would it be possible to specify a filter to exclude, say, allelic fractions > 0.9? Ideally I would want the homozygous samples counted in the FRACTION field. Or would you recommend filtering for this upstream of CreateSomaticPanelOfNormals?
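To make the question concrete, an upstream filter of this kind might look like the sketch below. It is only an illustration of the idea: `split_by_allele_fraction` and its `max_fraction` parameter are hypothetical and do not correspond to any existing GATK option.

```python
def split_by_allele_fraction(counts, max_fraction=0.9):
    """Separate samples that look homozygous from the rest.

    counts: list of (alt_reads, ref_reads) tuples, one per normal sample.
    Samples with alt fraction above max_fraction go to `excluded`;
    everything else, including samples with no reads, goes to `kept`.
    """
    kept, excluded = [], []
    for alt, ref in counts:
        depth = alt + ref
        fraction = alt / depth if depth > 0 else 0.0
        if fraction > max_fraction:
            excluded.append((alt, ref))
        else:
            kept.append((alt, ref))
    return kept, excluded
```

Only the `kept` samples would then be passed on to any allele-fraction fit, while the size of `excluded` could still be reported separately.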

davidbenjamin commented 5 years ago

@lima1 The beta binomial fit ignores germline variation. That is, if you have a variant that shows up sometimes as an artifact and sometimes as a germline variant, the tool fits only the allele fractions of the samples where it seems to be an artifact.

The FRACTION field excludes germline variation. This is intentional, because the --germline-resource argument is a much more powerful tool for germline filtering than a panel of normals.