broadinstitute / gatk-sv

A structural variation pipeline for short-read sequencing
BSD 3-Clause "New" or "Revised" License

How can I apply downstream filtering steps to the cleaned VCF to further control the false discovery rate? #615

Open elitefugua opened 8 months ago

elitefugua commented 8 months ago

How can I apply downstream filtering steps (minGQ filtering, FilterOutlierSamples, BatchEffect, and FilterCleanupQualRecalibration) to the cleaned VCF to further control the false discovery rate?

mwalker174 commented 1 month ago

MinGQ filtering is being deprecated in favor of the FilterGenotypes workflow. We are still working on documentation but here is an outline:

Main steps:

1. JoinRawCalls
2. SVConcordance
3. FilterGenotypes

Instructions:

Generate inputs and run JoinRawCalls from the main branch to produce a clustered VCF of raw calls for the entire cohort.

Note: the input VCFs here use svtk-style END fields, which need to be modified no matter what version of the pipeline was used, so make sure FormatVcfForGatk.formatter_args is “--fix-end”
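In a Cromwell-style inputs JSON, this setting might look like the following (a sketch; the exact workflow key prefix should be checked against your inputs template):

```json
{
  "FormatVcfForGatk.formatter_args": "--fix-end"
}
```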

If your “cleaned” VCF was generated prior to v0.22-beta, you must run FormatVcfForGatk to produce a GATK-formatted cleaned VCF. If you used a later version of the pipeline, you can skip this step.

IMPORTANT: The "formatter_args" parameter may need to be modified depending on exactly what version of the pipeline was used.

If your cleaned VCF has invalid END tags, you must include the "--fix-end" flag. You can check by looking for records whose END field is less than POS (e.g. BND records).
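One way to check is a short script like this (a sketch, not part of the pipeline; it assumes an uncompressed VCF and reads END from the INFO field):

```python
import re

def records_with_end_before_pos(vcf_lines):
    """Return (CHROM, POS, ID) tuples for records whose INFO END is < POS."""
    bad = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, vid, info = fields[0], int(fields[1]), fields[2], fields[7]
        m = re.search(r"(?:^|;)END=(\d+)", info)
        if m and int(m.group(1)) < pos:
            bad.append((chrom, pos, vid))
    return bad

# Example with two inline records; the BND has svtk-style END < POS:
vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1000\tDEL_1\tN\t<DEL>\t.\t.\tEND=2000;SVTYPE=DEL",
    "chr1\t5000\tBND_1\tN\t<BND>\t.\t.\tEND=4000;SVTYPE=BND",
]
print(records_with_end_before_pos(vcf))  # [('chr1', 5000, 'BND_1')]
```

If this returns any records, include "--fix-end" in formatter_args.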

If your cleaned VCF has GQ values scaled to 0-999, you must use the "--scale-down-gq" flag to rescale them to 0-99. If you're unsure, you should be able to tell by inspecting the GQ values in the VCF.
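Inspecting the maximum FORMAT/GQ value is one quick way to tell (a sketch, assuming an uncompressed VCF; a maximum well above 99 suggests 0-999 scaling):

```python
def max_gq(vcf_lines):
    """Scan FORMAT/GQ across all records and samples; return the maximum."""
    best = None
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        fmt = fields[8].split(":")
        if "GQ" not in fmt:
            continue
        gq_idx = fmt.index("GQ")
        for sample in fields[9:]:
            entries = sample.split(":")
            if gq_idx < len(entries) and entries[gq_idx].isdigit():
                gq = int(entries[gq_idx])
                best = gq if best is None else max(best, gq)
    return best

vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "chr1\t100\tDEL_1\tN\t<DEL>\t.\t.\tEND=200\tGT:GQ\t0/1:999\t0/0:31",
]
print(max_gq(vcf))  # 999 -> GQs look 0-999 scaled; use --scale-down-gq
```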

Run SVConcordance with the GATK-formatted cleaned VCF as "eval_vcf" and JoinRawCalls VCF as "truth_vcf".
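The "eval_vcf" and "truth_vcf" parameter names are from the workflow; in an inputs JSON this might look like the following (file names are illustrative, and the workflow key prefix should be checked against your inputs template):

```json
{
  "SVConcordance.eval_vcf": "cohort.cleaned.gatk_formatted.vcf.gz",
  "SVConcordance.truth_vcf": "cohort.join_raw_calls.vcf.gz"
}
```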

Run FilterGenotypes with the SVConcordance VCF as input.

The “ploidy_table” is generated during JoinRawCalls but wasn’t included as an output for the v0.28.4-beta release. If you kept intermediate files from JoinRawCalls, you can use the output from the CreatePloidyTableFromPed task. Otherwise, you will need to generate the ploidy table manually using this script: https://github.com/broadinstitute/gatk-sv/blob/main/src/sv-pipeline/scripts/ploidy_table_from_ped.py
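As a rough illustration of what that script computes (a sketch only, not the linked script: it assumes a tab-delimited PED with sex in column 5, 1=male / 2=female, and a TSV of per-sample, per-contig ploidies; the real script's handling of edge cases may differ):

```python
def ploidy_rows(ped_lines, contigs):
    """Build per-sample ploidy rows from PED lines.

    Autosomes get ploidy 2; chrX is 1 for males and 2 for females; chrY is
    1 for males and 0 for females. Unknown sex (0) falls through to the
    female values here, which may not match the real script.
    """
    rows = [["SAMPLE"] + contigs]
    for line in ped_lines:
        fam, sample, father, mother, sex = line.split("\t")[:5]
        ploidies = []
        for contig in contigs:
            if contig == "chrX":
                ploidies.append(1 if sex == "1" else 2)
            elif contig == "chrY":
                ploidies.append(1 if sex == "1" else 0)
            else:
                ploidies.append(2)
        rows.append([sample] + ploidies)
    return rows

ped = ["FAM1\tNA12878\t0\t0\t2\t1", "FAM1\tNA12877\t0\t0\t1\t1"]
for row in ploidy_rows(ped, ["chr1", "chrX", "chrY"]):
    print("\t".join(str(x) for x in row))
```

Prefer the linked ploidy_table_from_ped.py script for real runs; this is only to show the shape of the table.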

The recommended default for "no_call_rate_cutoff" is 0.05, but it can be adjusted.

Modify "sl_filter_args":

High specificity (~5% de novo rate, DNR): "--small-del-threshold 93 --medium-del-threshold 85 --small-dup-threshold -51 --medium-dup-threshold -4 --ins-threshold -13"

High sensitivity (~15% DNR): "--small-del-threshold -53 --medium-del-threshold -12 --small-dup-threshold -105 --medium-dup-threshold -81 --ins-threshold -97"

Note these thresholds were generated for the first AoU-SV release, which had relatively little technical noise. You may want to iterate on these thresholds for your dataset until results are satisfactory.
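Together with the no-call rate cutoff, these might be set in an inputs JSON like so (shown with the high-specificity thresholds; the "FilterGenotypes." key prefix is an assumption, so match it to your inputs template):

```json
{
  "FilterGenotypes.no_call_rate_cutoff": 0.05,
  "FilterGenotypes.sl_filter_args": "--small-del-threshold 93 --medium-del-threshold 85 --small-dup-threshold -51 --medium-dup-threshold -4 --ins-threshold -13"
}
```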

Review the MainVcfQc plots generated with this workflow and compare them to the “Structural Variants QC Results” section of the AoU-SV quality report: https://support.researchallofus.org/hc/en-us/articles/4617899955092-All-of-Us-Genomic-Quality-Report.

We recommend using de novo rates as a proxy for your FDR. Plots of this are generated in MainVcfQc if you have at least one trio.

Typical counts are 7,000-11,000 total (non-BND) SVs per genome.
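To compare your call set against that range, you can count non-reference, non-BND records per sample (a sketch, assuming an uncompressed VCF with GT first in FORMAT):

```python
def sv_counts_per_sample(vcf_lines):
    """Count non-BND records carried (non-ref GT) by each sample."""
    counts, samples = {}, []
    for line in vcf_lines:
        if line.startswith("##"):
            continue
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]
            counts = {s: 0 for s in samples}
            continue
        if "SVTYPE=BND" in fields[7]:
            continue
        for sample, entry in zip(samples, fields[9:]):
            gt = entry.split(":")[0]
            if any(a not in ("0", ".") for a in gt.replace("|", "/").split("/")):
                counts[sample] += 1
    return counts

vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "chr1\t100\tDEL_1\tN\t<DEL>\t.\t.\tSVTYPE=DEL\tGT\t0/1\t0/0",
    "chr1\t900\tBND_1\tN\t<BND>\t.\t.\tSVTYPE=BND\tGT\t0/1\t0/1",
    "chr2\t500\tDUP_1\tN\t<DUP>\t.\t.\tSVTYPE=DUP\tGT\t1/1\t0/1",
]
print(sv_counts_per_sample(vcf))  # {'S1': 2, 'S2': 1}
```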

Higher thresholds will increase precision.

For larger call sets, you can increase sharding by lowering RecalibrateGq.recalibrate_records_per_shard and/or filter_vcf_records_per_shard, depending on which parts are slow. Both default to 20,000. It is recommended to keep the total number of shards under 1,000.
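Choosing a records-per-shard value that keeps the shard count under ~1,000 is simple arithmetic:

```python
def shard_count(total_records, records_per_shard):
    """Number of shards produced: ceil(total / per_shard)."""
    return -(-total_records // records_per_shard)

# With the default 20,000 records per shard, a 10M-record call set yields
# 500 shards (under the recommended 1,000); 5,000 per shard would give 2,000.
print(shard_count(10_000_000, 20_000))  # 500
print(shard_count(10_000_000, 5_000))   # 2000
```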