broadinstitute / gatk-sv

A structural variation pipeline for short-read sequencing
BSD 3-Clause "New" or "Revised" License

gnomAD AC annotation #363

Open alsanju opened 2 years ago

alsanju commented 2 years ago

Feature request

Module(s) or script(s) involved

AnnotateVcf Module

Description

Could we please add gnomAD allele counts in addition to the gnomAD allele frequency?

alsanju commented 1 year ago

Following up on this request - we would need gnomAD allele counts and allele numbers (just for the total gnomAD cohort, no need by population), since for analysts these are easier to comprehend than a low allele frequency.

Would it also be possible to get homozygote counts?

alsanju commented 1 year ago

I worked around this by reformatting the ref_bed file used in the AnnotateExternalAF task so that it looks like this:

#chrom  start   end name    svtype  SVTYPE  SVLEN   AF  AC_AF   AN_AF   MALE_AF FEMALE_AF   AFR_AF  AMR_AF  EAS_AF  EUR_AF
chr1    10641   10642   gnomAD-SV_v2.1_BND_1_1  BND BND -1  0.00678599998354912 145 21366   0.00634999992325902 0.00726999994367361 0.00766099989414215 0.00366499996744096 0.013501999899745   0.00422499980777502
chr1    20999   26000   gnomAD-SV_v2.1_DEL_1_1  DEL DEL 5000    0.0160729996860027  138 8586    0.0160390008240938  0.0159179996699095  0.00726199988275766 0.0116959996521473  0.0719999969005585  0.0143449995666742
chr1    39999   47200   gnomAD-SV_v2.1_DUP_1_1  DUP DUP 7200    0.068962998688221   943 13674   0.0725499987602234  0.0652879998087883  0.135693997144699   0.0228760000318289  0.011009999550879   0.00784599967300892
.....
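
The header change itself can be sketched as below. This is only an illustration: it assumes the original ref_bed header had columns named AC and AN (the column names before reformatting are not shown in this thread), and the function name is hypothetical.

```python
def add_af_suffix(header_line, columns=("AC", "AN")):
    """Append '_AF' to the named columns of a tab-separated BED header line.

    The default column names are an assumption about the original ref_bed.
    """
    names = header_line.rstrip("\n").split("\t")
    renamed = [f"{name}_AF" if name in columns else name for name in names]
    return "\t".join(renamed)


# Example header before reformatting (assumed layout).
header = "#chrom\tstart\tend\tname\tsvtype\tSVTYPE\tSVLEN\tAF\tAC\tAN\tMALE_AF"
new_header = add_af_suffix(header)
print(new_header)
```

Only the header line needs to change; the data rows stay as they are.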

This works because the annotation script uses the AF and _AF fields in the reference file. However, the new fields then appear in the VCF header with an _AF suffix and an allele frequency description:

##INFO=<ID=gnomAD_V2_AF,Number=1,Type=Float,Description="Allele frequency (for biallelic sites) or copy-state frequency (for multiallelic sites) of an overlapping event in gnomad.">
##INFO=<ID=gnomAD_V2_AC_AF,Number=1,Type=Float,Description="Allele frequency (for biallelic sites) or copy-state frequency (for multiallelic sites) of an overlapping event in gnomad.">
##INFO=<ID=gnomAD_V2_AN_AF,Number=1,Type=Float,Description="Allele frequency (for biallelic sites) or copy-state frequency (for multiallelic sites) of an overlapping event in gnomad.">

Would it be possible to fix the AC and AN fields so that they appear as gnomAD_V2_AC and gnomAD_V2_AN in the header?
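
Until then, a possible stopgap is to rename the fields in the annotated VCF as a post-processing step. The sketch below is not part of gatk-sv: the target IDs, the Integer type, and the new descriptions are assumptions about what the corrected header could look like, and it treats the VCF as plain text rather than using a VCF library.

```python
import re

# Hypothetical mapping from the mislabelled INFO IDs to the requested ones.
RENAME = {
    "gnomAD_V2_AC_AF": ("gnomAD_V2_AC", "Allele count of an overlapping event in gnomAD."),
    "gnomAD_V2_AN_AF": ("gnomAD_V2_AN", "Allele number of an overlapping event in gnomAD."),
}


def fix_header_line(line):
    """Rewrite an ##INFO header line if its ID is in RENAME."""
    match = re.match(r"##INFO=<ID=([^,]+),", line)
    if not match or match.group(1) not in RENAME:
        return line
    new_id, desc = RENAME[match.group(1)]
    return f'##INFO=<ID={new_id},Number=1,Type=Integer,Description="{desc}">'


def fix_info_field(info):
    """Rename keys in a record's INFO column; write whole numbers as integers."""
    entries = []
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        if key in RENAME and sep:
            try:
                number = float(value)
                if number.is_integer():
                    value = str(int(number))
            except ValueError:
                pass  # leave non-numeric values such as "." untouched
            key = RENAME[key][0]
        entries.append(f"{key}{sep}{value}")
    return ";".join(entries)


def fix_vcf_line(line):
    """Apply the header or INFO fix to a single VCF line."""
    line = line.rstrip("\n")
    if line.startswith("##"):
        return fix_header_line(line)
    if line.startswith("#"):
        return line  # column header line
    fields = line.split("\t")
    if len(fields) >= 8:
        fields[7] = fix_info_field(fields[7])
    return "\t".join(fields)
```

Applying fix_vcf_line to every line of an uncompressed VCF and writing the results to a new file would give gnomAD_V2_AC and gnomAD_V2_AN in both the header and the records, while leaving gnomAD_V2_AF unchanged.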

Thank you