gnomAD r4.1.0 for G2P - Githubissues

Describe the issue

Hi,

We are currently using the G2P plugin based on the the global MAF in the locally downloaded gnomAD variants (gnomADg_r3.1.1_GRCh38 and gnomADe_r2.1.1_GRCh38).

The call to G2P is done as

IN_FILE=my.vcf
G2P_LOG_DIR=${G2P_DIR}/my_LOG_DIR
mkdir ${G2P_LOG_DIR}
TXT_OUT=${G2P_LOG_DIR}/my.report.txt
HTML_OUT=${G2P_LOG_DIR}/my.report.html
VCF_KEYS='gnomADe_r2.1.1_GRCh38|gnomADg_r3.1.1_GRCh38'

time ${VEP} \
    -i ${IN_FILE} \
    --output_file ${G2P_LOG_DIR}/my_inter_out.txt \
    --force_overwrite \
    --assembly GRCh38 \
    --fasta ${REFERENCE_GENOME} \
    --offline \
    --merged \
    --use_given_ref \
    --cache --cache_version 100 \
    --dir_cache /path/to/genomes/Hsapiens/hg38/vep \
    --individual all \
    --transcript_filter "gene_symbol in /path/to/G2P/genes_in_DDG2P.20240124.txt" \
    --dir_plugins /path/to/ensembl-vep-100.4-0 \
    --plugin G2P,file='path/to/DDG2P.20240124.csv',af_from_vcf=1,confidence_levels='definitive&strong&moderate',af_from_vcf_keys=${VCF_KEYS},log_dir=${G2P_LOG_DIR},txt_report=${TXT_OUT},html_report=${HTML_OUT}

To facilitate using the global MAF annotation from the locally downloaded gnomAD VCFs (and only them) we have modified the ensembl-vep-100.4-0/Bio/EnsEMBL/Variation/DBSQL/vcf_config.json file (see https://www.ebi.ac.uk/gene2phenotype/g2p_vep_plugin, the af_from_vcf parameter), and the json file looks like this

{
  "collections": [
    {
      "id": "gnomADg_r3.1.1_GRCh38",
      "description": "Genome Aggregation Database genomes r3.1.1",
      "species": "homo_sapiens",
      "assembly": "GRCh38",
      "type": "local",
      "filename_template": "/path/to/r3.1.1/genomes/gnomad.genomes.v3.1.1.sites.chr###CHR###.vcf.bgz",
      "chromosomes": [
        "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14","15", "16", "17", "18", "19", "20", "21", "22", "X", "Y"
      ],
      "use_seq_region_synonyms": 1,
      "population_prefix": "gnomADg:",
      "population_display_group": {
        "display_group_name": "gnomAD genomes",
        "display_group_priority": 1.6
      },
      "populations": {
        "9900028": {
          "name": "gnomADg:ALL",
          "_raw_name": "",
          "description": "All gnomAD genomes individuals"
        }
      }
    },
    {
      "id": "gnomADe_r2.1.1_GRCh38",
      "description": "Genome Aggregation Database exomes r2.1.1 liftover to GRCh38",
      "species": "homo_sapiens",
      "assembly": "GRCh38",
      "type": "local",
      "filename_template": "/path/to/r2.1.1/exomes/gnomad.exomes.r2.1.1.sites.###CHR###.liftover_grch38.vcf.bgz",
      "chromosomes": [
        "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14","15", "16", "17", "18", "19", "20", "21", "22", "X", "Y"
      ],
      "population_prefix": "gnomADe:",
      "population_display_group": {
        "display_group_name": "gnomAD exomes",
        "display_group_priority": 1.4
      },
      "populations": {
        "9900010": {
          "name": "gnomADe:ALL",
          "_raw_name": "",
          "description": "All gnomAD exomes individuals"
        }
      }
    }
  ]
}

This is all working.

With the release of gnomAD r4.1.0, we would like to update our scripts to using it and replace the old gnomAD VCFs.

My questions are related to (after downloading the gnomAD r.4.1.0 data and adjusting accordingly id,description and filename_template ) how should I adapt the vcf_config.json file?

should I keep this line: "use_seq_region_synonyms": 1? (and why there is no equivalent for exomes)
do these matter "display_group_priority": 1.6 (for genomes) and "display_group_priority": 1.4?
what will be appropriate values for the populations IDs - currently 9900028 for genomes and 9900010 for exomes?

Thank you in advance, Mike

Ensembl / ensembl-vep

gnomAD r4.1.0 for G2P #1697

Describe the issue