Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0
437 stars 150 forks source link

gnomAD r4.1.0 for G2P #1697

Open MikeHala opened 2 weeks ago

MikeHala commented 2 weeks ago

Describe the issue

Hi,

We are currently using the G2P plugin based on the the global MAF in the locally downloaded gnomAD variants (gnomADg_r3.1.1_GRCh38 and gnomADe_r2.1.1_GRCh38).

The call to G2P is done as

IN_FILE=my.vcf
G2P_LOG_DIR=${G2P_DIR}/my_LOG_DIR
mkdir ${G2P_LOG_DIR}
TXT_OUT=${G2P_LOG_DIR}/my.report.txt
HTML_OUT=${G2P_LOG_DIR}/my.report.html
VCF_KEYS='gnomADe_r2.1.1_GRCh38|gnomADg_r3.1.1_GRCh38'

time ${VEP} \
    -i ${IN_FILE} \
    --output_file ${G2P_LOG_DIR}/my_inter_out.txt \
    --force_overwrite \
    --assembly GRCh38 \
    --fasta ${REFERENCE_GENOME} \
    --offline \
    --merged \
    --use_given_ref \
    --cache --cache_version 100 \
    --dir_cache /path/to/genomes/Hsapiens/hg38/vep \
    --individual all \
    --transcript_filter "gene_symbol in /path/to/G2P/genes_in_DDG2P.20240124.txt" \
    --dir_plugins /path/to/ensembl-vep-100.4-0 \
    --plugin G2P,file='path/to/DDG2P.20240124.csv',af_from_vcf=1,confidence_levels='definitive&strong&moderate',af_from_vcf_keys=${VCF_KEYS},log_dir=${G2P_LOG_DIR},txt_report=${TXT_OUT},html_report=${HTML_OUT}

To facilitate using the global MAF annotation from the locally downloaded gnomAD VCFs (and only them) we have modified the ensembl-vep-100.4-0/Bio/EnsEMBL/Variation/DBSQL/vcf_config.json file (see https://www.ebi.ac.uk/gene2phenotype/g2p_vep_plugin, the af_from_vcf parameter), and the json file looks like this

{
  "collections": [
    {
      "id": "gnomADg_r3.1.1_GRCh38",
      "description": "Genome Aggregation Database genomes r3.1.1",
      "species": "homo_sapiens",
      "assembly": "GRCh38",
      "type": "local",
      "filename_template": "/path/to/r3.1.1/genomes/gnomad.genomes.v3.1.1.sites.chr###CHR###.vcf.bgz",
      "chromosomes": [
        "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14","15", "16", "17", "18", "19", "20", "21", "22", "X", "Y"
      ],
      "use_seq_region_synonyms": 1,
      "population_prefix": "gnomADg:",
      "population_display_group": {
        "display_group_name": "gnomAD genomes",
        "display_group_priority": 1.6
      },
      "populations": {
        "9900028": {
          "name": "gnomADg:ALL",
          "_raw_name": "",
          "description": "All gnomAD genomes individuals"
        }
      }
    },
    {
      "id": "gnomADe_r2.1.1_GRCh38",
      "description": "Genome Aggregation Database exomes r2.1.1 liftover to GRCh38",
      "species": "homo_sapiens",
      "assembly": "GRCh38",
      "type": "local",
      "filename_template": "/path/to/r2.1.1/exomes/gnomad.exomes.r2.1.1.sites.###CHR###.liftover_grch38.vcf.bgz",
      "chromosomes": [
        "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14","15", "16", "17", "18", "19", "20", "21", "22", "X", "Y"
      ],
      "population_prefix": "gnomADe:",
      "population_display_group": {
        "display_group_name": "gnomAD exomes",
        "display_group_priority": 1.4
      },
      "populations": {
        "9900010": {
          "name": "gnomADe:ALL",
          "_raw_name": "",
          "description": "All gnomAD exomes individuals"
        }
      }
    }
  ]
}

This is all working.

With the release of gnomAD r4.1.0, we would like to update our scripts to using it and replace the old gnomAD VCFs.

My questions are related to (after downloading the gnomAD r.4.1.0 data and adjusting accordingly id,description and filename_template ) how should I adapt the vcf_config.json file?

Thank you in advance, Mike

dglemos commented 2 weeks ago

Hi @MikeHala,

We are working on supporting gnomAD version 4.1 for the next Ensembl release which will be early Autumn. You can check the updates here.

should I keep this line: "use_seq_region_synonyms": 1? (and why there is no equivalent for exomes)

use_seq_region_synonyms is necessary when the VCF file has different names from the regions in the Ensembl db. The gnomAD files should work without this option.

do these matter "display_group_priority": 1.6 (for genomes) and "display_group_priority": 1.4?

display_group_priority retrieves the priority of the group of populations enabling the ordering of frequency data on the population genetics webpage. This should not impact your work.

what will be appropriate values for the populations IDs - currently 9900028 for genomes and 9900010 for exomes?

Yes, for genomes it starts on 9900028 and for exomes on 9900010.

Let me know if you have more questions.

Best wishes, Diana