cio-abcd / variantinterpretation

Collaborative Interpretation-Pipeline workflow based on nf-core pipeline structure
MIT License

Set up and configure VEP annotation #1

Closed: sci-kai closed 1 year ago

sci-kai commented 1 year ago

Description of feature

The first major feature of the workflow is variant annotation with ensembl-vep. For this, we should integrate the nf-core VEP module. The flags and arguments for the vep command should be customizable. However, for ZPM purposes we should agree on a set of standard arguments to include by default. I suggest the following flags:

  1. --everything: Shortcut that switches on the most important flags: --sift b, --polyphen b, --ccds, --hgvs, --symbol, --numbers, --domains, --regulatory, --canonical, --protein, --biotype, --af, --af_1kg, --af_esp, --af_gnomade, --af_gnomadg, --max_af, --pubmed, --uniprot, --mane, --tsl, --appris, --variant_class, --gene_phenotype, --mirna
  2. --check_existing: Adds dbSNP IDs and CLIN_SIG fields, e.g. from ClinVar
  3. --no_escape: Do not URL-escape HGVS strings; e.g., a p.A567= mutation would otherwise be written as p.A567%3D
  4. --flag_pick: Adds a flag marking a single picked transcript per variant, based on MANE, canonical status and functional annotation (see here: https://www.ensembl.org/info/docs/tools/vep/script/vep_other.html#pick)
  5. --format vcf & --vcf: Specify VCF as the standard input and output format.
  6. --offline: Enables offline usage, which should be standard procedure and additionally requires the cache to be specified.
  7. --cache: Uses the VEP cache as the annotation database.
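
To make the proposal concrete, here is a minimal Python sketch of how the suggested defaults could be assembled into a single vep call while staying customizable. The helper name `build_vep_command`, the file paths and the `extra_args` parameter are assumptions for illustration only; in the actual pipeline the arguments would presumably be passed through the nf-core module configuration rather than built like this.

```python
import shlex

# Default flags proposed in this issue (all are existing ensembl-vep options).
DEFAULT_VEP_FLAGS = [
    "--everything",
    "--check_existing",
    "--no_escape",
    "--flag_pick",
    "--format", "vcf",
    "--vcf",
    "--offline",
    "--cache",
]


def build_vep_command(input_vcf, output_vcf, cache_dir, extra_args=()):
    """Return the vep command as an argument list.

    input_vcf, output_vcf and cache_dir are hypothetical paths supplied by
    the caller; extra_args lets users append their own flags after the defaults.
    """
    command = ["vep", "--input_file", input_vcf, "--output_file", output_vcf]
    command += DEFAULT_VEP_FLAGS
    command += ["--dir_cache", cache_dir]
    command += list(extra_args)
    return command


if __name__ == "__main__":
    # Print a shell-quoted version of the command for inspection.
    cmd = build_vep_command("sample.vcf", "sample.vep.vcf", "/path/to/vep_cache")
    print(shlex.join(cmd))
```

Keeping the defaults in one list and appending user arguments last means a user can add flags without editing the defaults.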

Further ideas for easy configuration:

biolancer commented 1 year ago

As mentioned during the discussion, it could be helpful to pre-select clinically relevant transcripts based on a predefined reference set, e.g. https://www.lrg-sequence.org/ (Locus Reference Genomic Database).

The database seems to have been deprecated since 21st March 2021, and it recommends the following: "Ensembl and RefSeq transcripts that are specified by the MANE collaboration are preferred for all genes where available to help standardise clinical reporting."

We could thus include a soft-filter routine for clinically relevant transcripts based on MANE in a later iteration.
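
As a rough idea of what such a soft filter might look like, the Python sketch below reads the CSQ annotation that VEP writes with --vcf and marks each transcript entry as MANE or not, without discarding anything. The function names, the `is_mane` key and the example values are hypothetical. It also assumes the MANE annotation appears in a CSQ subfield called `MANE_SELECT`; the field name can differ between VEP versions, so it is a parameter here.

```python
import re


def csq_fields(header_line):
    """Extract the CSQ subfield names from the VEP INFO header line.

    VEP describes the layout as '... Format: Allele|Consequence|...' inside
    the '##INFO=<ID=CSQ,...>' header line of the annotated VCF.
    """
    match = re.search(r'Format: ([^">]+)', header_line)
    if match is None:
        raise ValueError("no CSQ format description found in header line")
    return match.group(1).strip().split("|")


def annotate_mane(csq_value, fields, mane_field="MANE_SELECT"):
    """Split a CSQ value into one dict per transcript and soft-flag MANE entries.

    Nothing is removed: every transcript is kept and only gains an 'is_mane'
    key (a hypothetical name), so downstream steps decide how to use it.
    """
    records = []
    for entry in csq_value.split(","):
        record = dict(zip(fields, entry.split("|")))
        record["is_mane"] = bool(record.get(mane_field))
        records.append(record)
    return records


if __name__ == "__main__":
    # Illustrative header and CSQ value, not taken from real pipeline output.
    header = ('##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence '
              'annotations from Ensembl VEP. Format: '
              'Allele|Consequence|SYMBOL|Feature|MANE_SELECT">')
    csq = ("T|missense_variant|GENE1|ENST_EXAMPLE_1|NM_EXAMPLE.1,"
           "T|missense_variant|GENE1|ENST_EXAMPLE_2|")
    for rec in annotate_mane(csq, csq_fields(header)):
        print(rec["Feature"], rec["is_mane"])
```

Flagging rather than dropping entries keeps the routine a soft filter: non-MANE transcripts stay available for cases where no MANE transcript exists for a gene.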

sci-kai commented 1 year ago

See PR #8.