cio-abcd / variantinterpretation

Collaborative Interpretation-Pipeline workflow based on nf-core pipeline structure
MIT License

Add custom database annotations #22

Open sci-kai opened 1 year ago

sci-kai commented 1 year ago

Description of feature

Currently, the workflow relies on the VEP cache for annotation, but it should also support annotation from local database files, so that other database sources can be added and considered. This also matters for keeping annotations up to date, since the VEP cache is updated less frequently than the underlying databases (e.g., ClinVar receives monthly updates). Users should be able to add databases through the workflow's configuration options, which allows user-specific settings and compatibility with local databases. Some databases could also be integrated via dedicated parameters, e.g., a "--clinvar" flag pointing to the path of a local ClinVar copy.

VEP itself allows the specification of custom databases (see https://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html), which probably fits well into the current workflow structure. Alternatively, vcfanno can also annotate from custom databases (https://github.com/brentp/vcfanno).
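As a rough sketch of how a configured database could be turned into a VEP argument: the helper below assembles one `--custom` option in the key=value form described on the linked VEP page for recent releases (older releases use a positional, comma-separated form instead). The function name `build_vep_custom_arg`, its parameters, and the ClinVar field names in the usage note are illustrative assumptions, not part of the pipeline.

```python
from typing import List, Sequence


def build_vep_custom_arg(
    path: str,
    short_name: str,
    file_format: str = "vcf",
    annotation_type: str = "exact",
    fields: Sequence[str] = (),
) -> List[str]:
    """Return the two command-line tokens for one VEP --custom source.

    Hypothetical helper: assumes the key=value syntax of recent VEP
    releases; check the VEP documentation for the installed version.
    """
    parts = [
        f"file={path}",
        f"short_name={short_name}",
        f"format={file_format}",
        f"type={annotation_type}",
    ]
    if fields:
        # In the key=value form, multiple INFO fields are joined with '%'.
        parts.append("fields=" + "%".join(fields))
    return ["--custom", ",".join(parts)]
```

For example, `build_vep_custom_arg("clinvar.vcf.gz", "ClinVar", fields=["CLNSIG", "CLNREVSTAT"])` yields a `--custom` token pair that can be appended to the VEP command line, one pair per configured database.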

In the future, it would be nice to have a subworkflow that automatically downloads and updates specific databases to reduce the workload of manual updating. This should be an optional, separate workflow step so that the workflow can still be used offline.

sci-kai commented 2 months ago

Just a reminder for when this feature is implemented (this came up in another project): vembrane can evaluate fields in mathematical expressions, e.g., dividing the QUAL value by 10 with the statement QUAL/10. That's why create_vembrane_fields.py treats numbers in CSQ string names as part of such expressions and preserves them when converting to vembrane expressions. Hence, custom annotation names, which can be chosen freely in VEP, must NOT contain numbers if they are to work with the pipeline (e.g., 1000genomes should be thousandgenomes instead). Interestingly, the CSQ string names from the VEP cache do not contain any numbers, so they already follow this convention.
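A minimal sketch of how the pipeline could enforce this naming rule before running VEP, assuming custom annotation names are available as plain strings from the configuration. The function name `check_annotation_name` is hypothetical and not part of create_vembrane_fields.py.

```python
import re

# Any decimal digit in a name would be read as part of a numeric expression.
_DIGIT = re.compile(r"\d")


def check_annotation_name(name: str) -> str:
    """Return the name unchanged, or raise if it contains a digit.

    Hypothetical validation step: names with digits (e.g. '1000genomes')
    are rejected so they cannot be mistaken for vembrane expressions.
    """
    if _DIGIT.search(name):
        raise ValueError(
            f"Custom annotation name {name!r} contains digits; "
            "use a digit-free name such as 'thousandgenomes'."
        )
    return name
```

Failing early with a clear message seems preferable to silently renaming, since the chosen name also appears in the CSQ header and in downstream field selections.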