A Nextflow-wrapped workflow for generating the mutation profiles of SARS-CoV-2 genomes (Variants of Concern and Variants of Interest). The workflow is developed in collaboration with COVID-MVP (https://github.com/cidgoh/COVID-MVP), which can be used to visualize the mutation profiles and functional annotations.
Top priority, to do week of September 9
[ ] Create two separate JSON files for each virus: one to hold ontology names for protein names and symbols, and one to hold ontology names for gene names and symbols. Include a link to something (still to be decided) in the master JSON
[ ] Update functionalannotation.py to generate the new DH template format (generate files for SC2 and MPOX, then push to the repo)
[ ] Check that the surveillance report generation script works, and change it to produce one TSV per GVF (MRI to do this part) and one PDF per TSV (MZA)
[x] Ask Damion about "Not Applicable" menu
[ ] Review template with Zohaib and Emma when Emma is back (early next week), then submit v1 for release
Changes to JSON
[ ] The gene key comes from the GFF file; check whether it's needed in the workflow and, if not, remove it from the JSON and keep vcf_gene in the GVF alone
[x] protein_alias comes from the manually curated key, virus_genomeAnnotation; update this key file to have list values, removing alias names that are the same as the gene name (?) [check with Ivan first to see if he needs protein_alias for the visualization]
[x] change RdRp protein_alias to nsp12
[x] add orf1ab to protein_alias list for orf1b entry
[x] remove pokay_id key, as protein_alias list will contain the Pokay id already
[x] in Pokay itself, rename proteins as needed (e.g. Plpro -> PL_pro) to match protein_alias entries. MRI: doing this will break the way functionalannotation.py separates the protein names from the functional category, so I'm going to hardcode this temporarily. In the future, Pokay won't use these filenames anymore, and this issue will be solved.
[x] automatically add ontology names from ontology name files to JSON (see 'Top Priority', above)
[ ] Change GVF keys to match the corresponding JSON keys (product, gene if still in use, and protein_alias), and notify Ivan of changes that will affect the visualization
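The protein_alias clean-up in this list could be sketched roughly as follows. This is only an illustration under stated assumptions: the entry layout, the helper name normalize_entry, and the sample values are hypothetical and are not taken from the workflow's actual JSON.

```python
def normalize_entry(gene_name, entry):
    """Return a cleaned copy of one annotation entry.

    Hypothetical layout: `entry` is a dict that may hold `protein_alias`
    (string or list) and a legacy `pokay_id` key.
    """
    cleaned = dict(entry)

    # protein_alias should always be a list, even for a single alias.
    aliases = cleaned.get("protein_alias", [])
    if isinstance(aliases, str):
        aliases = [aliases]

    # Drop aliases identical to the gene name, and drop repeats,
    # while preserving the original order.
    kept = []
    for alias in aliases:
        if alias != gene_name and alias not in kept:
            kept.append(alias)
    cleaned["protein_alias"] = kept

    # pokay_id is redundant once the alias list carries the Pokay id.
    cleaned.pop("pokay_id", None)
    return cleaned


# Made-up example entries, for illustration only.
example_s = normalize_entry("S", {"protein_alias": "S", "pokay_id": "S"})
example_orf1b = normalize_entry(
    "orf1b", {"protein_alias": ["orf1b", "orf1ab"], "pokay_id": "orf1ab"}
)
```

The comparison against the gene name is exact here; whether it should be case-insensitive is an open question that the list above leaves to be confirmed.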
Future work
[ ] Implement one-to-many DataHarmonizer template functionality to deal with multiple mutations
[ ] Add HGVS format checks in the DH template itself (regex for the basic format) and in addfunctions2gvf.py (e.g. print a log of unmatched names) to make sure mutation names entered by the user match those in our functional annotation database
[ ] Implement a way for a user to add their functional annotations on top of our already-annotated data, at the end of the workflow after a GVF has been created
[ ] Use GitHub Actions to auto-update functional annotation (generate files for SC2 + MPOX) - some code already written, using nonstandardized Pokay terms for now
[ ] Review new category names with Paul
[ ] Embed DataHarmonizer template in VIRUS-MVP website
[ ] Update Pokay to use new standardized names (after approval from Paul and everybody)
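As a starting point for the HGVS format check listed under future work, a basic regex plus an unmatched-name log might look like the sketch below. The pattern (one-letter protein substitutions and simple deletions with a "p." prefix), the function name check_mutation_names, and the sample names are all assumptions for illustration; they do not reflect the actual addfunctions2gvf.py code or the naming used in the functional annotation database.

```python
import logging
import re

logger = logging.getLogger("hgvs_check")

# Assumed "basic format": p.D614G style substitutions and
# p.H69del / p.H69_V70del style deletions. Deliberately not full HGVS.
BASIC_HGVS = re.compile(
    r"p\.(?:[A-Z][0-9]+[A-Z*]|[A-Z][0-9]+(?:_[A-Z][0-9]+)?del)"
)


def check_mutation_names(user_names, known_names):
    """Split user-entered names into malformed and unmatched lists.

    `known_names` stands in for the set of mutation names present in
    the functional annotation database (hypothetical input).
    """
    malformed = []
    unmatched = []
    for name in user_names:
        if not BASIC_HGVS.fullmatch(name):
            malformed.append(name)
            logger.warning("Malformed mutation name: %s", name)
        elif name not in known_names:
            unmatched.append(name)
            logger.warning("No functional annotation entry for: %s", name)
    return malformed, unmatched


# Made-up example input, for illustration only.
malformed, unmatched = check_mutation_names(
    ["p.D614G", "D614G", "p.N501Y", "p.H69_V70del"],
    {"p.D614G", "p.H69_V70del"},
)
```

The same pattern string could in principle be reused for the check in the DH template, so that both checks accept the same names.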
Slides from September 9 update are here.