CDPHE-bioinformatics / CDPHE-SARS-CoV-2

Workflows and scripts for the assembly and analysis of SARS-CoV-2 whole genome tiled amplicon sequencing.
https://cdphe-bioinformatics.github.io/CDPHE-SARS-CoV-2/
GNU General Public License v3.0

[FEATURE] Deprecate nextclade_json_parser.py and use VCF files for Mutations Tracking #23

Open molly-hetheringtonrauth opened 2 months ago

molly-hetheringtonrauth commented 2 months ago

Feature Request

As a group, we have decided to deprecate the "nextclade_json_parser.py" script because we can get the same information from vcf files. This has several effects, described below.

Solution

1) Remove the nextclade json parser task from "lineage_calling_and_results.wdl".
2) ONT assembly: currently the medaka task outputs a vcf file without amino acid (AA) annotations. We can use SnpEff to annotate the vcf file with AAs. We may want to convert the annotated vcf file to a tsv, if possible.
3) We will want to be able to pull the S gene mutations out of the vcf/tsv file for item 4 below.
4) "summary.py": currently we pull the clade, total amino acid substitutions, etc. from the "nextclade_results.csv" file output by the nextclade json parser, and pull the S gene mutations out of the "nextclade_variants.csv" file output by the same parser. We can instead pull the clade, total amino acid substitutions, etc. out of the "nextclade.csv" file output by the nextclade task itself. We will need to decide how best to recreate the S gene mutations column.
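One possible way to recreate the S gene mutations column from a SnpEff-annotated vcf is sketched below. It is only a starting point and rests on several assumptions: that SnpEff writes its standard pipe-delimited `ANN` INFO field (gene name in the 4th sub-field, HGVS.p in the 11th), that the spike gene is named `S` in whichever SnpEff database we end up using, and that the protein-level notation (e.g. `p.Asp614Gly`) is acceptable for the column. The function name `s_gene_mutations` and the example records are hypothetical and hand-written for illustration; they are not real pipeline output.

```python
# Sub-field positions inside one SnpEff ANN entry (standard ANN layout).
ANN_GENE_NAME = 3
ANN_HGVS_P = 10


def s_gene_mutations(vcf_lines, gene="S"):
    """Return the HGVS.p changes annotated on `gene`, in file order, without duplicates.

    `gene="S"` is an assumption about how the SnpEff database names the spike gene.
    """
    mutations = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        for entry in fields[7].split(";"):  # INFO column
            if not entry.startswith("ANN="):
                continue
            # One ANN value can hold several comma-separated annotations.
            for ann in entry[len("ANN="):].split(","):
                parts = ann.split("|")
                if len(parts) <= ANN_HGVS_P:
                    continue
                hgvs_p = parts[ANN_HGVS_P]
                if parts[ANN_GENE_NAME] == gene and hgvs_p and hgvs_p not in mutations:
                    mutations.append(hgvs_p)
    return mutations


# Hand-written, illustrative records (IDs and coordinates are placeholders).
EXAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "MN908947.3\t23403\t.\tA\tG\t50\tPASS\tDP=100;ANN=G|missense_variant|MODERATE|S|"
    "gene_id|transcript|tx_id|protein_coding|1/1|c.1841A>G|p.Asp614Gly|"
    "1841/3822|1841/3822|614/1273||",
    "MN908947.3\t14408\t.\tC\tT\t50\tPASS\tDP=100;ANN=T|missense_variant|MODERATE|ORF1ab|"
    "gene_id|transcript|tx_id|protein_coding|1/1|c.14144C>T|p.Pro4715Leu|"
    "14144/21291|14144/21291|4715/7096||",
])

if __name__ == "__main__":
    # Join into a single cell value, e.g. for a summary column.
    print(",".join(s_gene_mutations(EXAMPLE_VCF.splitlines())))
```

If we would rather keep the short `D614G` style in the summary column, the three-letter HGVS.p codes would need an extra conversion step, which is not shown here.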

Upstream effects

None outside of the lineage_calling_and_results.wdl

Downstream effects

If any column changes in the summary_results output file, this could have implications for BigQuery and the Covid QC notebook.
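To catch column changes before they reach BigQuery or the notebook, something like the following check could compare the header of an old summary_results file against a new one. This is a rough sketch: the helper names `read_header` and `column_changes` and the column names in the example are hypothetical, and it assumes the file is a delimited text file with a single header row.

```python
import csv
import io


def read_header(handle, delimiter=","):
    """Return the first row of a delimited file as a list of column names."""
    return next(csv.reader(handle, delimiter=delimiter), [])


def column_changes(old_header, new_header):
    """Report columns that were removed, added, or reordered between two headers."""
    old, new = list(old_header), list(new_header)
    shared_old_order = [c for c in old if c in new]
    shared_new_order = [c for c in new if c in old]
    return {
        "removed": [c for c in old if c not in new],
        "added": [c for c in new if c not in old],
        "reordered": shared_old_order != shared_new_order,
    }


if __name__ == "__main__":
    # Hypothetical column names, for illustration only.
    old_file = io.StringIO("sample_name,clade,spike_mutations\n")
    new_file = io.StringIO("sample_name,clade,s_gene_mutations\n")
    print(column_changes(read_header(old_file), read_header(new_file)))
```

Any non-empty `removed` list would be the case most likely to break a downstream schema.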