CDPHE-bioinformatics / CDPHE-SARS-CoV-2

Workflows and scripts for the assembly and analysis of SARS-CoV-2 whole genome tiled amplicon sequencing.
https://cdphe-bioinformatics.github.io/CDPHE-SARS-CoV-2/
GNU General Public License v3.0

Hardcode v2 while we transition to v3 which solves --reference flag error and more #7

Closed by danpolanco 5 months ago

danpolanco commented 5 months ago

Nextclade removed the --reference flag:

The argument `--reference` (alias `-r`) is removed.

Nextclade datasets are now identified only by their name (`--name`) and, optionally, a version tag (`--tag`). All other attributes are now included into the name.

In order to list all dataset names, type:

nextclade dataset list --names-only

For more information, type

nextclade dataset get --help

Read Nextclade documentation at:

https://docs.nextstrain.org/projects/nextclade/en/stable

https://github.com/CDPHE-bioinformatics/CDPHE-SARS-CoV-2/blob/e73df95c0ba4cdca7c7ea8e6561b70279879e2b9/workflows/SC2_lineage_calling_and_results.wdl#L191

danpolanco commented 5 months ago

I did a fresh install of Nextclade via conda (conda install -c bioconda nextclade) and it isn't as up to date as what we are using in the WDL.

(rsv_pipeline) ➜  cdphe-sars-cov-2 git:(fix/nexclade/ref_argument) nextclade dataset list --names-only
error: unexpected argument '--names-only' found

  tip: a similar argument exists: '--name'

Usage: nextclade dataset list <--name <NAME>|--search <SEARCH>>

For more information, try '--help'.

I then noticed that the documentation says "Note that new versions may appear on bioconda with some delay (hours to days). This is due to long submission and approval cycle of bioconda. We recommend using standalone installation or Docker containers for most up-to-date versions."

So for testing, I'm going to use the Docker version.

danpolanco commented 5 months ago

The Docker version (docker pull nextstrain/nextclade:latest) also rejects the --names-only flag as invalid, and it reports the same version as the conda install (3.0.0).

danpolanco commented 5 months ago

Here is the output of nextclade dataset list: [image]

Which means we now need to use: --name='nextstrain/sars-cov-2/wuhan-hu-1/proteins'
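As a minimal sketch of what the updated download call could look like, the helper below assembles a `nextclade dataset get` command that identifies the dataset by `--name` only, with an optional `--tag`, and never passes `--reference`. The function name `dataset_get_command` and the output directory `nextclade_dataset` are hypothetical; the actual command in the WDL may carry other arguments.

```python
import shlex


def dataset_get_command(name, output_dir, tag=None):
    """Build a `nextclade dataset get` argument list without --reference.

    `name` is the full dataset name; `tag` is an optional version tag.
    This helper is illustrative and not part of the CDPHE workflow.
    """
    cmd = ["nextclade", "dataset", "get", "--name", name, "--output-dir", output_dir]
    if tag is not None:
        cmd += ["--tag", tag]
    return cmd


# Hypothetical output directory; the dataset name is the one from this thread.
cmd = dataset_get_command(
    "nextstrain/sars-cov-2/wuhan-hu-1/proteins",
    "nextclade_dataset",
)
print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the command safe to hand to `subprocess.run` without shell quoting concerns.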

danpolanco commented 5 months ago

Using --name='nextstrain/sars-cov-2/wuhan-hu-1/proteins' fixed the download step, but the change to Nextclade 3.0.0 causes more issues downstream:

Traceback (most recent call last):
  File "/cromwell_root/terra_workspace_references/covid/nextclade_json_parser.py", line 219, in <module>
    extract_variant_list(json_path = nextclade_json, project_name = project_name, workflow_version = workflow_version)
  File "/cromwell_root/terra_workspace_references/covid/nextclade_json_parser.py", line 78, in extract_variant_list
    gene=item['gene']
KeyError: 'gene'

I'm assuming the Nextclade output format changed, which broke nextclade_json_parser.py.
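One way the parser could tolerate both output formats is to look up the gene name under more than one key. The sketch below assumes the v3 JSON stores the value under `cdsName` instead of `gene`; that key name is an assumption and should be checked against the Nextclade V3 Migration Guide. The helper `mutation_gene` is hypothetical and does not exist in nextclade_json_parser.py.

```python
def mutation_gene(item):
    """Return the gene/CDS name from one amino-acid mutation record.

    Tries the v2 key first, then the key assumed for v3 output.
    """
    # "cdsName" is an assumed v3 field name, not confirmed in this thread.
    for key in ("gene", "cdsName"):
        if key in item:
            return item[key]
    raise KeyError(
        f"no gene key found; keys present: {sorted(item)}"
    )


# Illustrative records only, not real Nextclade output.
v2_style = {"gene": "S", "codon": 500}
v3_style = {"cdsName": "S", "pos": 500}
print(mutation_gene(v2_style), mutation_gene(v3_style))
```

Listing the keys that were present in the error message makes a future format change easier to diagnose than a bare `KeyError: 'gene'`.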

That means it might be better to use the previous version of nextclade for this week's WWT results and then fix this issue.

danpolanco commented 5 months ago

There are many more changes mentioned in the Nextclade V3 Migration Guide, so for now, to fix this issue we are going to hardcode Nextclade 2.14.0 into the lineage calling WDL.
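While the version is pinned, a small guard could fail fast if a different Nextclade ends up on the path. This is a sketch under the assumption that `nextclade --version` prints a line containing a `MAJOR.MINOR.PATCH` string (for example `nextclade 2.14.0`); the function names and the `PINNED_VERSION` constant are hypothetical.

```python
import re

# Version this thread settles on for the lineage calling WDL.
PINNED_VERSION = "2.14.0"


def parse_version(version_output):
    """Extract (major, minor, patch) from text such as 'nextclade 2.14.0'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"no version found in {version_output!r}")
    return tuple(int(part) for part in match.groups())


def is_pinned(version_output, pinned=PINNED_VERSION):
    """Return True when the reported version equals the pinned one."""
    return parse_version(version_output) == parse_version(pinned)


# The strings below stand in for the assumed `nextclade --version` output.
print(is_pinned("nextclade 2.14.0"))
print(is_pinned("nextclade 3.0.0"))
```

In practice the input string would come from something like `subprocess.run(["nextclade", "--version"], capture_output=True, text=True).stdout`.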