epi2me-labs / wf-artic

ARTIC SARS-CoV-2 workflow and reporting
https://labs.epi2me.io/

Nextclade v2 no longer maintained #107

Closed ammaraziz closed 8 months ago

ammaraziz commented 9 months ago

Operating System

Ubuntu 22.04

Other Linux

No response

Workflow Version

All

Workflow Execution

Command line

EPI2ME Version

No response

CLI command run

No response

Workflow Execution - CLI Execution Profile

None

What happened?

Nextclade has been updated to V3.0.0, and this update includes a change to the datasets. The v2.X datasets are in archive mode and all updates will be pushed to V3. The current setup for this pipeline uses V2.14.0.

There are a few other changes worth mentioning. See the full details in the migration guide: https://github.com/nextstrain/nextclade_data/blob/master/docs/migration-guide-v3.md

Without updates to the workflow-glue report and the ONT docker images, all lineage calls will become outdated.
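
To check which side of this split a given installation is on, it is enough to look at the major version. The sketch below is illustrative only: `major_version` and `is_maintained_major` are hypothetical helper names, and the input is assumed to be a string shaped like `nextclade 2.14.0` or `3.0.0`.

```shell
# Extract the major version from a string such as "nextclade 2.14.0" or "V3.0.0".
# Assumption: the version is the last space-separated token.
major_version() {
    local version="${1##* }"      # drop everything up to the last space
    version="${version#[vV]}"     # tolerate a leading "v" or "V"
    echo "${version%%.*}"         # keep only the part before the first dot
}

# Succeed if the given major version still receives dataset updates.
# Per this issue: v2.X datasets are archived, updates go to v3.
is_maintained_major() {
    [ "$1" -ge 3 ]
}
```

For example, `is_maintained_major "$(major_version "nextclade 2.14.0")"` returns a non-zero status, which a wrapper script could turn into a warning.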

Relevant log output

NA

Application activity log entry

No response

ivan-aksamentov commented 9 months ago

Nextclade developer here. Let me know, folks, if you need help with the upgrade to v3 (in that case, feel free to either mention my nickname loudly or submit issues).

> --output-errors is now --output-csv, the output is different than the previous output

The format is the same. It's just that --output-csv and --output-tsv contain all possible columns, including the ones related to errors and warnings, whereas --output-errors only contained the ones related to errors and warnings. So if your CSV/TSV processing does not depend on the number of columns and their order, then --output-tsv/--output-csv is a drop-in replacement.
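
One way to make downstream parsing independent of column count and order is to select columns by header name rather than by position. A minimal sketch using awk, not taken from the workflow: `select_columns` is a hypothetical helper, and the file and column names in the usage note are only examples.

```shell
# Print the named columns of a TSV file, in the order requested,
# regardless of where they sit in the input.
# Usage: select_columns FILE NAME [NAME...]
select_columns() {
    local file="$1"
    shift
    local wanted
    wanted="$(IFS=,; echo "$*")"   # join the requested names with commas
    awk -F'\t' -v OFS='\t' -v wanted="$wanted" '
        NR == 1 {
            # Map each header name to its position.
            n = split(wanted, names, ",")
            for (i = 1; i <= NF; i++) pos[$i] = i
            for (j = 1; j <= n; j++) {
                if (!(names[j] in pos)) {
                    print "missing column: " names[j] > "/dev/stderr"
                    exit 1
                }
            }
        }
        {
            # Emit the requested fields for the header and every data row.
            line = $(pos[names[1]])
            for (j = 2; j <= n; j++) line = line OFS $(pos[names[j]])
            print line
        }
    ' "$file"
}
```

Something like `select_columns results.tsv seqName clade` would then keep working if extra columns appear or the order changes, and would fail loudly if a requested column is missing.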

> tag.json, qc.json and virus_properties.json got merged into a single file, called pathogen.json

If official datasets are used, then this probably does not matter. As far as I understand, the request was to keep official v2 datasets up-to-date. However I also see that there are copies of v2 datasets stored in this repo. I don't know if they are exact copies or modified in some way or used outside of running nextclade with them. I am not familiar with this project, but let me know if dataset format change in v3 is significant here.
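
To find out which layout a local dataset copy uses, the file names quoted above are enough to tell the two apart. A minimal sketch, assuming the dataset is an unpacked directory; `dataset_format` is a hypothetical helper, not part of Nextclade or this workflow.

```shell
# Guess the dataset generation of a directory from the files it contains.
# v3 datasets carry a single pathogen.json; v2 datasets carry
# tag.json, qc.json and virus_properties.json instead.
dataset_format() {
    local dir="$1"
    if [ -f "$dir/pathogen.json" ]; then
        echo "v3"
    elif [ -f "$dir/tag.json" ] || [ -f "$dir/qc.json" ] || [ -f "$dir/virus_properties.json" ]; then
        echo "v2"
    else
        echo "unknown"
        return 1
    fi
}
```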

> primers.csv was removed.

We were asked to bring back the primers.csv feature. The input format was so bad that we had no idea anyone was actually using it. The comeback hasn't happened yet, but it is planned for the very near future (roughly a couple of days to a couple of weeks).

ammaraziz commented 9 months ago

> If official datasets are used, then this probably does not matter. As far as I understand, https://github.com/nextstrain/nextclade/issues/1397 was to keep official v2 datasets up-to-date. However I also see that there are copies of v2 datasets stored in this repo. I don't know if they are exact copies or modified in some way or used outside of running nextclade with them. I am not familiar with this project, but let me know if dataset format change in v3 is significant here.

Users (i.e. me!) can specify a flag that retrieves the latest version of the SC2 dataset. So as far as I know, they are identical. The only issue is the lack of V2 dataset updates.

Good to know the majority of changes only affect the parsing in a minor way.

ONT folks, if you create a new Docker image with Nextclade v3, I can do the testing and submit a pull request to the workflow-glue CLI tool.

cjalder commented 9 months ago

Hi Both,

Thanks for highlighting this and offering your help in the matter. We will put it on our development roadmap and hopefully have a fix soon!

corneliusroemer commented 9 months ago

@cjalder Let us know if you have any questions migrating from 2->3, I'm another dev of Nextclade.

The first v3-only dataset release, i.e. the point at which v2 starts falling behind, would be in around 2 weeks.

New lineages wouldn't appear - but everything keeps working otherwise. Of course one wants to transition to v3, this is just to allow a severity estimate.

corneliusroemer commented 8 months ago

There's a way you could extend your runway by downloading and overwriting just the tree from v3 and otherwise continue with v2.

This could be as simple as adding the following to the dataset download code:

# Optionally overwrite the v2 dataset tree with the tree from a pinned v3 release.
if [ "${USE_NEXTCLADE_V3_TREE:-false}" = "true" ]; then
    NEXTCLADE_V3_RELEASE_TIMESTAMP="2024-01-16--20-31-02Z"
    curl -fsSL "https://data.clades.nextstrain.org/v3/nextstrain/sars-cov-2/wuhan-hu-1/orfs/${NEXTCLADE_V3_RELEASE_TIMESTAMP}/tree.json" > dataset_path/tree.json
fi

to override the v2 tree with a specified v3 tree.

Nextclade v2 can use that v3 tree without issues.
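
If the override is adopted, it may be worth guarding against partial downloads so that a failed request never leaves a truncated tree.json in the dataset. A sketch under assumptions: `v3_tree_url` and `fetch_tree` are hypothetical function names, the URL layout is copied from the snippet above, and the destination is whatever directory the workflow unpacks the v2 dataset into.

```shell
# Build the tree.json URL for a pinned v3 dataset release.
v3_tree_url() {
    local timestamp="$1"
    echo "https://data.clades.nextstrain.org/v3/nextstrain/sars-cov-2/wuhan-hu-1/orfs/${timestamp}/tree.json"
}

# Download URL to DEST through a temporary file in the same directory.
# DEST is only replaced if curl succeeds and the result is non-empty.
fetch_tree() {
    local url="$1" dest="$2" tmp
    tmp="$(mktemp "${dest}.XXXXXX")" || return 1
    if curl -fsSL "$url" -o "$tmp" && [ -s "$tmp" ]; then
        mv "$tmp" "$dest"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

Usage would then be along the lines of `fetch_tree "$(v3_tree_url "2024-01-16--20-31-02Z")" dataset_path/tree.json`, with the existing v2 tree left in place if the download fails.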

This could be added (after nextflowification from my bash above) here:

https://github.com/epi2me-labs/wf-artic/blob/5b047ec9a94307f48e1edff2382c41ba59ad11ae/main.nf#L323-L326

corneliusroemer commented 8 months ago

You can see how minimal the required changes are to keep using the v2 binary but keep getting dataset updates in this PR.

I haven't tested the nextflow, but the bash works locally for me: #109

mattdmem commented 8 months ago

Thanks all

We're updating to Nextclade v3 in the next release and will try to get this out as soon as we can.

Matt

ammaraziz commented 8 months ago

@mattdmem Thank you for the V3 upgrade!