chanzuckerberg / idseq-workflows

Portable WDL workflows for IDseq production pipelines
https://idseq.net/
MIT License

Replace cdhit-dup with idseq-dedup #50

Closed morsecodist closed 4 years ago

morsecodist commented 4 years ago

The changes

I replaced cdhit-dup with idseq-dedup. I had to replace it in a few ways:

That covers all of the actual logical changes. In addition, I needed to rename a lot of things because we were referencing cdhit-dup by name all over the place. Everything that is logically tied to idseq-dedup is named after idseq-dedup, for example, the code that parses the cluster file format. Everything that relates to the concept of duplicate clusters in general got a generic name based on "duplicate clusters". I felt this should have been done originally, as these things are not specific to cdhit-dup. I renamed the following:

How I made sure I got everything

First I made the logical changes, made the necessary changes to the idseq monorepo, and ran it to make sure it worked. This is the (v1 compat) tag since this works with the same tag on my commits in the monorepo.

Then I searched the code for cd.?hit, ignoring case. For every instance I found, I looked into its significance, then did an informed replacement of that specific slice. By the end the only match was in the change log in the readme.
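For anyone who wants to repeat the search, here is a minimal sketch of the same pattern in Python. The helper name and the sample lines are made up for illustration; they are not taken from the repo.

```python
import re

# Same pattern as in the description: "cd", one optional character, "hit",
# matched case-insensitively so it catches cdhit, cd-hit, CD_HIT, and so on.
CDHIT_PATTERN = re.compile(r"cd.?hit", re.IGNORECASE)


def find_cdhit_mentions(lines):
    """Return (line_number, line) pairs for lines that mention cdhit in any spelling."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if CDHIT_PATTERN.search(line)
    ]


# Hypothetical sample lines, only to show which spellings match.
sample = [
    "run_cdhit_dup(input_fasta)",
    "CD-HIT-DUP cluster sizes",
    "parse_duplicate_clusters(path)",
    "cd_hit output",
]
mentions = find_cdhit_mentions(sample)
```

In this sample, every line except the one using the generic "duplicate clusters" name is reported.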

Testing

I tested this manually in dev by running it with run_sfn.py. This requires the changes here to work. Note that the tests there are failing because this hasn't been merged in yet. Once it is, those tests will cover this as well.

A note on the monorepo

If the inputs or outputs of stages are changed, that is a breaking change to the monorepo. In this case one of the outputs from the host filtering stage (the deduplicated fasta) was removed as an input to subsequent stages, which is the first breaking change. The second breaking change was the renaming of the cluster sizes, which remain an output of host filtering and an input to subsequent stages.
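To make the two breaking changes concrete, here is a rough sketch that diffs a stage's input names before and after. The names below are placeholders chosen for illustration, not the actual identifiers in the workflows.

```python
def interface_changes(old_names, new_names):
    """Compare two lists of stage input (or output) names."""
    old, new = set(old_names), set(new_names)
    return {"removed": sorted(old - new), "added": sorted(new - old)}


# Placeholder names (assumptions): the deduplicated fasta is dropped and
# the cluster sizes are renamed to something not tied to cdhit-dup.
old_inputs = ["dedup_fasta", "cdhit_cluster_sizes"]
new_inputs = ["duplicate_cluster_sizes"]

changes = interface_changes(old_inputs, new_inputs)

# Any removed or added name means callers such as the monorepo must change too.
is_breaking = bool(changes["removed"] or changes["added"])
```

A rename shows up as one removal plus one addition, which is why it counts as breaking even though the data itself is unchanged.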

Deployment

We will need to do a coordinated deploy with updating the version in the DB to run this. There may be a small amount of downtime (a few seconds).

tfrcarvalho commented 4 years ago

Thanks for the great thorough description! I think the plan sounds fine for this change.

We will need to do a coordinated deploy with updating the version in the DB to run this. There may be a small amount of downtime (a few seconds).

How hard would it be to add a flag-guarded alternate path to the workflow? I understand it might be a lot more work, but it could avoid downtime and give us the chance to test and roll back quickly if necessary. It would be great to consider that for the next breaking change.

morsecodist commented 4 years ago

@tfrcarvalho

How hard would it be to add a flag-guarded alternate path to the workflow?

IIRC our flags are handled via db updates in the same way as the pipeline version, so I am not quite sure what that would buy us. Are you referring to deploying such that both can be run at once? If so that is unfortunately not possible since the infra can only support either before or after the change.

tfrcarvalho commented 4 years ago

IIRC our flags are handled via db updates in the same way as the pipeline version, so I am not quite sure what that would buy us.

We can create flags on the pipeline itself to enable/disable a path (the ability to set those could be based on the request from the web, given web flags, but that is a separate process)... Good thing that you used generic names for the inputs and outputs, since in the future we could just have alternative paths for a certain step...

Are you referring to deploying such that both can be run at once? If so that is unfortunately not possible since the infra can only support either before or after the change.

I am curious why the infrastructure could not be made to support both?

I am not saying we go through this process for every change. Also not sure if this specific change would require it since it is a no-op in terms of results. But I think for future improvements of the pipeline it could be interesting to have alternative paths that would allow us to test in prod with beta users.
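As a rough illustration of what I mean by a flag-guarded alternate path, here is a minimal Python sketch. Everything in it is hypothetical: the function names, the flag, and both dedup strategies are placeholders, not the behavior of cdhit-dup or idseq-dedup. The point is only that both paths return the same generic "duplicate clusters" shape, so downstream steps do not care which one ran.

```python
def dedup_exact(reads):
    """Placeholder default path: cluster reads with identical sequences."""
    clusters = {}
    for read_id, sequence in reads:
        clusters.setdefault(sequence, []).append(read_id)
    return list(clusters.values())


def dedup_by_prefix(reads, prefix_length=3):
    """Placeholder alternate path: cluster reads that share a sequence prefix."""
    clusters = {}
    for read_id, sequence in reads:
        clusters.setdefault(sequence[:prefix_length], []).append(read_id)
    return list(clusters.values())


def run_dedup_step(reads, use_alternate_path=False):
    """Pick the implementation from a flag; the output shape is the same either way."""
    step = dedup_by_prefix if use_alternate_path else dedup_exact
    return step(reads)


reads = [("r1", "ACGT"), ("r2", "ACGT"), ("r3", "ACGA")]
default_clusters = run_dedup_step(reads)
alternate_clusters = run_dedup_step(reads, use_alternate_path=True)
```

With a flag like this defaulting to off, rolling back would be a matter of unsetting the flag instead of redeploying.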