Replace cdhit-dup with idseq-dedup

morsecodist commented 4 years ago

The changes

I replaced cdhit-dup with idseq-dedup. I had to replace it in a few ways:

Install idseq-dedup in the dockerfile (short-read-mngs/Dockerfile)
Replace the cdhit-dup pipeline step with a pipeline step that runs idseq-dedup (short-read-mngs/idseq-dag/idseq_dag/steps/run_cdhitdup.py / short-read-mngs/idseq-dag/idseq_dag/steps/run_idseq_dedup.py)
Replace the code we used to parse the clusters outputted by cdhit-dup to code that parses the idseq-dedup output format (short-read-mngs/idseq-dag/idseq_dag/util/cdhit_clusters.py / short-read-mngs/idseq-dag/idseq_dag/util/idseq_dedup_clusters.py)
The cluster parsing code now only requires the output cluster file, while before it required the cluster output file and one of the deduped outputs from the cdhit-dup step. This output was added as an extra input to steps that needed these clusters. I removed this input from all wdl files. (note: this is a breaking change with the idseq monorepo will explain why below)

That covers all of the actual logical changes. In addition to those I also needed to rename a lot because we were referencing cdhit-dup by name all over the place. Everything that is logically tied to idseq-dedup is named after idseq-dedup, for example, parsing the cluster file format. Everything that is related to the concept of duplicate clusters in general I generically called something with duplicate clusters. I felt this should have been done originally as these things are not relevant to cdhit-dup in particular. I renamed the following:

A lot of steps use an output file that we generate called cdhitdup_cluster_sizes. We also had a lot of functions in (short-read-mngs/idseq-dag/idseq_dag/util/count.py) related to creating, saving, and loading these files. I renamed all of these files and functions to names containing duplicate_cluster_sizes. This includes the names of wdl inputs. (note: this is a breaking change with the idseq monorepo will explain why below)
Renamed everything to do with cdhit-dup in the example dag jsons. I was on the fence about doing those since they are no longer relevant but if they are still there, and technically could still be run I felt it was best to update them.

How I made sure I got everything

First I made logical changes and made the necessary changes to the idseq monorepo and ran it to make sure it worked. This is the (v1 compat) tag since this works with the same tag on my commits in the monorepo.

Then I searched the code for cd.?hit ignoring case. Every instance I found, I looked into it's significance, then did an informed replacement of that specific slice. By the end the only match was in the change log in the readme.

Testing

I tested this manually in dev by running manually with run_sfn.py. This requires the changes here to work. Note that the tests there are failing because this hasn't been merged in yet. Once it is those tests will cover this as well.

A note on the monorepo

If the inputs or outputs of stages are changed that is a breaking change to the monorepo. In this case one of the outputs from the host filtering stage (the deduplicated fasta) was removed as inputs to subsequent stages, the first breaking change. The second breaking change was the renaming of the cluster sizes, which remains an output of host filtering and an input to subsequent stages.

Deployment

We will need to do a coordinated deploy with updating the version in the DB to run this. There may be a small amount of downtime (a few seconds).

tfrcarvalho commented 4 years ago

Thanks for the great thorough description! I think the plan sounds fine for this change.

We will need to do a coordinated deploy with updating the version in the DB to run this. There may be a small amount of downtime (a few seconds).

How hard would be to add a flag-guarded alternate path to the workflow? I understand it might be a lot more work, but could avoid downtime, and give us the chance to test and rollback quickly if necessary. It would be great to consider that for the next breaking change.

morsecodist commented 4 years ago

@tfrcarvalho

How hard would be to add a flag-guarded alternate path to the workflow?

IIRC our flags are handled via db updates in the same way as the pipeline version, so I am not quite sure what that would buy us. Are you referring to deploying such that both can be run at once? If so that is unfortunately not possible since the infra can only support either before or after the change.

tfrcarvalho commented 4 years ago

IIRC our flags are handled via db updates in the same way as the pipeline version, so I am not quite sure what that would buy us. We can create flags on the pipeline itself enable/disable (ability to set those could be based on request from the web given web flags, but that is a separate process)... Good thing that you used generic names for the inputs and output since in the future we could just have alternative paths for a certain step...

Are you referring to deploying such that both can be run at once? If so that is unfortunately not possible since the infra can only support either before or after the change.

I am curious why the infrastructure could not be made support both?

I am not saying we go through this process for every change. Also not sure if this specific change would require it since it is a no-op in terms of results. But I think for future improvements of the pipeline it could be interesting to have alternative paths that would allows to test in prod with beta users.

chanzuckerberg / idseq-workflows