Thanks for the great, thorough description! I think the plan sounds fine for this change.

> We will need to do a coordinated deploy along with updating the version in the DB to run this. There may be a small amount of downtime (a few seconds).

How hard would it be to add a flag-guarded alternate path to the workflow? I understand it might be a lot more work, but it could avoid downtime and give us the chance to test and roll back quickly if necessary. It would be great to consider that for the next breaking change.
@tfrcarvalho

> How hard would it be to add a flag-guarded alternate path to the workflow?

IIRC our flags are handled via DB updates in the same way as the pipeline version, so I am not quite sure what that would buy us. Are you referring to deploying such that both can be run at once? If so, that is unfortunately not possible, since the infra can only support either before or after the change.
> IIRC our flags are handled via DB updates in the same way as the pipeline version, so I am not quite sure what that would buy us.

We can create flags on the pipeline itself to enable/disable paths (the ability to set those could be based on a request from the web, given web flags, but that is a separate process)... Good thing you used generic names for the inputs and outputs, since in the future we could just have alternative paths for a certain step...

> Are you referring to deploying such that both can be run at once? If so, that is unfortunately not possible, since the infra can only support either before or after the change.

I am curious why the infrastructure could not be made to support both?

I am not saying we should go through this process for every change. I am also not sure this specific change would require it, since it is a no-op in terms of results. But I think for future improvements of the pipeline it could be interesting to have alternative paths that would allow us to test in prod with beta users.
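To make the alternative-path idea concrete, here is a minimal sketch of what a flag-guarded choice between the two dedup steps could look like. The flag name, the flag source, and the step class names are all assumptions, not existing code:

```python
# Hypothetical sketch only: `flags`, its source, and the step class names
# are assumptions, not code from this PR or from idseq-dag.

def resolve_dedup_step(flags):
    """Pick a dedup implementation based on a pipeline-level flag.

    For this to be useful, both implementations (and the infra each one
    needs) would have to stay deployed side by side, which is the
    sticking point discussed above.
    """
    if flags.get("use_idseq_dedup", False):
        from idseq_dag.steps.run_idseq_dedup import PipelineStepRunIDSeqDedup
        return PipelineStepRunIDSeqDedup
    from idseq_dag.steps.run_cdhitdup import PipelineStepRunCDHitDup
    return PipelineStepRunCDHitDup
```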
**The changes**

I replaced `cdhit-dup` with `idseq-dedup`. I had to replace it in a few ways:

- Replaced `cdhit-dup` with `idseq-dedup` in the dockerfile (`short-read-mngs/Dockerfile`)
- Replaced the `cdhit-dup` pipeline step with a pipeline step that runs `idseq-dedup` (`short-read-mngs/idseq-dag/idseq_dag/steps/run_cdhitdup.py` → `short-read-mngs/idseq-dag/idseq_dag/steps/run_idseq_dedup.py`)
- Changed the code that parses the `cdhit-dup` output to code that parses the `idseq-dedup` output format (`short-read-mngs/idseq-dag/idseq_dag/util/cdhit_clusters.py` → `short-read-mngs/idseq-dag/idseq_dag/util/idseq_dedup_clusters.py`)

That covers all of the actual logical changes. In addition to those, I also needed to do a lot of renaming, because we were referencing `cdhit-dup` by name all over the place. Everything that is logically tied to `idseq-dedup` is named after `idseq-dedup`: for example, parsing the cluster file format. Everything that relates to the general concept of duplicate clusters I gave a generic name containing duplicate clusters. I felt this should have been done originally, as these things are not specific to `cdhit-dup`. I renamed the following:

- The `cdhitdup_cluster_sizes` files. We also had a lot of functions in `short-read-mngs/idseq-dag/idseq_dag/util/count.py` related to creating, saving, and loading these files. I renamed all of these files and functions to names containing `duplicate_cluster_sizes`. This includes the names of WDL inputs. (Note: this is a breaking change with the idseq monorepo; I will explain why below.)
- The references to `cdhit-dup` in the example dag jsons. I was on the fence about updating those since they are no longer relevant, but since they are still there, and technically could still be run, I felt it was best to update them.
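To give a sense of the parsing change, here is a minimal sketch of loading clusters from a dedup output file and deriving the cluster sizes. The two-column CSV format and both function names are assumptions for illustration, not the actual code in `idseq_dedup_clusters.py` or `count.py`:

```python
import csv
from collections import defaultdict

def parse_duplicate_clusters(clusters_path):
    """Build {representative read id: [member read ids]} from a dedup
    cluster file, assuming a two-column CSV of
    (representative read id, read id)."""
    clusters = defaultdict(list)
    with open(clusters_path, newline="") as f:
        for representative, member in csv.reader(f):
            clusters[representative].append(member)
    return clusters

def duplicate_cluster_sizes(clusters):
    """Cluster sizes keyed by representative read id, following the
    generic duplicate-clusters naming adopted in this change."""
    return {rep: len(members) for rep, members in clusters.items()}
```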
**How I made sure I got everything**

First I made the logical changes, made the necessary changes to the idseq monorepo, and ran it to make sure it worked. This is the (v1 compat) tag, since it works with the same tag on my commits in the monorepo.

Then I searched the code for `cd.?hit`, ignoring case. For every instance I found, I looked into its significance, then did an informed replacement of that specific slice. By the end, the only remaining match was in the change log in the readme.
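For anyone who wants to repeat that audit, a small script along these lines reproduces the search. The traversal details are mine; the `cd.?hit` pattern is the one described above:

```python
import pathlib
import re

# Case-insensitive pattern from the writeup; matches "cdhit", "cd-hit",
# "cdhit-dup", "CDHitDup", etc.
PATTERN = re.compile(r"cd.?hit", re.IGNORECASE)

for path in pathlib.Path(".").rglob("*"):
    if not path.is_file():
        continue
    try:
        text = path.read_text(errors="ignore")
    except OSError:
        continue
    for lineno, line in enumerate(text.splitlines(), start=1):
        if PATTERN.search(line):
            print(f"{path}:{lineno}: {line.strip()}")
```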
**Testing**

I tested this manually in dev by running it with `run_sfn.py`. This requires the changes here to work. Note that the tests there are failing because this hasn't been merged in yet; once it is, those tests will cover this as well.
**A note on the monorepo**

If the inputs or outputs of stages are changed, that is a breaking change for the monorepo. In this case, one of the outputs of the host filtering stage (the deduplicated fasta) was removed as an input to subsequent stages; that is the first breaking change. The second breaking change was the renaming of the cluster sizes file, which remains an output of host filtering and an input to subsequent stages.
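To make those two breaks concrete, here is a hypothetical sketch of how a monorepo caller might wire host-filtering outputs into the next stage. The key names track the change, but the exact wiring and paths are assumptions:

```python
# Hypothetical wiring of host-filtering outputs into a later stage
# (keys and S3 paths are illustrative, not taken from either repo).

# Before this change:
inputs_before = {
    # Break 1: the deduplicated fasta is no longer passed downstream at all.
    "deduplicated_fasta": "s3://bucket/sample/dedup1.fa",
    # Break 2: this key was renamed.
    "cdhitdup_cluster_sizes": "s3://bucket/sample/cluster_sizes.tsv",
}

# After this change:
inputs_after = {
    "duplicate_cluster_sizes": "s3://bucket/sample/cluster_sizes.tsv",
}
```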
**Deployment**

We will need to do a coordinated deploy along with updating the version in the DB to run this. There may be a small amount of downtime (a few seconds).