Closed sjfleming closed 2 years ago
@carmendv
I agree. I think it is best to have two folders in the base directory: `utils` and `pipelines`.
Some thoughts about the `pipelines` folder:

- `cellpainting`: yes, and it is accurate to call it that :)
- `mining`: question: right now we have the workflow and task code in the same WDL. Do you suggest putting the task in the utils WDL, or having more than one utils WDL? We could have a separate utils WDL just for cytomining, since it is a beefy one.
- `cellprofiler`: would this be the same WDL as `cpd_analysis_pipeline.wdl`? We can create a separate, generic one that accepts the same inputs as cellprofiler does from the command line. However, I would call it `cellprofiler_distributed`, since I don't see an easy way to make the scatter optional.
- `cellprofiler_singleVM`: same as `cellprofiler`, but on a single VM.
@sjfleming let me know.

Okay cool! We can also have a `scripts` folder I guess, to include the scripts you've written (and to make our docker build actually work correctly... I didn't realize those were not in the repo yet). We should also credit, in the README, anyone who wrote parts of those scripts, if some of it is from the Carpenter lab.

- `cellpainting`: great!
- `mining`: I am okay leaving the task in the same WDL, since the whole workflow just calls that one task and nothing else calls it. But we can do it either way! And if you think some other workflow might call that task too, then moving it to the utils WDL would be a good idea.
- `cellprofiler`: cool, yes, I think you're right... maybe optional scattering is kind of annoying. I think it can be done if we wanted, though, with an `if` statement. The WDL might look a bit less elegant, but it would all be in one place.
- `cellprofiler_singleVM`: I agree with doing this if we do not want to make scattering optional.
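
For reference, the `if`-based approach to optional scattering could look roughly like the sketch below. It is only an illustration under assumptions: the task `run_cellprofiler`, the inputs `image_groups` and `do_scatter`, the docker image, and the placeholder command are all made up here and are not the repo's actual code.

```wdl
version 1.0

# Hypothetical stand-in for the real CellProfiler task.
# The task name, inputs, and command are assumptions for illustration only.
task run_cellprofiler {
  input {
    Array[File] images
    File cppipe
  }
  command <<<
    echo "would run ~{cppipe} on ~{length(images)} images"
  >>>
  output {
    String log = read_string(stdout())
  }
  runtime {
    docker: "ubuntu:20.04"
  }
}

workflow cellprofiler {
  input {
    Array[Array[File]] image_groups  # one inner array per scatter shard (assumed layout)
    File cppipe
    Boolean do_scatter = true        # hypothetical switch between the two modes
  }

  # Distributed mode: one task call per group of images.
  if (do_scatter) {
    scatter (group in image_groups) {
      call run_cellprofiler as scattered {
        input: images = group, cppipe = cppipe
      }
    }
  }

  # Single-VM mode: one task call over all images at once.
  if (!do_scatter) {
    call run_cellprofiler as single_vm {
      input: images = flatten(image_groups), cppipe = cppipe
    }
  }

  output {
    # Values produced inside an `if` block are optional outside of it.
    Array[String]? scattered_logs = scattered.log
    String? single_log = single_vm.log
  }
}
```

WDL has no `else`, so the sketch uses two complementary `if` blocks, and every output that comes from inside them becomes optional. That is probably the "less elegant" part, but both modes do live in one file.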
Our current split into a "single VM" pipeline versus a "distributed" pipeline reflects our historical development, but I am not sure it is the best way to explain these pipelines to others.
Proposal:

- `utils`: this will have the WDL with the common utility tasks
- `pipelines`: `cellprofiler`, `cell_painting`, and `mining`
  - `cellprofiler`: this should still be distributed (scattered), but should just be the simple single task of running one cppipe file on data. (The WDL will call the utility task.)
  - `cell_painting`: this is the full pipeline for analyzing Cell Painting data, end to end, starting with images and ending with feature aggregation. This is what's currently the "distributed" pipeline. Is it accurate to call it a "Cell Painting pipeline"? (All these WDLs call sub-tasks from the utilities WDL.)
  - `mining`: it's probably worth keeping this around as a separate "pipeline" because maybe someone will want to use it as a stand-alone step (even if they didn't use the pipelines in this repo for the first part). (The WDL will call a utility task for cytomining: the same task called by the cytomining workflow that's part of the Cell Painting pipeline. No code duplication.)

Each of the subfolders (`cellprofiler`, `cell_painting`, and `mining`) will have its own README with documentation.

Any thoughts?
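
To make the "no code duplication" idea concrete, here is a minimal sketch of how a stand-alone pipeline WDL might import a shared task from the utilities WDL. The file paths, the task name `cytomining`, and its `features` input and `aggregated` output are assumptions for illustration, not the actual contents of the repo.

```wdl
version 1.0

# Hypothetical location: pipelines/mining/mining.wdl
# Assumes utils/utils.wdl defines a task named `cytomining` with an
# input `features` and an output `aggregated` (all assumed names).
import "../../utils/utils.wdl" as utils

workflow mining {
  input {
    Array[File] features
  }

  # Call the shared utility task rather than redefining it here.
  call utils.cytomining {
    input: features = features
  }

  output {
    File aggregated = cytomining.aggregated
  }
}
```

The full Cell Painting workflow would import the same file and call the same task, so the cytomining logic would be defined in exactly one place. How relative imports resolve depends on the execution engine, so the import path above may need adjusting.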