broadinstitute / cellprofiler-on-Terra

Run CellProfiler on Terra. Contains workflows that enable a full end-to-end Cell Painting pipeline.
BSD 3-Clause "New" or "Revised" License

Consider being agnostic to the backend (currently Google Cloud) #41

Open sjfleming opened 2 years ago

sjfleming commented 2 years ago

(Only if people actually want / need this. But I assume some people might. I think the Imaging Platform stores a lot of data on AWS.)

Supposedly Terra will be supporting multiple backends (GCP, AWS, Azure) in the near future. All of our "gsutil" commands (which kind of break the usual WDL logic) only work on GCP.

We should think about whether we can do everything strictly in WDL, without any gsutil commands. Or whether we can have separate sorts of "cloud file copying" commands for separate backends, calling the right ones where appropriate.
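A rough sketch of the second option, in Python rather than WDL, just to illustrate the dispatch idea. The function name `copy_command` is hypothetical and nothing like it exists in this repo. It assumes the backend can be inferred from the URI (`gs://` for GCP, `s3://` for AWS, an `https://<account>.blob.core.windows.net/...` URL for Azure) and that the matching CLI (`gsutil`, `aws`, or `azcopy`) is installed in the task's container:

```python
from typing import List
from urllib.parse import urlparse


def copy_command(src: str, dst: str) -> List[str]:
    """Build the CLI call that copies src to dst, chosen by the cloud URI.

    Hypothetical helper: assumes exactly one side of the copy is a cloud
    URI and that the other side is a local path.
    """
    # Whichever argument looks like a URI decides the backend.
    cloud_uri = src if "://" in src else dst
    parsed = urlparse(cloud_uri)

    if parsed.scheme == "gs":
        return ["gsutil", "cp", src, dst]
    if parsed.scheme == "s3":
        return ["aws", "s3", "cp", src, dst]
    if parsed.scheme == "https" and parsed.netloc.endswith(".blob.core.windows.net"):
        return ["azcopy", "copy", src, dst]
    raise ValueError(f"No known cloud backend for: {cloud_uri!r}")


if __name__ == "__main__":
    # Bucket and file names below are made up for illustration.
    print(copy_command("gs://my-bucket/plate1/img.tiff", "./img.tiff"))
    print(copy_command("./out.csv", "s3://my-bucket/results/out.csv"))
```

A task's command block could then run whatever this returns instead of hard-coding `gsutil`, so the existing GCP path keeps using exactly the same command it does today.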

lynnlangit commented 1 year ago

If there is a need for this on AWS or Azure, I would be interested in contributing to this work.

sjfleming commented 1 year ago

Interesting that you should say that @lynnlangit ! We recently had a request to help make this work on AWS (from an AWS solutions architect working on Amazon Omics). We don't have many internal Broad users wanting this at the moment, but a lot of the Imaging Platform does their work on AWS (not using Terra or Cromwell). And, institutionally, there is currently a push within Broad's Data Sciences Platform to get workflows up and running on Azure, due to a collaboration with Microsoft.

So we would welcome any contribution you'd be interested in making!

I will mention though: we actively use the current Google backend to analyze data, so we want to ensure that part doesn't break / change too much... I think the best path forward is probably to "have separate sorts of 'cloud file copying' commands for separate backends," even though this is not the way WDL is supposed to work. But we are open to other opinions! (If we could write one set of WDLs that are really agnostic to the backend, that would be fantastic. The reason we didn't do that at the outset is that there are just so many individual input files - images - involved. There are several ways we could get around this though...)

I also don't think I have a way to test workflows on AWS personally. It would be easier for us to test (using Terra) workflows on Azure, since "Terra on Azure" is now live. I don't really know how I'd review PRs for something running on AWS until I can figure out how to test it...

@carmendv @deflaux

lynnlangit commented 1 year ago

@sjfleming - thanks for the info - fyi...

Given this - what is the next step on this project?

deflaux commented 1 year ago

It's great to hear that you would like to contribute @lynnlangit !

Regarding next steps:

deflaux commented 1 year ago

@lynnlangit we've completed:

  1. creating a branch for multi-cloud pull request contributions and testing
  2. choosing some specific data for testing and validating the results on GCP, along with corresponding inputs.json files for AWS and GCP.
  3. filing some GitHub issues with concrete suggestions for next steps for one possible way to make these workflows multi-cloud

Is there any other information we can provide to you at this time? Thank you!