Climate-CAFE / Climate-CAFE.github.io

CAFE Public Data Standard Operating Procedure Website for Dataverse and GitHub
https://climate-cafe.github.io/

Any recommendations on how to orchestrate multiple pipelines? #11

Closed. ran-codes closed this issue 4 months ago.

ran-codes commented 6 months ago

Hi, great resource. I was reading https://climate-cafe.github.io//template.html and I was wondering if CAFE has any recommendations for how an organization can orchestrate/track the pipelines that take place within a project such as CAFE?

Thanks

audiracmichelle commented 6 months ago

If I am understanding the question correctly, you are interested in learning about workflow languages that "orchestrate" the steps of a pipeline? Snakemake is a very powerful tool; you can read more about it in this paper: https://doi.org/10.12688/f1000research.29032.2

Let me know if this helps!

ran-codes commented 6 months ago

Thanks for the quick reply! Snakemake looks awesome; it reminds me a bit of the targets R package!

But I was looking for something a little different. Imagine a project that has multiple pipelines: is there a good tool to organize how those pipelines fit together? Let's say pipeline A generates some data that pipeline/manuscript B uses, and we want to track these dependencies. Our project uses R for a lot of these individual pipelines, but I'm looking for recommendations on how to manage multiple pipelines. Appreciate your insights = )

audiracmichelle commented 6 months ago

Can you provide details on the tooling in pipeline A? Is it containerized command-line calls to Rscript? Do the pipelines depend on each other's outputs?

ran-codes commented 6 months ago

We are still in the design phase of the project. We define a pipeline as a Quarto notebook (R) that takes inputs (files/APIs) and generates outputs (files) saved to storage locations (remote or shared drives).

Yes, theoretically the different notebooks/pipelines will depend on the outputs of upstream pipelines.
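For example (hypothetical file names, since nothing is built yet), one chain might look like this:

# pipeline A: pulls from an API and writes data/pipeline_a.csv
quarto render pipeline_a.qmd

# manuscript B: reads data/pipeline_a.csv and renders manuscript_b.html
quarto render manuscript_b.qmd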

audiracmichelle commented 6 months ago

Snakemake does exactly what you describe: pipeline A generates some data, pipeline/manuscript B uses it, and Snakemake tracks those dependencies for you.

Snakemake can manage a very complex directed acyclic graph (DAG) of steps, with each step generating a set of files that downstream steps depend on.

audiracmichelle commented 6 months ago

A Snakefile for this pattern looks roughly like the following (placeholder paths):

rule all:
    # the final targets; asking for these pulls in every upstream step
    input: "path/to/step_2/output"

rule step_1:
    output: "path/to/step_1/output"
    shell: "Rscript rule_step_1.R"

rule step_2:
    # runs only after step_1 has produced its output
    input: "path/to/step_1/output"
    output: "path/to/step_2/output"
    shell: "Rscript rule_step_2.R"
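As a rough sketch (hypothetical file names, and assuming the Quarto CLI is available), the Quarto notebooks you describe could slot in the same way, with the shared CSV acting as the dependency between pipeline A and manuscript B:

rule pipeline_a:
    output: "data/pipeline_a.csv"
    shell: "quarto render pipeline_a.qmd"

rule manuscript_b:
    input: "data/pipeline_a.csv"
    output: "manuscript_b.html"
    shell: "quarto render manuscript_b.qmd"

Once a Snakefile like this is in place, `snakemake --cores 1` runs whatever is out of date in dependency order, and `snakemake --dag | dot -Tpng > dag.png` draws the DAG of how the pipelines fit together (assuming Graphviz is installed).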

audiracmichelle commented 6 months ago

Other options are Airflow and Docker Compose. Hope this helps @ran-codes!

ran-codes commented 6 months ago

Yup. Thank you for your insights. Definitely will look into Snakemake!

Feel free to close! Great resource, btw. I am from Drexel U.; we are also a P20-funded climate center. Hope we get to work together sometime in the future.

audiracmichelle commented 5 months ago

I hope Snakemake is what you were looking for. Keep us posted, and we are absolutely interested in collaborations. Closing this issue now!