NASA-PDS / nucleus

Nucleus is a software platform used to create workflows for the Planetary Data System (PDS).
https://nasa-pds.github.io/nucleus
Apache License 2.0

Analyze best way for DAG steps to share data #45

Closed tloubrieu-jpl closed 1 year ago

tloubrieu-jpl commented 1 year ago

💡 Description

Options are EFS, S3, other databases

That could be managed as:

A good place to start is to get inputs from PODAAC on how they do that in Cumulus.

tloubrieu-jpl commented 1 year ago

Airflow has a feature called XCom to share data between steps. @ramesh-maddegoda is investigating that.
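For illustration, here is a minimal sketch of how two steps could hand a value over through XCom. It uses the `xcom_push` and `xcom_pull` methods of Airflow's task instance; the task names `harvest` and `validate`, the key `product_path`, and the S3 URL are hypothetical, and `FakeTaskInstance` is only an in-memory stand-in so the sketch runs without an Airflow installation. Since XComs are intended for small values, the sketch passes a reference to the data rather than the data itself.

```python
# Hypothetical task callables. In Airflow 2 each would be wrapped in a
# PythonOperator, which passes the running TaskInstance in as `ti`.

def harvest(ti):
    """Upstream step: publish a small reference, not the data itself."""
    product_path = "s3://example-bucket/staging/product.xml"  # illustrative value
    ti.xcom_push(key="product_path", value=product_path)


def validate(ti):
    """Downstream step: read the reference pushed by the `harvest` task."""
    product_path = ti.xcom_pull(task_ids="harvest", key="product_path")
    return f"validating {product_path}"


class FakeTaskInstance:
    """In-memory stand-in for Airflow's TaskInstance (assumption, local use only)."""

    def __init__(self, task_id, store):
        self.task_id = task_id
        self._store = store  # shared dict playing the role of the metadata DB

    def xcom_push(self, key, value):
        self._store[(self.task_id, key)] = value

    def xcom_pull(self, task_ids, key):
        return self._store.get((task_ids, key))


# Simulate the two tasks running one after the other.
store = {}
harvest(FakeTaskInstance("harvest", store))
message = validate(FakeTaskInstance("validate", store))
```

In a real DAG the two callables would be registered as tasks with `task_id="harvest"` and `task_id="validate"`, and Airflow would supply the real task instance.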

ramesh-maddegoda commented 1 year ago

Evaluated the following options to share data between Airflow tasks:

  1. EFS
  2. S3
  3. Databases
  4. Airflow XComs (https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/xcoms.html)

EFS:

Pros:

Cons:

S3:

Pros:

Cons:

Databases:

Pros: Higher performance compared to S3. Other features of databases such as transaction management, queries and security can be easily utilized.

Cons:

Airflow XComs:

Pros:

Cons:

Recommendations:

Considering the above pros and cons of each option, we can make the following recommendations:

References:

ramesh-maddegoda commented 1 year ago

Additional note on Cumulus:

Had a chat with a Cumulus user and learned that they use S3 buckets to share data between tasks in a workflow. At the end of each task, the data is uploaded to an S3 bucket, and the next task then downloads the data from the same S3 bucket.
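A rough sketch of that upload-then-download hand-off, not Cumulus code: the helper names `publish_output` and `fetch_input`, the bucket `nucleus-staging`, and the key `run-1/output.txt` are all made up for illustration. The helpers call `upload_file` and `download_file` with the same argument order as the boto3 S3 client, so a real `boto3.client("s3")` could presumably be passed in place of the `InMemoryS3` stand-in used here to keep the example runnable offline.

```python
import os
import tempfile


def publish_output(s3_client, bucket, key, local_path):
    """End of a task: upload the result so the next task can find it."""
    s3_client.upload_file(local_path, bucket, key)


def fetch_input(s3_client, bucket, key, local_path):
    """Start of the next task: download what the previous task uploaded."""
    s3_client.download_file(bucket, key, local_path)
    return local_path


class InMemoryS3:
    """Stand-in exposing the two client methods used above (assumption)."""

    def __init__(self):
        self.objects = {}

    def upload_file(self, filename, bucket, key):
        with open(filename, "rb") as fh:
            self.objects[(bucket, key)] = fh.read()

    def download_file(self, bucket, key, filename):
        with open(filename, "wb") as fh:
            fh.write(self.objects[(bucket, key)])


def run_demo():
    """Run task A (producer) and task B (consumer) against the stand-in."""
    client = InMemoryS3()
    with tempfile.TemporaryDirectory() as workdir:
        # Task A writes its output locally, then uploads it.
        produced = os.path.join(workdir, "task_a_output.txt")
        with open(produced, "w") as fh:
            fh.write("label data")
        publish_output(client, "nucleus-staging", "run-1/output.txt", produced)

        # Task B downloads the same object before doing its own work.
        received = os.path.join(workdir, "task_b_input.txt")
        fetch_input(client, "nucleus-staging", "run-1/output.txt", received)
        with open(received) as fh:
            return fh.read()
```

Keeping the bucket and key as explicit arguments means the two tasks only need to agree on a naming scheme for each run.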

tloubrieu-jpl commented 1 year ago

Agreed during the breakout to move forward with the S3fs solution.