NASA-IMPACT / veda-data-airflow

Airflow implementation of ingest pipeline for VEDA STAC data

Pass discovery output to next step via s3 #98

Closed. slesaad closed this issue 8 months ago.

slesaad commented 2 years ago

Motivation

The maximum size of the payload that can be passed between states in a Step Function is 256 KB. When the s3-discovery lambda discovers too many items, the payload exceeds that threshold and the state machine execution is cancelled.
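
For concreteness, here is a minimal sketch of the constraint; the 256 KB figure is the documented Step Functions limit, while the helper itself is hypothetical:

    import json

    # Documented AWS Step Functions limit on inter-state payload size.
    MAX_PAYLOAD_BYTES = 256 * 1024

    def payload_fits(payload: dict) -> bool:
        # Anything larger cannot be returned to the state machine directly;
        # the execution fails with a States.DataLimitExceeded error.
        return len(json.dumps(payload).encode("utf-8")) <= MAX_PAYLOAD_BYTES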

Workaround

The workaround we've been using up until now is to use the filename_regex key to divide the total set of items into chunks and run a separate workflow for each chunk (a helper that mechanizes this is sketched after the examples). E.g.:

    {
        "collection": "co2-diff",
        "prefix": "co2-diff/",
        "bucket": "veda-data-store-staging",
        "filename_regex": "^(.*)2015.*.tif$",
        "discovery": "s3"
    },
    {
        "collection": "co2-diff",
        "prefix": "co2-diff/",
        "bucket": "veda-data-store-staging",
        "filename_regex": "^(.*)2016.*.tif$",
        "discovery": "s3"
    },

instead of

    {
        "collection": "co2-diff",
        "prefix": "co2-diff/",
        "bucket": "veda-data-store-staging",
        "filename_regex": "^*.tif$",
        "discovery": "s3"
    },
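
A hypothetical helper that generates these per-year configs; none of the names below are part of the repo, it simply mechanizes the workaround above:

    def chunked_configs(collection, bucket, prefix, years):
        # One discovery config per year, so each workflow run discovers
        # fewer items and its payload stays under the 256 KB limit.
        return [
            {
                "collection": collection,
                "prefix": prefix,
                "bucket": bucket,
                "filename_regex": rf"^(.*){year}.*\.tif$",
                "discovery": "s3",
            }
            for year in years
        ]

    # e.g. chunked_configs("co2-diff", "veda-data-store-staging", "co2-diff/", range(2015, 2017))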

Solution

Rather than passing the payload directly to the next state, we could write the payload to an S3 bucket, pass only the URL of the object, and have the next state read the object back from S3.
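
A minimal sketch of that indirection, assuming boto3; the bucket, key scheme, and function names are illustrative, not an actual implementation:

    import json
    import uuid

    import boto3

    s3 = boto3.client("s3")
    PAYLOAD_BUCKET = "veda-pipeline-payloads"  # hypothetical bucket

    def hand_off(payload: dict) -> dict:
        # Producer state: write the full payload to S3 and return only a
        # small pointer, which always fits within the 256 KB limit.
        key = f"step-function-payloads/{uuid.uuid4()}.json"
        s3.put_object(Bucket=PAYLOAD_BUCKET, Key=key, Body=json.dumps(payload))
        return {"payload_url": f"s3://{PAYLOAD_BUCKET}/{key}"}

    def pick_up(event: dict) -> dict:
        # Consumer state: resolve the pointer back into the full payload.
        bucket, key = event["payload_url"][len("s3://"):].split("/", 1)
        return json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())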

gadomski commented 1 year ago

From this line https://github.com/NASA-IMPACT/veda-data-airflow/blob/4c96eb27b112521be9a0401cf5316e1f6fb11837/dags/veda_data_pipeline/utils/s3_discovery.py#L252 I think this issue is still valid, so I'm going to transfer it to veda-data-airflow. @moradology, please correct me if I'm wrong.

ividito commented 1 year ago

This should be less of an issue now, but it's still worth considering for refactors. In the new Airflow pipeline, the max XCom size is 1 GB. I don't think we've run into this issue since making the switch.

slesaad commented 1 year ago

In Airflow, the payload to the next step is already passed via a file in S3!
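
For context, the standard way to get this behavior in Airflow is a custom XCom backend. A minimal sketch assuming Airflow 2.x, boto3, and a hypothetical bucket, not necessarily how this repo wires it up:

    import json
    import uuid

    import boto3
    from airflow.models.xcom import BaseXCom

    XCOM_BUCKET = "veda-airflow-xcom"  # hypothetical bucket
    PREFIX = "s3-xcom://"

    class S3XComBackend(BaseXCom):
        # Enabled via the `xcom_backend` setting in airflow.cfg. Values are
        # written to S3; only a short reference string lands in the metadata DB.

        @staticmethod
        def serialize_value(value, **kwargs):
            key = f"xcom/{uuid.uuid4()}.json"
            boto3.client("s3").put_object(
                Bucket=XCOM_BUCKET, Key=key, Body=json.dumps(value)
            )
            return BaseXCom.serialize_value(PREFIX + key)

        @staticmethod
        def deserialize_value(result):
            value = BaseXCom.deserialize_value(result)
            if isinstance(value, str) and value.startswith(PREFIX):
                obj = boto3.client("s3").get_object(
                    Bucket=XCOM_BUCKET, Key=value[len(PREFIX):]
                )
                return json.loads(obj["Body"].read())
            return value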