NASA-PDS / nucleus

Nucleus is a software platform used to create workflows for the Planetary Data System (PDS).
https://nasa-pds.github.io/nucleus
Apache License 2.0

Minimize EFS use of Nucleus to reduce the cost and to avoid DataSync related complications #96

Closed ramesh-maddegoda closed 7 months ago

ramesh-maddegoda commented 8 months ago

💡 Description

At the moment, we use EFS in Nucleus instead of S3.

As a result, we introduced many components to determine whether we have received a PDS product completely. This increases our costs in addition to the EFS cost.

However, if the Nucleus DAGs can copy data from the S3 staging bucket to EFS at the time of DAG processing (and delete the data at the end of the DAG), then we can reduce our costs.

The proposed approach:

  1. When the DUM tool copies files to the S3 staging bucket, it creates S3 events.
  2. A Lambda will detect these S3 events and insert records into the database tables (the same tables that we already have) - no DataSync or EFS involved.
  3. Another Lambda (we already have this) will query the database, detect completed products, and trigger Nucleus DAGs. The challenge here is handling .fz files: there should be logic to consider a .fz file as part of a completed product.
  4. The DAG will then receive a batch (manifest) of product labels to be processed.
  5. The first task in the DAG (a new Docker container to be developed) will look at the manifest and copy files from the S3 staging bucket to EFS.
  6. The Nucleus DAG will read data from EFS and execute the workflow.
  7. At the end of the workflow, the files on the EFS will be deleted.
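The event-handling step (2) above could be sketched as a small Lambda handler. This is a hypothetical illustration, not the actual Nucleus code: the real database schema is not shown here, so the `insert_row` hook stands in for the real database write.

```python
# Hypothetical sketch of step 2: a Lambda that turns S3 ObjectCreated
# event records into product-file rows for the existing tables.

def extract_objects(event):
    """Pull (bucket, key) pairs out of an S3 event notification."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record.get("s3", {})
        bucket = s3.get("bucket", {}).get("name")
        key = s3.get("object", {}).get("key")
        if bucket and key:
            pairs.append((bucket, key))
    return pairs

def handler(event, context=None, insert_row=None):
    """Lambda entry point: record each new staging-bucket object.

    `insert_row` is a placeholder for the actual database insert
    performed by the deployed Lambda.
    """
    rows = extract_objects(event)
    if insert_row is not None:
        for bucket, key in rows:
            insert_row(bucket, key)
    return {"inserted": len(rows)}
```

No DataSync or EFS is involved at this point; the Lambda only reacts to bucket notifications and records what arrived.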

⚔️ Parent Epic / Related Tickets

tloubrieu-jpl commented 8 months ago

The Lambda has been updated to update the database from the S3 events. It remains to create Docker images to move the files from S3 to EFS.

tloubrieu-jpl commented 7 months ago

@ramesh-maddegoda is making progress on that task. He can now clean up the files on EFS. Ramesh is now trying to integrate funpack into a Docker container.

tloubrieu-jpl commented 7 months ago

The Docker container can use funpack, and Ramesh has integrated it into the Nucleus workflow.
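The funpack step inside the container task might be sketched like this. funpack is the CFITSIO decompressor for .fz files; everything else here (the directory walk, the injectable `run` hook used for testing) is a hypothetical illustration.

```python
# Hypothetical sketch: decompress all .fz files under a directory
# by shelling out to CFITSIO's funpack.
import subprocess
from pathlib import Path

def funpack_cmd(fz_path):
    """Build the funpack command for one .fz file."""
    # By default funpack writes <name> alongside <name>.fz
    return ["funpack", str(fz_path)]

def funpack_tree(root, run=subprocess.run):
    """Decompress every .fz file under root.

    `run` defaults to subprocess.run but is injectable for testing.
    """
    unpacked = []
    for fz in sorted(Path(root).rglob("*.fz")):
        run(funpack_cmd(fz), check=True)
        unpacked.append(fz.with_suffix(""))
    return unpacked
```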

The workflow worked well on 3 MESSENGER directories.

After cleaning the EFS, the directory structure is left in place, because we don't know whether something else is still writing to it.
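The cleanup behavior described here, removing files while leaving the directory tree intact, could be sketched as follows (the mount path and function name are hypothetical):

```python
# Hypothetical sketch: remove the regular files under an EFS path but
# keep the directory structure, since another process may still be
# writing into those directories.
from pathlib import Path

def clean_efs_files(root):
    """Delete files under root, keep directories; return count removed."""
    removed = 0
    for p in Path(root).rglob("*"):
        if p.is_file():
            p.unlink()
            removed += 1
    return removed
```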

Ramesh will now plug in Harvest, which will write to the SBN production OpenSearch on JPL AWS.

jordanpadams commented 7 months ago

Obsolete per architecture improvements