NASA-IMPACT / csdap-cumulus

SmallSat Cumulus Deployment

Set up Daily Manifests for Post Ingest, Continuous, Deletes of MAXAR MCP Delivery Bucket Files #328

Open krisstanton opened 6 months ago

krisstanton commented 6 months ago

Update: The scope of this ticket has been changed to provide manifest files from NGAP to MCP for the purpose of setting up continuous deletion.

General Background: After Cumulus runs, we have the following.

This ticket deals with getting those manifest files from NGAP over to MCP. All processing of this information will be done on MCP by an Archive DAG.

Tasks:

Everything below is old / archived / draft while coming up with an approach to solving this item


Preserving the older description again (Immediately Below) for this ticket so that we ensure nothing is missed


This ticket is being repurposed to be in line with using our Data Management System (DMS) in order to automate the process of doing continuous deletions.

The scope of this ticket will be limited to prototyping work immediately following an agreement on how we are approaching this problem.

Reference to Scoping Ticket: https://github.com/NASA-IMPACT/csdap-cumulus/issues/367

Reference to Milestone: https://github.com/NASA-IMPACT/csda-project/issues/651

Please Edit the below steps to include new Items discussed in the approach. Also remove or modify below steps to reflect any changes that come from those discussions.

Steps (DRAFT)


Preserving the older description (Immediately Below) for this ticket so that readily developed tools for this task can be more easily found and referenced


OLD Name: MCP Manifest Analysis: Utility to generate Lists: 'SAFE_TO_DELETE' and 'STILL_NEED_TO_INGEST'

OLD DESCRIPTION BELOW

MCP Manifest Lists Generation

Update: Important Note: We will be deleting triple verified (Earthdata, CBA and CBA ORCA) files from NGAP FIRST, before we delete anything from MCP.

This is a utility that runs AFTER the ORCA Validation Utility AND the NGAP List Generator (this is part 3 of 3 of the Cumulus Pipeline Manifest utilities).

The functionality here is very similar to that found in https://github.com/NASA-IMPACT/csdap-cumulus/issues/319

The purpose of this utility is to generate two lists for MCP: SAFE_TO_DELETE and STILL_NEED_TO_INGEST. This is done by examining the MCP Maxar Delivery Bucket Manifest and comparing it with the outputs from the ORCA Validation Utility (https://github.com/NASA-IMPACT/csdap-cumulus/issues/270).

Note: The file names should be the same but the paths to them might be different (I believe the discrepancy is between MCP Maxar Delivery Bucket Paths and the NGAP/CBA Bucket Paths). We may need some additional logic in the utility that compares only parts of the file path (instead of the entire file path) from different manifests to generate these lists.
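As a rough illustration of the partial-path comparison described above, here is a minimal Python sketch. It is not the existing utility: the function names (`file_key`, `build_lists`), the `parts` parameter, and the example paths are all hypothetical, and it assumes each manifest has already been reduced to a plain list of object keys. It compares only the trailing components of each path, so differing bucket prefixes do not prevent a match.

```python
from pathlib import PurePosixPath


def file_key(path: str, parts: int = 1) -> str:
    """Return the last `parts` components of a path as the comparison key.

    parts=1 compares file names only; larger values also compare
    parent directories. (Hypothetical helper, not from the ticket.)
    """
    if parts < 1:
        raise ValueError("parts must be at least 1")
    components = PurePosixPath(path).parts
    return "/".join(components[-parts:])


def build_lists(mcp_manifest, validated_paths, parts=1):
    """Split MCP manifest entries into (safe_to_delete, still_need_to_ingest).

    An MCP entry counts as safe to delete when its key matches the key of
    any path in the validated list (assumed here to be the output of the
    ORCA validation step).
    """
    validated_keys = {file_key(p, parts) for p in validated_paths}
    safe_to_delete = []
    still_need_to_ingest = []
    for path in mcp_manifest:
        if file_key(path, parts) in validated_keys:
            safe_to_delete.append(path)
        else:
            still_need_to_ingest.append(path)
    return safe_to_delete, still_need_to_ingest


if __name__ == "__main__":
    # Made-up paths for illustration only.
    mcp = ["delivery/2023/a.ntf", "delivery/2023/b.ntf"]
    validated = ["archive/WV03/a.ntf"]
    safe, need = build_lists(mcp, validated)
    print("SAFE_TO_DELETE:", safe)
    print("STILL_NEED_TO_INGEST:", need)
```

Matching on file name alone would treat two different files with the same name in different directories as the same file, so a larger `parts` value may be needed if names are not unique across the delivery bucket.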

Files in the MCP: STILL_NEED_TO_INGEST list here will require a run through Airflow (restore and convert XML to CMR) and then Cumulus.

Files in the NGAP: STILL_NEED_TO_INGEST list here would only require a run through Airflow from the MCP bucket IF the CMR record is missing (there were a small number of these cases). The default case for these files would be to just run the Cumulus ingest rule that covers these items.
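The routing rules above could be expressed roughly as follows. This is only a sketch of the decision logic as described in this ticket; the `Route` enum, the `route_for` function, and the `has_cmr_record` flag are hypothetical names, and how the presence of a CMR record would actually be determined is not covered here.

```python
from enum import Enum


class Route(Enum):
    """Hypothetical labels for the two re-ingest paths described above."""
    AIRFLOW_THEN_CUMULUS = "airflow_then_cumulus"
    CUMULUS_ONLY = "cumulus_only"


def route_for(source: str, has_cmr_record: bool = True) -> Route:
    """Pick the re-ingest path for a file on a STILL_NEED_TO_INGEST list.

    source: "MCP" or "NGAP", i.e. which list the file came from.
    has_cmr_record: only consulted for NGAP files.
    """
    if source == "MCP":
        # MCP files always go through Airflow first, then Cumulus.
        return Route.AIRFLOW_THEN_CUMULUS
    if source == "NGAP":
        # Default case: the Cumulus ingest rule alone. Airflow is only
        # needed when the CMR record is missing.
        if has_cmr_record:
            return Route.CUMULUS_ONLY
        return Route.AIRFLOW_THEN_CUMULUS
    raise ValueError(f"unknown source: {source!r}")
```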

krisstanton commented 3 months ago

Update: Important Note (adding this to description as well) - We will be deleting triple verified (Earthdata, CBA and CBA ORCA) files from NGAP FIRST, before we delete anything from MCP.