Update: The scope of this ticket has been changed to provide manifest files from NGAP to MCP for the purpose of setting up continuous deletion.
General Background:
After Cumulus runs, we have the following:
Files in the CBA PROD bucket(s) (Protected and Public)
Files in the CBA PROD DR bucket (Archive). Note: these files are sourced from BOTH the Protected and Public CBA PROD buckets and are merged together in this single bucket.
CMR data published on Earthdata
External metrics confirmations
S3 bucket directory and file listings can be obtained from the manifest lists that S3 generates.
This ticket deals with getting those manifest files from NGAP over to MCP. All processing of this information will be done on MCP by an Archive DAG.
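As a rough illustration of what the Archive DAG would have to read, here is a minimal sketch that lists the data files referenced by a manifest. It assumes the manifests are standard S3 Inventory `manifest.json` files (with `sourceBucket` and a `files` array of `key` entries); the bucket name and keys in the example are hypothetical.

```python
import json


def inventory_data_files(manifest_text):
    """Return (source_bucket, data_file_keys) from an S3 Inventory manifest.json.

    Assumes the standard S3 Inventory layout: a top-level "sourceBucket"
    and a "files" list whose entries each carry a "key".
    """
    manifest = json.loads(manifest_text)
    keys = [entry["key"] for entry in manifest.get("files", [])]
    return manifest.get("sourceBucket"), keys


# Hypothetical manifest content, for illustration only.
EXAMPLE_MANIFEST = json.dumps(
    {
        "sourceBucket": "example-cba-prod-protected",
        "fileFormat": "CSV",
        "fileSchema": "Bucket, Key, Size",
        "files": [
            {"key": "inventory/data/part-0001.csv.gz", "size": 1024},
            {"key": "inventory/data/part-0002.csv.gz", "size": 2048},
        ],
    }
)

if __name__ == "__main__":
    bucket, keys = inventory_data_files(EXAMPLE_MANIFEST)
    print(bucket, keys)
```

The data files themselves (typically gzipped CSV) would still need to be fetched and parsed; that step is left out here.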
Tasks:
[ ] Send the manifest files from NGAP (CBA PROD) to the MCP bucket ss-s3-inventories-uswest2, in the correct subdirectory
  [ ] Make sure the lifecycle is set to auto-remove these files after 3 months
[ ] Send the manifest files from ORCA (CBA PROD DR) to the MCP bucket ss-s3-inventories-uswest2, in the correct subdirectory
  [ ] Make sure the lifecycle is set to auto-remove these files after 3 months
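One possible way to express the lifecycle tasks above is an S3 lifecycle expiration rule scoped to each subdirectory. The sketch below only builds the rule; the prefix `ngap/cba-prod/` is a hypothetical placeholder for "the correct subdirectory", and 90 days is an assumed approximation of "3 months". The `s3_client` argument is expected to be a boto3 S3 client created by the caller.

```python
def build_expiration_config(prefix, days=90):
    """Build a lifecycle configuration that expires objects under `prefix`.

    `days=90` is an assumption standing in for "3 months".
    """
    rule_id = "expire-{}-after-{}d".format(prefix.strip("/").replace("/", "-"), days)
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


def apply_lifecycle(s3_client, bucket, config):
    """Apply the configuration using boto3's put_bucket_lifecycle_configuration.

    This call replaces the bucket's whole lifecycle configuration, so any
    existing rules would have to be merged into `config` first.
    """
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )


if __name__ == "__main__":
    # Hypothetical prefix; the real subdirectory names are not specified here.
    print(build_expiration_config("ngap/cba-prod/"))
```

Because the rule is filtered by prefix, the NGAP and ORCA subdirectories could each get their own rule inside the same `Rules` list.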
Everything below is old / archived / draft material written while coming up with an approach to solving this item.
Preserving the older description again (immediately below) for this ticket so that we ensure nothing is missed.
This ticket is being repurposed to be in line with using our Data Management System (DMS) in order to automate the process of doing continuous deletions.
Please edit the steps below to include new items discussed in the approach.
Also remove or modify the steps below to reflect any changes that come from those discussions.
Steps (DRAFT)
[ ] Create a Python process, which can be easily converted or adapted into a DAG, that does the following
  [ ] Examine the data from the perspective of specific granules
    [ ] Make a CMR request for a specific granule
    [ ] Validate that the S3 paths actually point to CBA PROD (remember, some of the published data still points to OLD NGAP)
    [ ] Make requests to read data in the CBA PROD buckets and the CBA DR PROD (ORCA) buckets
    [ ] Use the data returned to determine whether an individual granule (or batch of granules) exists in all 3 places
  [ ] Process manifests (possible scaling problem here??)
    [ ] Read a batch of files from CBA PROD's manifest
    [ ] Verify those granules (collections of files) exist in ORCA (CBA PROD DR account)
    [ ] Verify those granules are published in the CMR record with correct paths
Note: Some of this code already exists, but it may need to be set up to process in batches (so a new controlling structure in the code may be needed).
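The draft steps above could be prototyped roughly as follows. This is only a sketch under stated assumptions: the bucket names are hypothetical, granules are represented by plain ID strings, and the three "places" (CBA PROD, ORCA, CMR) are passed in as already-loaded sets rather than fetched from real services.

```python
from itertools import islice
from urllib.parse import urlparse

# Hypothetical bucket names; the real CBA PROD bucket names are not given here.
CBA_PROD_BUCKETS = {"example-cba-prod-protected", "example-cba-prod-public"}


def points_to_cba_prod(s3_url, allowed_buckets=CBA_PROD_BUCKETS):
    """True if an s3:// URL targets one of the allowed CBA PROD buckets."""
    parsed = urlparse(s3_url)
    return parsed.scheme == "s3" and parsed.netloc in allowed_buckets


def in_batches(items, size):
    """Yield lists of at most `size` items, so manifests need not fit in memory."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def verify_batch(granule_ids, in_cba_prod, in_orca, in_cmr):
    """Split a batch into triple-verified granules and granules missing somewhere.

    Returns (verified_ids, missing), where `missing` maps a granule ID to the
    list of places it was not found in.
    """
    places = (("CBA_PROD", in_cba_prod), ("ORCA", in_orca), ("CMR", in_cmr))
    verified, missing = [], {}
    for granule_id in granule_ids:
        absent = [name for name, present in places if granule_id not in present]
        if absent:
            missing[granule_id] = absent
        else:
            verified.append(granule_id)
    return verified, missing


if __name__ == "__main__":
    everything = ["g1", "g2", "g3", "g4", "g5"]
    for batch in in_batches(everything, 2):
        print(verify_batch(batch, set(everything), {"g1", "g3", "g5"}, {"g1", "g2", "g5"}))
```

Processing in fixed-size batches is one way to address the scaling concern noted in the steps, since only one batch of granule IDs is held and checked at a time.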
Preserving the older description (Immediately Below) for this ticket so that readily developed tools for this task can be more easily found and referenced
OLD Name: MCP Manifest Analysis: Utility to generate Lists: 'SAFE_TO_DELETE' and 'STILL_NEED_TO_INGEST'
OLD DESCRIPTION BELOW
MCP Manifest Lists Generation
Update: Important Note: We will be deleting triple verified (Earthdata, CBA and CBA ORCA) files from NGAP FIRST, before we delete anything from MCP.
This is a utility that runs AFTER the ORCA Validation Utility AND the NGAP List Generator (this is part 3 of 3 of the Cumulus Pipeline Manifest utilities).
The purpose of this is to generate two lists, SAFE_TO_DELETE and STILL_NEED_TO_INGEST, for both MCP and NGAP.
This is done by examining the MCP Maxar Delivery Bucket Manifest and comparing it with the outputs from the ORCA Validation Utility (https://github.com/NASA-IMPACT/csdap-cumulus/issues/270).
Note: The file names should be the same but the paths to them might be different (I believe the discrepancy is between MCP Maxar Delivery Bucket Paths and the NGAP/CBA Bucket Paths). We may need some additional logic in the utility that compares only parts of the file path (instead of the entire file path) from different manifests to generate these lists.
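A minimal sketch of the partial-path comparison described above, assuming that matching on the trailing path components (by default just the file name) is enough to pair entries across manifests. The function names and example keys are hypothetical and are not taken from the existing utilities.

```python
from pathlib import PurePosixPath


def path_key(object_key, parts=1):
    """Comparison key built from the last `parts` components of an object key."""
    return "/".join(PurePosixPath(object_key).parts[-parts:])


def build_lists(mcp_keys, verified_keys, parts=1):
    """Partition MCP manifest keys by whether a verified key shares their path_key.

    `verified_keys` stands in for the output of the ORCA Validation Utility.
    """
    verified = {path_key(key, parts) for key in verified_keys}
    safe_to_delete, still_need_to_ingest = [], []
    for key in mcp_keys:
        if path_key(key, parts) in verified:
            safe_to_delete.append(key)
        else:
            still_need_to_ingest.append(key)
    return {
        "SAFE_TO_DELETE": safe_to_delete,
        "STILL_NEED_TO_INGEST": still_need_to_ingest,
    }


if __name__ == "__main__":
    # Hypothetical keys, for illustration only.
    mcp = ["delivery/2020/a.tif", "delivery/2020/b.tif"]
    verified = ["collection/granule-1/a.tif"]
    print(build_lists(mcp, verified))
```

Matching on the file name alone would produce false matches if the same name can appear under different prefixes; raising `parts` to include one or more parent directories is one way to tighten the comparison.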
Files in the MCP: STILL_NEED_TO_INGEST list here will require a run through Airflow (restore and convert XML to CMR) and then Cumulus.
Files in the NGAP: STILL_NEED_TO_INGEST list here would only require a run through Airflow from the MCP bucket IF the CMR record is missing (there were a small number of these cases). The default case for these files would be to just run the Cumulus ingest rule that covers these items.
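The routing described in the two items above can be written down as a small decision function. This is only a sketch; the returned labels are hypothetical names for the paths described, not identifiers of existing DAGs or rules.

```python
def reingest_route(source_list, has_cmr_record=True):
    """Pick the re-ingest path for a file on a STILL_NEED_TO_INGEST list.

    source_list: "MCP" or "NGAP", naming which list the file came from.
    has_cmr_record: whether a CMR record already exists (only used for NGAP).
    """
    if source_list == "MCP":
        # Restore and convert XML to CMR in Airflow, then ingest with Cumulus.
        return "AIRFLOW_THEN_CUMULUS"
    if source_list == "NGAP":
        if has_cmr_record:
            # Default case: just run the Cumulus ingest rule.
            return "CUMULUS_INGEST_RULE"
        # Rare case: the CMR record is missing, so go back through Airflow.
        return "AIRFLOW_FROM_MCP_BUCKET"
    raise ValueError("unknown list source: {!r}".format(source_list))


if __name__ == "__main__":
    print(reingest_route("MCP"))
    print(reingest_route("NGAP", has_cmr_record=False))
```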
Update: Important Note (adding this to description as well) - We will be deleting triple verified (Earthdata, CBA and CBA ORCA) files from NGAP FIRST, before we delete anything from MCP.
The scope of this ticket will be limited to prototyping work immediately following an agreement on how we are approaching this problem.
Reference to Scoping Ticket: https://github.com/NASA-IMPACT/csdap-cumulus/issues/367
Reference to Milestone: https://github.com/NASA-IMPACT/csda-project/issues/651
The functionality here is very similar to that found in https://github.com/NASA-IMPACT/csdap-cumulus/issues/319