NASA-IMPACT / csdap-cumulus

SmallSat Cumulus Deployment

Add a script, capability, or process so that we can validate all data is backed up to ORCA #270

krisstanton closed this issue 3 weeks ago

krisstanton commented 9 months ago

Ref: https://docs.google.com/spreadsheets/d/1Xznva0Upb9W9bTqdAndV4pDcwjV8jRRYKFVmgq_9i1c/edit#gid=0
Ref: requirement 18.4, "The system shall validate all data is backed up to ORCA."

Come up with a process which does the following.

krisstanton commented 7 months ago

WIP Update (Background):

We have a manifest setup for Cumulus CBA PROD (5047) (the main CBA Prod account where data is ingested)
We have another manifest setup for Cumulus CBA DR PROD (1741) (the ORCA account where CBA Prod data is backed up)
We also have an MCP bucket where our 'ready for Airflow and then Cumulus Ingest' granules live.

Currently, the ORCA validation code is a Python script which does the following (a minimal sketch of the comparison step appears after this list):

    -Reads the manifest files for the current UTC date (or the most recent date available) to get the latest manifest
    -Compiles a list of the 'keys' (the file names)
    -Runs a function to generate a report from comparing these lists.
        -The following outputs are generated:
            'unique_to_list1': unique_to_list1,             # Files only found in the CBA Prod manifest
            'unique_to_list2': unique_to_list2,             # Files only found in the CBA Prod DR (ORCA) manifest
            'common_to_both': common_to_both,               # Files found in both places
            'count_unique_to_list1': len(unique_to_list1),  # Number of files only in the CBA Prod manifest
            'count_unique_to_list2': len(unique_to_list2),  # Number of files only in the CBA Prod DR (ORCA) manifest
            'count_common_to_both': len(common_to_both),    # Number of files in BOTH the CBA Prod and CBA Prod DR (ORCA) manifests
    -The entire output (including all the file lists) is saved as JSON in a local directory each time this utility is run.
        -Note: this .json file can be very large. For a small run, the file was about 1.6 GB.
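A minimal sketch of that comparison step, assuming the manifest keys are already loaded into plain Python lists (the function name, sample keys, and report filename here are illustrative, not the actual utility's):

    import json

    def compare_key_lists(list1, list2):
        """Compare two lists of S3 keys and report the differences and overlap."""
        set1, set2 = set(list1), set(list2)
        unique_to_list1 = sorted(set1 - set2)  # only in the CBA Prod manifest
        unique_to_list2 = sorted(set2 - set1)  # only in the CBA Prod DR (ORCA) manifest
        common_to_both = sorted(set1 & set2)   # present in both, i.e. backed up
        return {
            'unique_to_list1': unique_to_list1,
            'unique_to_list2': unique_to_list2,
            'common_to_both': common_to_both,
            'count_unique_to_list1': len(unique_to_list1),
            'count_unique_to_list2': len(unique_to_list2),
            'count_common_to_both': len(common_to_both),
        }

    if __name__ == '__main__':
        # Illustrative keys only; real runs read these from the S3 inventory manifests.
        prod_keys = ['collection-a/granule1.tif', 'collection-a/granule1.xml']
        orca_keys = ['collection-a/granule1.tif', 'thumbs/granule1.png']
        report = compare_key_lists(prod_keys, orca_keys)
        # Writing the full file lists is what makes the report so large;
        # the counts alone would be tiny.
        with open('orca_validation_report.json', 'w') as f:
            json.dump(report, f, indent=2)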

Here is what still needs to happen; these are the next steps:

    -Integrate this code into the CSDA Cumulus code base as a utility (so it's in our code repos and easy to use)
    -Add the ability to compare against the MCP bucket as well, so we can know whether any given file exists in 1, 2, or all 3 locations (see the sketch after this list)
    -Test to make sure the output is correct and acceptable for use (we will end up using this to determine when we can remove files from the NGAP accounts as we near completion of our migration ingests)
        -I may need to add an ignore list for certain prefixed subdirectories when gathering keys from the manifests
    -On an early test, I saw a large number of files (about 10%, which at this time is about 1.4 million files) that are ONLY found in the CBA DR (ORCA) manifest and not in the CBA Prod manifest.
        -I need to find out what those files are and why they are only in the ORCA bucket and not the normal PROD bucket.
    -The code needs to be cleaned up a bit and have extra, unused functions removed.
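A sketch of that three-way check, including the prefix ignore list (the function names, location labels, and example prefixes are assumptions for illustration):

    def filter_keys(keys, ignore_prefixes=()):
        """Drop keys under subdirectory prefixes we do not want to validate."""
        # str.startswith accepts a tuple of prefixes.
        return {k for k in keys if not k.startswith(tuple(ignore_prefixes))}

    def locate_files(prod_keys, orca_keys, mcp_keys, ignore_prefixes=()):
        """Map each key to the set of locations it appears in (1, 2, or all 3)."""
        sources = {
            'cba_prod': filter_keys(prod_keys, ignore_prefixes),
            'orca': filter_keys(orca_keys, ignore_prefixes),
            'mcp': filter_keys(mcp_keys, ignore_prefixes),
        }
        locations = {}
        for name, keys in sources.items():
            for key in keys:
                locations.setdefault(key, set()).add(name)
        return locations

    if __name__ == '__main__':
        locations = locate_files(
            prod_keys=['collection-a/granule1.tif'],
            orca_keys=['collection-a/granule1.tif', 'thumbs/granule1.png'],
            mcp_keys=['collection-a/granule1.tif'],
            ignore_prefixes=('logs/',),
        )
        # Keys present in all 3 locations are the safest cleanup candidates.
        print([k for k, locs in locations.items() if len(locs) == 3])

Keys that show up in all three locations would mark the point where removing a file from the NGAP accounts becomes safe.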
krisstanton commented 6 months ago

WIP Update: the file discrepancy is because the thumbs are stored in the CBA Public bucket.

It turns out the thumb files are backed up to the ORCA Archive bucket from the public bucket. The result is that files from both the CBA Protected and CBA Public buckets get merged into the ORCA Archive bucket (with their key paths preserved, so files from two separate buckets end up in the same 'directory'). This means I need to combine the two manifests, CBA Protected and CBA Public, into a single list in order to get an accurate comparison against the ORCA bucket.

I made a new manifest setup on the CBA Prod side for getting the thumbs manifest list (CBA Public bucket) and merging it with the manifest that represents the other files (CBA Protected bucket); the merge step is sketched below.
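A minimal sketch of that merge, assuming the two source manifests are already loaded as key lists (names are illustrative):

    def merged_source_keys(protected_keys, public_keys):
        """Union of keys from both source buckets.

        ORCA preserves key paths from the CBA Protected and CBA Public
        buckets, so the union of the two source manifests is the correct
        baseline to compare against the ORCA archive manifest.
        """
        return set(protected_keys) | set(public_keys)

    # e.g. compare_key_lists(merged_source_keys(protected_keys, public_keys), orca_keys)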

Next on the list is to clean up the code a bit and put it into the Cumulus repo on a new branch.

krisstanton commented 3 weeks ago

Closing this as done. A lot of work was done on this ticket for the one-off file deletions, but we are now changing course to make this process continuous. Some of that work is reusable, but enough of it is new that we will open a new ticket for it.