NASA-IMPACT / csdap-cumulus

SmallSat Cumulus Deployment

Make copies of PSScene3Band thumbnails #305

Closed: chuckwondo closed this issue 3 months ago

chuckwondo commented 1 year ago

The granules in the PSScene3Band collection contain thumbnail files with a _thumb suffix. Aaron wants us to duplicate each thumbnail file as a "browse" file, as we do with our Maxar granules: the thumbnail file goes to the public bucket and the "browse" file goes to the protected bucket (so the "browse" file appears in the list of downloadable files for a granule).

One possible option for meeting the acceptance criteria below is to use S3 Batch Operations with a Lambda function. This would also require generating an S3 Inventory for the objects in s3://csdap-cumulus-prod-protected/planet/PSScene3Band/ and filtering for *_thumb files (filtering might be possible as part of the batch job config, but if not, the logic can be added to the Lambda function). (Update: Verified that the S3 Inventory prefix parameter cannot handle wildcards or filtering.)
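
A minimal sketch of that inventory-filtering step, assuming a CSV-format inventory whose first two configured fields are Bucket and Key (the file names below are hypothetical, and note that keys in inventory CSVs are URL-encoded):

import csv
import gzip

# Hypothetical local copy of one S3 Inventory data file (delivered gzip-compressed).
INVENTORY_FILE = 'inventory-data.csv.gz'
MANIFEST_FILE = 'thumb-manifest.csv'

with gzip.open(INVENTORY_FILE, mode='rt', newline='') as src, \
        open(MANIFEST_FILE, mode='w', newline='') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for bucket, key, *_rest in reader:
        # Keep only the thumbnail objects under the Planet prefix.
        if key.startswith('planet/PSScene3Band/') and key.endswith('_thumb'):
            writer.writerow([bucket, key])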

UPDATE (Sprint 4, March 2024): The source bucket is the 5982 protected bucket.

Note: This is all prep for the migration ingest, which will use Cumulus to move files from 5982 to 5047.

SubTasks

Original Steps

Acceptance criteria

chuckwondo commented 9 months ago

Here's an example Lambda function that works with S3 Batch Operations: https://docs.aws.amazon.com/AmazonS3/latest/userguide/batch-ops-invoke-lambda.html
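
For context, the contract that page documents boils down to this: S3 Batch Operations invokes the function once per task in the manifest and expects the invocation IDs echoed back along with a per-task result code. A minimal skeleton:

def lambda_handler(event, context):
    # S3 Batch Operations passes one task per invocation.
    task = event['tasks'][0]
    # ... do the per-object work here ...
    return {
        'invocationSchemaVersion': event['invocationSchemaVersion'],
        'invocationId': event['invocationId'],
        'results': [
            {
                'taskId': task['taskId'],
                'resultCode': 'Succeeded',  # or 'TemporaryFailure' / 'PermanentFailure'
                'resultString': 'OK'
            }
        ]
    }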

krisstanton commented 3 months ago

WIP Update:

Number of total Planet files:                      39,921,082
Number of "_thumb" files in the above file list:    2,361,441

krisstanton commented 3 months ago

Here is the custom Lambda used in the S3 Batch copy operation. Note: I left all the commented-out code in place to show the original testing of this functionality.

import json
import boto3
import sys

s3 = boto3.client('s3')

# Setting this to True will create significant output in CloudWatch Logs
SETTING__IS_OUTPUT_DEBUG_MODE = False

# When we need to print the output to the logs to see what is going on
def debug_print(str_to_print="", obj_out=None):
    if SETTING__IS_OUTPUT_DEBUG_MODE:
        print(f'{str_to_print}:    {obj_out}')

def lambda_handler(event, context):
    run_did_fail = False
    err_info = ""
    success_info = ""
    try:
        #print('SETTING__IS_OUTPUT_DEBUG_MODE is set to False')

        #print(f'Starting a new Test run')
        debug_print(str_to_print="Starting a new Test run")

        # Looking at the Event Object:
        #print(f'Event Object:           {event}')
        debug_print(str_to_print="Event Object", obj_out=event)

        # Extract bucket name and key from the event
        #csv_line = event['Records'][0]['s3']['object']['key'] # This was the wrong JSON structure
        #csv_line = event['tasks'][0]['s3']['object']['key']
        #print(f'csv_line:               {csv_line}')
        #debug_print(str_to_print="csv_line", obj_out=csv_line)
        #
        s3BucketArn = event['tasks'][0]['s3BucketArn']
        s3Key       = event['tasks'][0]['s3Key']
        debug_print(str_to_print="s3BucketArn", obj_out=s3BucketArn)
        debug_print(str_to_print="s3Key", obj_out=s3Key)

        # Split the CSV line to get the bucket and key
        #src_bucket_name, src_key_path = csv_line.replace('"', '').split(',')
        src_bucket_name = s3BucketArn.split(':::')[1]
        src_key_path    = s3Key

        # Strip any extra spaces and quotes 
        src_bucket_name = src_bucket_name.strip()
        src_key_path = src_key_path.strip()
        #print(f'src_bucket_name:        {src_bucket_name}')
        #print(f'src_key_path:           {src_key_path}')
        debug_print(str_to_print="src_bucket_name", obj_out=src_bucket_name)
        debug_print(str_to_print="src_key_path", obj_out=src_key_path)

        # Extract the Source file name and rename to a .png in one step
        temp_filename_only  = src_key_path.split("/")[-1] + ".png" # Renaming to add .png on download.
        #print(f'temp_filename_only:     {temp_filename_only}')
        debug_print(str_to_print="temp_filename_only", obj_out=temp_filename_only)

        # For the Test, 
        # # The copies are ALL going in one place:
        # # # Example: csdap-cumulus-prod-internal/kstest/lambda_test/planet/PSScene3Band/20190917_203931_1039-thumbPUB.png

        # Create all the destination filenames 
        #
        # Destination Keypaths / Filenames
        # dest_1__file_name_only = temp_filename_only.replace("_thumb", "-thumbPUB")    # Converts "20190917_203931_1039_thumb.png" into "20190917_203931_1039-thumbPUB.png"    # This goes in the PUBLIC bucket
        # dest_2__file_name_only = temp_filename_only.replace("_thumb", "-thumb")   # Converts "20190917_203931_1039_thumb.png" into "20190917_203931_1039-thumb.png"       # This goes in the PROTECTED bucket
        # dest_3__file_name_only = temp_filename_only.replace("_thumb", "-BROWSE")      # Converts "20190917_203931_1039_thumb.png" into "20190917_203931_1039-BROWSE.png"      # This goes in the PROTECTED bucket
        dest_1__file_name_only = temp_filename_only.replace("_thumb", "-thumb")     # Converts "20190917_203931_1039_thumb.png" into "20190917_203931_1039-thumb.png"       # This goes in the PUBLIC bucket
        dest_2__file_name_only = temp_filename_only.replace("_thumb", "-thumb")     # Converts "20190917_203931_1039_thumb.png" into "20190917_203931_1039-thumb.png"       # This goes in the PROTECTED bucket
        dest_3__file_name_only = temp_filename_only.replace("_thumb", "-BROWSE")    # Converts "20190917_203931_1039_thumb.png" into "20190917_203931_1039-BROWSE.png"      # This goes in the PROTECTED bucket
        #
        # dest_1__root_dir_keypath = f'kstest/lambda_test/planet/PSScene3Band/'  # f'planet/PSScene3Band/'
        # dest_2__root_dir_keypath = f'kstest/lambda_test/planet/PSScene3Band/'  # f'planet/PSScene3Band/'
        # dest_3__root_dir_keypath = f'kstest/lambda_test/planet/PSScene3Band/'  # f'planet/PSScene3Band/'
        dest_1__root_dir_keypath = f'planet/PSScene3Band/'  # f'planet/PSScene3Band/'
        dest_2__root_dir_keypath = f'planet/PSScene3Band/'  # f'planet/PSScene3Band/'
        dest_3__root_dir_keypath = f'planet/PSScene3Band/'  # f'planet/PSScene3Band/'
        #
        # dest_1__bucket_name = f'csdap-cumulus-prod-internal'
        # dest_2__bucket_name = f'csdap-cumulus-prod-internal'
        # dest_3__bucket_name = f'csdap-cumulus-prod-internal'
        dest_1__bucket_name = f'csdap-cumulus-prod-public'
        dest_2__bucket_name = f'csdap-cumulus-prod-protected'
        dest_3__bucket_name = f'csdap-cumulus-prod-protected'
        #
        dest_1__full_keypath = f'{dest_1__root_dir_keypath}{dest_1__file_name_only}'
        dest_2__full_keypath = f'{dest_2__root_dir_keypath}{dest_2__file_name_only}'
        dest_3__full_keypath = f'{dest_3__root_dir_keypath}{dest_3__file_name_only}'
        #
        #
        # print(f'dest_1__file_name_only:     {dest_1__file_name_only}')
        # print(f'dest_2__file_name_only:     {dest_2__file_name_only}')
        # print(f'dest_3__file_name_only:     {dest_3__file_name_only}')
        debug_print(str_to_print="dest_1__file_name_only", obj_out=dest_1__file_name_only)
        debug_print(str_to_print="dest_2__file_name_only", obj_out=dest_2__file_name_only)
        debug_print(str_to_print="dest_3__file_name_only", obj_out=dest_3__file_name_only)
        #
        # print(f'dest_1__root_dir_keypath:   {dest_1__root_dir_keypath}')
        # print(f'dest_2__root_dir_keypath:   {dest_2__root_dir_keypath}')
        # print(f'dest_3__root_dir_keypath:   {dest_3__root_dir_keypath}')
        debug_print(str_to_print="dest_1__root_dir_keypath", obj_out=dest_1__root_dir_keypath)
        debug_print(str_to_print="dest_2__root_dir_keypath", obj_out=dest_2__root_dir_keypath)
        debug_print(str_to_print="dest_3__root_dir_keypath", obj_out=dest_3__root_dir_keypath)
        #
        # print(f'dest_1__bucket_name:        {dest_1__bucket_name}')
        # print(f'dest_2__bucket_name:        {dest_2__bucket_name}')
        # print(f'dest_3__bucket_name:        {dest_3__bucket_name}')
        debug_print(str_to_print="dest_1__bucket_name", obj_out=dest_1__bucket_name)
        debug_print(str_to_print="dest_2__bucket_name", obj_out=dest_2__bucket_name)
        debug_print(str_to_print="dest_3__bucket_name", obj_out=dest_3__bucket_name)
        #
        # print(f'dest_1__full_keypath:       {dest_1__full_keypath}')
        # print(f'dest_2__full_keypath:       {dest_2__full_keypath}')
        # print(f'dest_3__full_keypath:       {dest_3__full_keypath}')
        debug_print(str_to_print="dest_1__full_keypath", obj_out=dest_1__full_keypath)
        debug_print(str_to_print="dest_2__full_keypath", obj_out=dest_2__full_keypath)
        debug_print(str_to_print="dest_3__full_keypath", obj_out=dest_3__full_keypath)

        # Copy the file to its destinations
        copy_source = {'Bucket': src_bucket_name, 'Key': src_key_path}
        #
        s3.copy_object(CopySource=copy_source, Bucket=dest_1__bucket_name, Key=dest_1__full_keypath)
        s3.copy_object(CopySource=copy_source, Bucket=dest_2__bucket_name, Key=dest_2__full_keypath)
        s3.copy_object(CopySource=copy_source, Bucket=dest_3__bucket_name, Key=dest_3__full_keypath)

        #print(f'Files should all be copied now.')
        debug_print(str_to_print="Files should all be copied now.")

        # Define new bucket and key paths
        # # Note: The repeated names with 1, 2, 3 are in case we need to send these files to different destinations later.

        # dest_root_dir_keypath   = f'planet/PSScene3Band/'
        # dest_bucket__PUBLIC     = f'csdap-cumulus-prod-public'
        # dest_bucket__PROTECTED  = f'csdap-cumulus-prod-protected'
        #
        # dest_root_dir_keypath__1 = f'planet/PSScene3Band/'
        # dest_root_dir_keypath__2 = f'planet/PSScene3Band/'
        # dest_root_dir_keypath__3 = f'planet/PSScene3Band/'

        # Passing the invocation ID back in the success info.
        #invocationSchemaVersion = event['invocationSchemaVersion']
        #success_info = f'invocationSchemaVersion: {invocationSchemaVersion}'
        #
        return {
            'statusCode': 200,
            'invocationSchemaVersion': event['invocationSchemaVersion'],
            'invocationId': event['invocationId'],
            'results': [
                {
                    'taskId': event['tasks'][0]['taskId'],
                    'resultCode': 'Succeeded',
                    'resultString': 'Copy Operations completed successfully'
                }
            ]
        }
    except Exception:
        run_did_fail    = True
        success_info    = ""
        err_info        = str(sys.exc_info())
        #
        # Note: On failure, S3 Batch Operations expects the same 'results' schema
        # with a resultCode of 'TemporaryFailure' or 'PermanentFailure'. A bare 500
        # like this is treated as a malformed result, so the task is still recorded
        # as failed in the completion report.
        return {
            'statusCode': 500,
            'err_info': f'{err_info}'
        }

    # return {
    #     'err_info': f'{err_info}',
    #     'success_info': f'{success_info}'
    # }
krisstanton commented 3 months ago

Thumbnail copy is complete. In order to copy/rename the thumbs in the manner described in the ticket description above, I had to use S3 Batch Operations. I first attempted to run this as a local Python script that called the AWS CLI, but determined that the processing time would have been hundreds of hours.

Here is what was done to make this copy/rename happen:

  1. Filter the entire inventory down to only the thumbnail files. The resulting CSV of thumbs (bucket_name, key_path) serves as the 'manifest' input to the S3 Batch Operation.
  2. Lambda to process the manifest. Each line of the CSV invokes the Lambda function once; the Lambda contains the custom logic, including error catching, for performing the renames and copies (see the comment above for the Lambda code). Note: some role permissions were required to make this work, and the Lambda timeout needed to be raised to 15 seconds to avoid many of the failures seen on the first run.
  3. Batch Operations (https://us-west-2.console.aws.amazon.com/s3/jobs?region=us-west-2#). Configuration of the job is straightforward: select where the manifest is located (the custom CSV was uploaded to an S3 location in the internal bucket) and tell the batch operation to "Invoke AWS Lambda function" rather than doing a straight "Copy". The batch operation's completion reports land in another location in the internal bucket: https://us-west-2.console.aws.amazon.com/s3/buckets/csdap-cumulus-prod-internal?region=us-west-2&bucketType=general&prefix=kstest/lambda_test/completion_reports/&showversions=false (The sketch after this list shows what an equivalent job configuration could look like through the API.)
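
For reference, roughly the same job configuration can be expressed through the S3 Control API. This is only a sketch of what that could look like (the job above was configured in the console); the account ID, ARNs, and manifest ETag are placeholders:

import boto3

s3control = boto3.client('s3control', region_name='us-west-2')

response = s3control.create_job(
    AccountId='111111111111',
    ConfirmationRequired=True,
    Priority=10,
    RoleArn='arn:aws:iam::111111111111:role/batch-ops-lambda-role',
    Operation={
        'LambdaInvoke': {
            'FunctionArn': 'arn:aws:lambda:us-west-2:111111111111:function:copy-thumbs'
        }
    },
    Manifest={
        'Spec': {
            'Format': 'S3BatchOperations_CSV_20180820',
            'Fields': ['Bucket', 'Key']
        },
        'Location': {
            'ObjectArn': 'arn:aws:s3:::csdap-cumulus-prod-internal/kstest/lambda_test/thumb-manifest.csv',
            'ETag': 'replace-with-manifest-etag'
        }
    },
    Report={
        'Bucket': 'arn:aws:s3:::csdap-cumulus-prod-internal',
        'Prefix': 'kstest/lambda_test/completion_reports',
        'Format': 'Report_CSV_20180820',
        'Enabled': True,
        'ReportScope': 'AllTasks'
    }
)
print(response['JobId'])

Setting ReportScope to 'FailedTasksOnly' would keep the completion report down to just the rows needed for a retry manifest.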

There was a bit of back and forth getting the response format that Batch Operations expects, as well as getting all the roles and permissions set up.

Also, once I got it running properly, there were two rounds of errors. The first round was because the Lambda was set to time out after 3 seconds (I originally thought this was sufficient because each of these thumbnail files is very small, in the 5 to 10 kilobyte range). That first round of 2.3 million executions had about 11% errors due to timeouts. I fixed the timeout and ran a second round, which had only 330 errors. These were due to an AWS limit on S3 request rates: the code occasionally hit a throttling limit for the account. This was surprising for an S3 batch operation, but it makes sense given that the job invokes a custom Lambda. The third batch job ran only the 330 failed objects (I wrote a Python script to regenerate the smaller, 330-line CSV from the outputs of the round-2 batch job; a sketch of that filtering step is below).
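
A sketch of what that regeneration script could look like, assuming the completion report rows begin with the columns Bucket, Key, VersionId, TaskStatus (the local paths are hypothetical):

import csv
import glob

# Hypothetical local copies of the round-2 completion report files.
RETRY_MANIFEST = 'thumb-manifest-retry.csv'

with open(RETRY_MANIFEST, 'w', newline='') as dst:
    writer = csv.writer(dst)
    for report_path in glob.glob('completion_reports/*.csv'):
        with open(report_path, newline='') as src:
            for bucket, key, _version, task_status, *_rest in csv.reader(src):
                # Re-queue only the objects whose copy failed.
                if task_status != 'Succeeded':
                    writer.writerow([bucket, key])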