Closed chuckwondo closed 3 months ago
Here's an example Lambda function that works with S3 Batch Operations: https://docs.aws.amazon.com/AmazonS3/latest/userguide/batch-ops-invoke-lambda.html
WIP Update:
Number of Total Planet Files: 39,921,082
Number of "_thumb" files from above file list: 2,361,441
Here is the custom Lambda used in the S3 Batch Copy operation. Note: I left all the commented code to show the original testing of this functionality.
import json
import boto3
import sys

s3 = boto3.client('s3')

# Setting this to True will create significant output to Cloudwatch Logs
SETTING__IS_OUTPUT_DEBUG_MODE = False

# When we need to print the output to the logs to see what is going on
def debug_print(str_to_print="", obj_out=None):
    if(SETTING__IS_OUTPUT_DEBUG_MODE == True):
        print(f'{str_to_print}: {obj_out}')

def lambda_handler(event, context):
    run_did_fail = False
    err_info = ""
    success_info = ""
    try:
        #print('SETTING__IS_OUTPUT_DEBUG_MODE is set to False')
        #print(f'Starting a new Test run')
        debug_print(str_to_print="Starting a new Test run")
        # Looking at the Event Object:
        #print(f'Event Object: {event}')
        debug_print(str_to_print="Event Object", obj_out=event)
        # Extract bucket name and key from the event
        #csv_line = event['Records'][0]['s3']['object']['key'] # This was the wrong JSON structure
        #csv_line = event['tasks'][0]['s3']['object']['key']
        #print(f'csv_line: {csv_line}')
        #debug_print(str_to_print="csv_line", obj_out=csv_line)
        #
        s3BucketArn = event['tasks'][0]['s3BucketArn']
        s3Key = event['tasks'][0]['s3Key']
        debug_print(str_to_print="s3BucketArn", obj_out=s3BucketArn)
        debug_print(str_to_print="s3Key", obj_out=s3Key)
        # Split the CSV line to get the bucket and key
        #src_bucket_name, src_key_path = csv_line.replace('"', '').split(',')
        src_bucket_name = s3BucketArn.split(':::')[1]
        src_key_path = s3Key
        # Strip any extra spaces and quotes
        src_bucket_name = src_bucket_name.strip()
        src_key_path = src_key_path.strip()
        #print(f'src_bucket_name: {src_bucket_name}')
        #print(f'src_key_path: {src_key_path}')
        debug_print(str_to_print="src_bucket_name", obj_out=src_bucket_name)
        debug_print(str_to_print="src_key_path", obj_out=src_key_path)
        # Extract the Source file name and rename to a .png in one step
        temp_filename_only = src_key_path.split("/")[-1] + ".png" # Renaming to add .png on download.
        #print(f'temp_filename_only: {temp_filename_only}')
        debug_print(str_to_print="temp_filename_only", obj_out=temp_filename_only)
        # For the Test,
        # # The copies are ALL going in one place:
        # # # Example: csdap-cumulus-prod-internal/kstest/lambda_test/planet/PSScene3Band/20190917_203931_1039-thumbPUB.png
        # Create all the destination filenames
        #
        # Destination Keypaths / Filenames
        # dest_1__file_name_only = temp_filename_only.replace("_thumb", "-thumbPUB") # Converts "20190917_203931_1039_thumb.png" into "20190917_203931_1039-thumb.png" # This goes in the PUBLIC bucket
        # dest_2__file_name_only = temp_filename_only.replace("_thumb", "-thumb") # Converts "20190917_203931_1039_thumb.png" into "20190917_203931_1039-thumb.png" # This goes in the PROTECTED bucket
        # dest_3__file_name_only = temp_filename_only.replace("_thumb", "-BROWSE") # Converts "20190917_203931_1039_thumb.png" into "20190917_203931_1039-BROWSE.png" # This goes in the PROTECTED bucket
        dest_1__file_name_only = temp_filename_only.replace("_thumb", "-thumb") # Converts "20190917_203931_1039_thumb.png" into "20190917_203931_1039-thumb.png" # This goes in the PUBLIC bucket
        dest_2__file_name_only = temp_filename_only.replace("_thumb", "-thumb") # Converts "20190917_203931_1039_thumb.png" into "20190917_203931_1039-thumb.png" # This goes in the PROTECTED bucket
        dest_3__file_name_only = temp_filename_only.replace("_thumb", "-BROWSE") # Converts "20190917_203931_1039_thumb.png" into "20190917_203931_1039-BROWSE.png" # This goes in the PROTECTED bucket
        #
        # dest_1__root_dir_keypath = f'kstest/lambda_test/planet/PSScene3Band/' # f'planet/PSScene3Band/'
        # dest_2__root_dir_keypath = f'kstest/lambda_test/planet/PSScene3Band/' # f'planet/PSScene3Band/'
        # dest_3__root_dir_keypath = f'kstest/lambda_test/planet/PSScene3Band/' # f'planet/PSScene3Band/'
        dest_1__root_dir_keypath = f'planet/PSScene3Band/' # f'planet/PSScene3Band/'
        dest_2__root_dir_keypath = f'planet/PSScene3Band/' # f'planet/PSScene3Band/'
        dest_3__root_dir_keypath = f'planet/PSScene3Band/' # f'planet/PSScene3Band/'
        #
        # dest_1__bucket_name = f'csdap-cumulus-prod-internal'
        # dest_2__bucket_name = f'csdap-cumulus-prod-internal'
        # dest_3__bucket_name = f'csdap-cumulus-prod-internal'
        dest_1__bucket_name = f'csdap-cumulus-prod-public'
        dest_2__bucket_name = f'csdap-cumulus-prod-protected'
        dest_3__bucket_name = f'csdap-cumulus-prod-protected'
        #
        dest_1__full_keypath = f'{dest_1__root_dir_keypath}{dest_1__file_name_only}'
        dest_2__full_keypath = f'{dest_2__root_dir_keypath}{dest_2__file_name_only}'
        dest_3__full_keypath = f'{dest_3__root_dir_keypath}{dest_3__file_name_only}'
        #
        #
        # print(f'dest_1__file_name_only: {dest_1__file_name_only}')
        # print(f'dest_2__file_name_only: {dest_2__file_name_only}')
        # print(f'dest_3__file_name_only: {dest_3__file_name_only}')
        debug_print(str_to_print="dest_1__file_name_only", obj_out=dest_1__file_name_only)
        debug_print(str_to_print="dest_2__file_name_only", obj_out=dest_2__file_name_only)
        debug_print(str_to_print="dest_3__file_name_only", obj_out=dest_3__file_name_only)
        #
        # print(f'dest_1__root_dir_keypath: {dest_1__root_dir_keypath}')
        # print(f'dest_2__root_dir_keypath: {dest_2__root_dir_keypath}')
        # print(f'dest_3__root_dir_keypath: {dest_3__root_dir_keypath}')
        debug_print(str_to_print="dest_1__root_dir_keypath", obj_out=dest_1__root_dir_keypath)
        debug_print(str_to_print="dest_2__root_dir_keypath", obj_out=dest_2__root_dir_keypath)
        debug_print(str_to_print="dest_3__root_dir_keypath", obj_out=dest_3__root_dir_keypath)
        #
        # print(f'dest_1__bucket_name: {dest_1__bucket_name}')
        # print(f'dest_2__bucket_name: {dest_2__bucket_name}')
        # print(f'dest_3__bucket_name: {dest_3__bucket_name}')
        debug_print(str_to_print="dest_1__bucket_name", obj_out=dest_1__bucket_name)
        debug_print(str_to_print="dest_2__bucket_name", obj_out=dest_2__bucket_name)
        debug_print(str_to_print="dest_3__bucket_name", obj_out=dest_3__bucket_name)
        #
        # print(f'dest_1__full_keypath: {dest_1__full_keypath}')
        # print(f'dest_2__full_keypath: {dest_2__full_keypath}')
        # print(f'dest_3__full_keypath: {dest_3__full_keypath}')
        debug_print(str_to_print="dest_1__full_keypath", obj_out=dest_1__full_keypath)
        debug_print(str_to_print="dest_2__full_keypath", obj_out=dest_2__full_keypath)
        debug_print(str_to_print="dest_3__full_keypath", obj_out=dest_3__full_keypath)
        # Copy the file to its destinations
        copy_source = {'Bucket': src_bucket_name, 'Key': src_key_path}
        #
        s3.copy_object(CopySource=copy_source, Bucket=dest_1__bucket_name, Key=dest_1__full_keypath)
        s3.copy_object(CopySource=copy_source, Bucket=dest_2__bucket_name, Key=dest_2__full_keypath)
        s3.copy_object(CopySource=copy_source, Bucket=dest_3__bucket_name, Key=dest_3__full_keypath)
        #print(f'Files should all be copied now.')
        debug_print(str_to_print="Files should all be copied now.")
        # Define new bucket and key paths
        # # Note: All this stuff where we have the repeated names with 1,2,3 are in case we need to later send these files to different destinations.
        # dest_root_dir_keypath = f'planet/PSScene3Band/'
        # dest_bucket__PUBLIC = f'csdap-cumulus-prod-public'
        # dest_bucket__PROTECTED = f'csdap-cumulus-prod-protected'
        #
        # dest_root_dir_keypath__1 = f'planet/PSScene3Band/'
        # dest_root_dir_keypath__2 = f'planet/PSScene3Band/'
        # dest_root_dir_keypath__3 = f'planet/PSScene3Band/'
        # #
        # dest_root_dir_keypath = f'planet/PSScene3Band/'
        # dest_bucket__PUBLIC = f'csdap-cumulus-prod-public'
        # dest_bucket__PROTECTED = f'csdap-cumulus-prod-protected'
        # Passing the invocation ID back in the success info.
        #invocationSchemaVersion = event['invocationSchemaVersion']
        #success_info = f'invocationSchemaVersion: {invocationSchemaVersion}'
        #
        return {
            'statusCode': 200,
            'invocationSchemaVersion': event['invocationSchemaVersion'],
            'invocationId': event['invocationId'],
            'results': [
                {
                    'taskId': event['tasks'][0]['taskId'],
                    'resultCode': 'Succeeded',
                    'resultString': 'Copy Operations completed successfully'
                }
            ]
        }
    except:
        run_did_fail = True
        success_info = ""
        err_info = str(sys.exc_info())
        #
        return {
            'statusCode': 500,
            'err_info': f'{err_info}'
        }
    # return {
    #     'err_info': f'{err_info}',
    #     'success_info': f'{success_info}'
    # }
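One note on the error path above: as far as I understand the Batch Operations contract, a failed task is reported in the same envelope as a successful one, with a resultCode of TemporaryFailure or PermanentFailure, rather than as a bare statusCode/err_info dictionary. Below is a minimal sketch of a response builder along those lines. The helper name build_batch_response is hypothetical and not part of the Lambda above, the event values are made up, and using TemporaryFailure for a throttled copy is an assumption about how the job should behave.

```python
def build_batch_response(event, result_code, result_string):
    """Build a response in the shape S3 Batch Operations expects.

    build_batch_response is a hypothetical helper, not part of the Lambda above.
    result_code must be 'Succeeded', 'TemporaryFailure' or 'PermanentFailure'.
    """
    allowed = {'Succeeded', 'TemporaryFailure', 'PermanentFailure'}
    if result_code not in allowed:
        raise ValueError(f'Unexpected result code: {result_code}')
    task = event['tasks'][0]
    return {
        'invocationSchemaVersion': event['invocationSchemaVersion'],
        # How Batch Operations should treat tasks missing from 'results'
        'treatMissingKeysAs': 'PermanentFailure',
        'invocationId': event['invocationId'],
        'results': [
            {
                'taskId': task['taskId'],
                'resultCode': result_code,
                'resultString': result_string,
            }
        ],
    }


# Example event, trimmed to the fields used here (all values are made up).
example_event = {
    'invocationSchemaVersion': '1.0',
    'invocationId': 'example-invocation-id',
    'tasks': [
        {
            'taskId': 'example-task-id',
            's3BucketArn': 'arn:aws:s3:::csdap-cumulus-prod-protected',
            's3Key': 'planet/PSScene3Band/20150601_121238_090c_thumb',
        }
    ],
}

# Assumption: a throttled copy is worth retrying, so report it as temporary.
failure = build_batch_response(example_event, 'TemporaryFailure', 'SlowDown: retry later')
```

In the handler, the except branch could return such a response with the exception text as the resultString, so that failures show up per task in the job's completion report.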
Thumbnail copy is complete. In order to copy/rename the thumbs in the manner described in the ticket description above, I had to use S3 Batch Operations. At first, I attempted to run this as a local Python script that called the AWS CLI, but after attempting that I determined that the processing time would have been hundreds of hours!
Here is what was done to make this copy/rename happen:
There was a bit of back and forth trying to get the correct response format that Batch Operations was expecting, as well as getting all the roles and permissions set up.
Also, once I was able to get it to run properly, there were two rounds of errors. The first round was because the Lambda was set to time out after 3 seconds (I originally thought this was sufficient because each of these thumbnail files is very small, in the 5 to 10 kilobyte range). The first round of running 2.3 million executions had about 11% errors due to timeouts. I fixed this and then had a second round with only 330 errors. These errors were due to an AWS limitation on S3 operations: occasionally the code hit a throttle limit for the account. This was surprising since this is an S3 batch operation, but it also makes sense since it is a custom Lambda job. The third batch job ran only the 330 failed granules (I made a Python script to regenerate the smaller, 330-line CSV from the outputs of the round 2 batch job).
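The retry step can be sketched roughly as follows. This is not the actual script that was used; it assumes the completion report is a headerless CSV whose first four columns are bucket, key, version ID, and task status, and the function name failed_rows_to_manifest and the sample rows are made up for illustration.

```python
import csv
import io


def failed_rows_to_manifest(report_csv_text):
    """Return manifest text ('bucket,key' per line) for every task that did not succeed.

    Assumes each report row starts with: bucket, key, version id, task status.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator='\n')
    for row in csv.reader(io.StringIO(report_csv_text)):
        if len(row) < 4:
            continue  # skip blank or malformed lines
        bucket, key, _version_id, task_status = row[:4]
        if task_status.strip().lower() != 'succeeded':
            writer.writerow([bucket, key])
    return out.getvalue()


# Two made-up report rows: one success, one failure.
sample_report = (
    'csdap-cumulus-prod-protected,planet/PSScene3Band/a_thumb,,succeeded,,200,ok\n'
    'csdap-cumulus-prod-protected,planet/PSScene3Band/b_thumb,,failed,,200,throttled\n'
)
retry_manifest = failed_rows_to_manifest(sample_report)
```

The resulting text can be written to a new CSV and used as the manifest for a follow-up batch job that covers only the failed objects.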
The granules in the PSScene3Band collection contain thumbnail files with a _thumb suffix, but Aaron wants us to duplicate the thumbnail files as "browse" files, like we have with our Maxar granules, where the thumbnail file goes to the public bucket and the "browse" file goes to the protected bucket (so the "browse" file is in the list of downloadable files for a granule).
One possible option for meeting the following acceptance criteria is to use S3 Batch Operations with a Lambda function. This would also require generating an S3 inventory for the objects in s3://csdap-cumulus-prod-protected/planet/PSScene3Band/ and filtering for *_thumb files (filtering might be possible as part of the batch job config, but if not, the logic can be added to the Lambda function). (Update: Verified that the S3 Inventory prefix parameter cannot handle wildcards or filtering.)
UPDATE (Sprint 4, March 2024): Source bucket is the 5982 protected bucket
20150601_121238_090c_thumb
Note that there is no extension on this file name; however, these files are png type. So when making the inventory, we need to select it by *_thumb*
*-thumb.png
*-thumb.png
*-BROWSE.png
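Since the inventory prefix cannot filter by suffix, the filtering has to happen after the inventory is produced. A minimal sketch of that step is below, assuming an inventory CSV whose first two columns are bucket and key; the function name thumb_manifest_rows and the sample rows are hypothetical.

```python
import csv
import io

SOURCE_PREFIX = 'planet/PSScene3Band/'


def thumb_manifest_rows(inventory_csv_text, prefix=SOURCE_PREFIX):
    """Return (bucket, key) pairs for keys under prefix that end in '_thumb'."""
    rows = []
    for row in csv.reader(io.StringIO(inventory_csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        bucket, key = row[0], row[1]
        if key.startswith(prefix) and key.endswith('_thumb'):
            rows.append((bucket, key))
    return rows


# Made-up inventory rows: only the first one is a thumbnail under the prefix.
sample_inventory = (
    '"csdap-cumulus-prod-protected","planet/PSScene3Band/20150601_121238_090c_thumb"\n'
    '"csdap-cumulus-prod-protected","planet/PSScene3Band/20150601_121238_090c.tif"\n'
    '"csdap-cumulus-prod-protected","other/prefix/20150601_121238_090c_thumb"\n'
)
thumb_rows = thumb_manifest_rows(sample_inventory)
```

Each returned pair can be written as one bucket,key line of the manifest that the batch job reads.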
Note: This is all prep for being able to do the migration ingest which will use cumulus to move files from 5982 to 5047
SubTasks
protected bucket from OLD PROD (AWS -5982).
Alternate Steps (Update: S3 Inventory prefix param cannot handle wild cards or filtering)
Original Steps
Acceptance criteria
*_thumb files in the csdap-cumulus-prod-protected bucket (account csdap-cumulus-prod-5982) with key prefix planet/PSScene3Band/ exist next to the *_thumb files and named *-BROWSE.png
*_thumb files are renamed to *-thumb.png
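One way to check these criteria after the batch job is to compute, for each source key, the copies the Lambda in this thread is expected to write and compare them against a listing of the destination buckets. The sketch below mirrors the naming in that Lambda; the function names expected_copies and missing_copies are hypothetical, and how the set of existing objects is obtained (for example from a fresh inventory) is left open.

```python
DEST_ROOT = 'planet/PSScene3Band/'
PUBLIC_BUCKET = 'csdap-cumulus-prod-public'
PROTECTED_BUCKET = 'csdap-cumulus-prod-protected'


def expected_copies(src_key):
    """Return the (bucket, key) pairs expected for one *_thumb source key."""
    # Same naming as the Lambda: take the file name, append .png, swap the suffix.
    file_name = src_key.split('/')[-1] + '.png'
    thumb_name = file_name.replace('_thumb', '-thumb')
    browse_name = file_name.replace('_thumb', '-BROWSE')
    return [
        (PUBLIC_BUCKET, DEST_ROOT + thumb_name),
        (PROTECTED_BUCKET, DEST_ROOT + thumb_name),
        (PROTECTED_BUCKET, DEST_ROOT + browse_name),
    ]


def missing_copies(src_keys, existing):
    """Return expected (bucket, key) pairs that are not in the 'existing' set."""
    return [
        pair
        for key in src_keys
        for pair in expected_copies(key)
        if pair not in existing
    ]


example_key = 'planet/PSScene3Band/20150601_121238_090c_thumb'
example_expected = expected_copies(example_key)
# Pretend only the public thumbnail exists; the two protected copies are missing.
example_missing = missing_copies([example_key], {example_expected[0]})
```

An empty result from missing_copies over the full source list would indicate that every thumbnail has its -thumb.png and -BROWSE.png copies.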