cloudera-labs / hms-mirror

"hms-mirror" is a utility used to bridge the gap between two clusters and migrate hive metadata.
Apache License 2.0

CDW on Public cloud requires ability to copy data between S3 buckets via aws cli #32

Open hpasumarthi opened 1 year ago

hpasumarthi commented 1 year ago

Hello Team, In CDW Public Cloud on AWS, data is stored in S3 buckets. Copying data with the hadoop CLI or distcp is not possible in Public Cloud (PC) environments because there are no Hadoop clusters. Can hms-mirror be enhanced to use AWS CLI commands to copy data between the left and right table locations, i.e. S3 buckets?

e.g. aws s3 cp --recursive s3://DOC-EXAMPLE-BUCKET-SOURCE s3://DOC-EXAMPLE-BUCKET-TARGET or

e.g. aws s3 sync s3://DOC-EXAMPLE-BUCKET-SOURCE s3://DOC-EXAMPLE-BUCKET-TARGET

https://repost.aws/knowledge-center/move-objects-s3-bucket

The expectation is that, instead of running distcp commands, hms-mirror will use the AWS CLI to copy data from left to right. Regards, Hemanth
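To make the request concrete: a table location is an S3 prefix rather than a single object, so `aws s3 cp` needs `--recursive`, while `aws s3 sync` walks the prefix on its own and copies only new or updated objects. The sketch below shows one way such a command could be built from a left/right pair. The helper names `to_s3_uri` and `build_copy_cmd` and the `/db/tbl` path are hypothetical and are not part of hms-mirror; the functions only print the command, they do not run it.

```shell
# Hypothetical helper: rewrite a Hadoop-style s3a:// URI as the s3:// form
# that the AWS CLI expects; any other input is passed through unchanged.
to_s3_uri() {
    case "$1" in
        s3a://*) printf 's3://%s\n' "${1#s3a://}" ;;
        *)       printf '%s\n' "$1" ;;
    esac
}

# Hypothetical helper: print (do not run) the copy command for one table
# location. $1 = left location, $2 = right location,
# $3 = "sync" (default) or "cp".
build_copy_cmd() {
    case "${3:-sync}" in
        sync) echo "aws s3 sync $(to_s3_uri "$1") $(to_s3_uri "$2")" ;;
        cp)   echo "aws s3 cp --recursive $(to_s3_uri "$1") $(to_s3_uri "$2")" ;;
        *)    echo "unknown mode: $3" >&2; return 1 ;;
    esac
}

# Example call with a made-up /db/tbl path under the placeholder buckets.
build_copy_cmd s3a://DOC-EXAMPLE-BUCKET-SOURCE/db/tbl s3a://DOC-EXAMPLE-BUCKET-TARGET/db/tbl
```

Defaulting to `sync` is a design choice for this sketch: it can be re-run after an interrupted copy without transferring objects that are already up to date at the target.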

hpasumarthi commented 1 year ago

I came up with a small script that converts the distcp workbook generated by hms-mirror into AWS CLI commands:

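#!/usr/bin/env bash
# Convert an hms-mirror distcp workbook (markdown) into 'aws s3 sync' commands.
# Usage: distcp_awscli.sh <workbook.md>
if [ -f "$1" ]; then
    echo "##Working on file : $1"
else
    echo "##File in the path $1 does not exist."
    exit 1
fi

echo "##Run set/export AWS_DEFAULT_PROFILE=sso.dev before running commands below"
# Each matching row is pipe-delimited: field 3 is the right (target) location,
# field 4 holds the left (source) locations separated by <br>.
grep 's3a://' "$1" | sed 's/s3a:/s3:/g' | while read -r line
do
   location_right=$(echo "$line" | cut -d '|' -f3 | xargs)
   echo "$line" | cut -d '|' -f4 | sed 's/<br>/\n/g' | while read -r location_left
   do
     # The last path component of the source is used as the table directory name.
     tbl_name="${location_left##*/}"
     if [[ "$location_left" =~ .*"s3://".* ]]; then
        echo "aws s3 sync $location_left $location_right/$tbl_name"
     fi
   done
done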
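For readers who want to see the row parsing in isolation, here is a minimal awk-based sketch of the same transformation. It avoids `\n` in a sed replacement, which is not guaranteed to produce a newline in every sed implementation. The sample row is hypothetical: it is reconstructed from the field positions the script reads and from the bucket paths in the output, not copied from a real workbook, and the column name `testdev_airlines` is a guess. The function name `row_to_sync` is also made up for this example.

```shell
# Hypothetical sample row: field 3 is the right (target) location, field 4
# holds the left (source) locations separated by <br>.
row='| testdev_airlines | s3a://ps-uat22/testdev-iceberg/airlines-iceberg | s3a://ps-uat2/testdev-iceberg/airlines-iceberg/flights<br>s3a://ps-uat2/testdev-iceberg/airlines-iceberg/planes |'

# Read workbook rows on stdin and print one 'aws s3 sync' command per source.
row_to_sync() {
    awk -F'|' '/s3a:\/\// {
        gsub(/s3a:/, "s3:")                      # AWS CLI expects s3://
        target = $3
        gsub(/^[ \t]+|[ \t]+$/, "", target)      # trim surrounding blanks
        n = split($4, sources, "<br>")
        for (i = 1; i <= n; i++) {
            src = sources[i]
            gsub(/^[ \t]+|[ \t]+$/, "", src)
            if (src !~ /s3:\/\//) continue       # skip non-S3 entries
            m = split(src, parts, "/")           # parts[m] is the table name
            print "aws s3 sync " src " " target "/" parts[m]
        }
    }'
}

printf '%s\n' "$row" | row_to_sync
```

With this sample row the function prints two commands, one for `flights` and one for `planes`, each targeting the matching directory under the right location.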

Running the script prints the distcp locations as AWS CLI commands. It uses the bash-only [[ ... ]] test, so run it with bash:

% bash distcp_awscli.sh testdev_airlines_RIGHT_distcp_workbook.md

##Working on file : testdev_airlines_RIGHT_distcp_workbook.md
##Run set/export AWS_DEFAULT_PROFILE=sso.dev before running commands below
aws s3 sync s3://ps-uat2/testdev-iceberg/airlines-iceberg/flights s3://ps-uat22/testdev-iceberg/airlines-iceberg/flights
aws s3 sync s3://ps-uat2/testdev-iceberg/airlines-iceberg/flights_iceberg s3://ps-uat22/testdev-iceberg/airlines-iceberg/flights_iceberg
aws s3 sync s3://ps-uat2/testdev-iceberg/airlines-iceberg/planes s3://ps-uat22/testdev-iceberg/airlines-iceberg/planes
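
Before running the generated commands against real buckets, it may be worth previewing them. The AWS CLI `s3 sync` command accepts `--dryrun`, which lists the operations it would perform without copying anything. A minimal sketch, assuming the script output format shown above; the function name `add_dryrun` and the file name `preview.sh` are hypothetical.

```shell
# Read the generated output on stdin, drop the '##' comment lines, and append
# --dryrun to every 'aws s3 sync' command so nothing is actually copied.
add_dryrun() {
    grep '^aws s3 sync ' | sed 's/$/ --dryrun/'
}

# Possible usage (not executed here):
#   bash distcp_awscli.sh testdev_airlines_RIGHT_distcp_workbook.md | add_dryrun > preview.sh
```

Once the dry-run listing looks right, the original commands can be run unchanged.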