bbrewington / gatech-cse6242-citibike


Look into refactoring to use Google Cloud storage transfer #7

Open bbrewington opened 1 year ago

bbrewington commented 1 year ago

Description

Currently, I'm running a GitHub Action that does this:

https://github.com/bbrewington/gatech-cse6242-citibike/blob/7cd80854b74d509b1c2d76496565053d121b07d5/.github/workflows/citibike_trip_history.yml#L36-L44

and here's the meat of copy_aws_to_gcs.py:

https://github.com/bbrewington/gatech-cse6242-citibike/blob/7cd80854b74d509b1c2d76496565053d121b07d5/src/copy_aws_to_gcs.py#L37-L52

This is a pretty clunky way of doing this: it reads the data via curl and pipes it to gsutil cp -. The - is a special argument that represents standard input, so the data received from curl is copied to the GCS object as-is, without being written to a "local" file first. I'm guessing Google's Storage Transfer Service would be more scalable and efficient than this...worth a try.

Reference https://cloud.google.com/python/docs/reference/storagetransfer/latest/google.cloud.storage_transfer_v1.types.AwsS3CompatibleData

bbrewington commented 1 year ago

Notes from https://cloud.google.com/storage-transfer/docs/s3-compatible#gcloud-cli

Example code

Create Agent Pool

gcloud transfer agent-pools create NAME \
  [--no-async] \
  [--bandwidth-limit=BANDWIDTH_LIMIT] \
  [--display-name=DISPLAY_NAME]

Install Transfer Agents

Transfer agents are software agents that coordinate transfer activities from your source through Storage Transfer Service. They must be installed on a system with access to your source data.

AWS_ACCESS_KEY_ID=ID
AWS_SECRET_ACCESS_KEY=SECRET
gcloud transfer agents install --s3-compatible-mode --pool=POOL_NAME

Required IAM roles:

Note from gcloud transfer agents install --help

     --count=COUNT
        Specify the number of agents to install on your current machine. System
        requirements: 8 GB of memory and 4 CPUs per agent.

        Note: If the 'id-prefix' flag is specified, Transfer Service increments
        a number value after each prefix. Example: prefix1, prefix2, etc.

Create the transfer job

gcloud transfer jobs create s3://SOURCE_BUCKET_NAME gs://SINK_BUCKET_NAME \
  --source-agent-pool=POOL_NAME \
  --source-endpoint=ENDPOINT \
  --source-signing-region=REGION \
  --source-auth-method=AWS_SIGNATURE_V2 | AWS_SIGNATURE_V4 \
  --source-request-model=PATH_STYLE | VIRTUAL_HOSTED_STYLE \
  --source-network-protocol=HTTP | HTTPS \
  --source-list-api=LIST_OBJECTS | LIST_OBJECTS_V2

What throughput can be achieved for transfers from S3-compatible storage?

Your transfer throughput scales with the number of transfer agents. We recommend using 3 agents for fault tolerance and to fill a <10Gbps pipe; to scale further, add more agents. Agents can be added and removed while a transfer is in progress.

Where should transfer agents be deployed to transfer data from Amazon S3 to Cloud Storage?

You can install agents in Amazon EC2 or EKS within the same region as your bucket. You can also run agents on Google Cloud in the nearest region.