GoogleCloudPlatform / gsutil

A command line tool for interacting with cloud storage services.
Apache License 2.0
874 stars 334 forks source link

Slow listing of destination bucket in AWS S3 #1248

Open thaniri opened 3 years ago

thaniri commented 3 years ago

I transferred 4.7 million files (about 2Tb in size in total) from a GCS bucket to an S3 bucket.

gsutil -m rsync -r -DD -J gs://<bucket_name> s3://<s3_bucket_name>

Then I ran the exact same command again to see how long it would take to sync diffs.

It took about 4 hours to list 700,000 files in the destination bucket.

The logs I am seeing look like this:

At destination listing 970000...
At destination listing 980000...
At destination listing 990000...
At destination listing 1000000...

From the code, it looks like all that is being done here is iterating over the contents of the destination bucket and checking properties of the objects inside. https://github.com/GoogleCloudPlatform/gsutil/blob/master/gslib/commands/rsync.py#L763

Listing the source bucket was quite quick, but listing the destination bucket is too slow for my purposes.

I was not running into any throttling issues on the AWS side, and when running aws s3 ls <s3_bucket_name on the command line, the listing process was significantly faster that the output of the gsutil rsync command was reporting.

The command itself is being run on a rather large EC2 instance inside of AWS in the same region as the S3 bucket with a VPC endpoint configured. The conditions for interacting with the S3 bucket probably cannot be appreciably improved.

The instance itself has a mostly idle CPU, minimal memory usage, zero disk usage, and is not even close to running out of bandwidth on the NIC.

In the process tree the command looks like this:

ubuntu   3491279 3491268  0 23:04 pts/3    00:00:00  |                       \_ /bin/sh /home/ubuntu/google-cloud-sdk/bin/gsutil -m rsync -r -J gs://<bucket_name> s3://<s3_bucket_name>
ubuntu   3491288 3491279  0 23:04 pts/3    00:00:00  |                           \_ python3 /home/ubuntu/google-cloud-sdk/bin/bootstrapping/gsutil.py -m rsync -r -J gs://<bucket_name> s3://<s3_bucket_name>
ubuntu   3491291 3491288  0 23:04 pts/3    00:00:00  |                               \_ python3 /home/ubuntu/google-cloud-sdk/platform/gsutil/gsutil -o GSUtil:default_project_id=<project_name> -o Credentials:use_client_certificate=False -m rsync -r -J gs://<bucket_name> s3://<s3_bucket_name>
ubuntu   3491293 3491291  0 23:04 pts/3    00:00:00  |                                   \_ python3 /home/ubuntu/google-cloud-sdk/platform/gsutil/gsutil -o GSUtil:default_project_id=<project_name> -o Credentials:use_client_certificate=False -m rsync -r -J gs://<bucket_name> s3://<s3_bucket_name>
ubuntu   3491295 3491291  0 23:04 pts/3    00:00:00  |                                   \_ python3 /home/ubuntu/google-cloud-sdk/platform/gsutil/gsutil -o GSUtil:default_project_id=<project_name> -o Credentials:use_client_certificate=False -m rsync -r -J gs://<bucket_name> s3://<s3_bucket_name>
ubuntu   3491419 3491291 57 23:04 pts/3    00:09:52  |                                   \_ python3 /home/ubuntu/google-cloud-sdk/platform/gsutil/gsutil -o GSUtil:default_project_id=<project_name> -o Credentials:use_client_certificate=False -m rsync -r -J gs://<bucket_name> s3://<s3_bucket_name>
ubuntu   3491420 3491291  0 23:04 pts/3    00:00:00  |                                   \_ python3 /home/ubuntu/google-cloud-sdk/platform/gsutil/gsutil -o GSUtil:default_project_id=<project_name> -o Credentials:use_client_certificate=False -m rsync -r -J gs://<bucket_name> s3://<s3_bucket_name>
ubuntu   3491422 3491291  0 23:04 pts/3    00:00:00  |                                   \_ python3 /home/ubuntu/google-cloud-sdk/platform/gsutil/gsutil -o GSUtil:default_project_id=<project_name> -o Credentials:use_client_certificate=False -m rsync -r -J gs://<bucket_name> s3://<s3_bucket_name>
ubuntu   3491426 3491291  0 23:04 pts/3    00:00:00  |                                   \_ python3 /home/ubuntu/google-cloud-sdk/platform/gsutil/gsutil -o GSUtil:default_project_id=<project_name> -o Credentials:use_client_certificate=False -m rsync -r -J gs://<bucket_name> s3://<s3_bucket_name>
ubuntu   3491431 3491291  0 23:04 pts/3    00:00:00  |                                   \_ python3 /home/ubuntu/google-cloud-sdk/platform/gsutil/gsutil -o GSUtil:default_project_id=<project_name> -o Credentials:use_client_certificate=False -m rsync -r -J gs://<bucket_name> s3://<s3_bucket_name>
ubuntu   3491436 3491291  0 23:04 pts/3    00:00:00  |                                   \_ python3 /home/ubuntu/google-cloud-sdk/platform/gsutil/gsutil -o GSUtil:default_project_id=<project_name> -o Credentials:use_client_certificate=False -m rsync -r -J gs://<bucket_name> s3://<s3_bucket_name>
ubuntu   3491439 3491291  0 23:04 pts/3    00:00:00  |                                   \_ python3 /home/ubuntu/google-cloud-sdk/platform/gsutil/gsutil -o GSUtil:default_project_id=<project_name> -o Credentials:use_client_certificate=False -m rsync -r -J gs://<bucket_name> s3://<s3_bucket_name>
ubuntu   3491446 3491291  0 23:04 pts/3    00:00:00  |                                   \_ python3 /home/ubuntu/google-cloud-sdk/platform/gsutil/gsutil -o GSUtil:default_project_id=<project_name> -o Credentials:use_client_certificate=False -m rsync -r -J gs://<bucket_name> s3://<s3_bucket_name>
ubuntu   3491449 3491291  0 23:04 pts/3    00:00:00  |                                   \_ python3 /home/ubuntu/google-cloud-sdk/platform/gsutil/gsutil -o GSUtil:default_project_id=<project_name> -o Credentials:use_client_certificate=False -m rsync -r -J gs://<bucket_name> s3://<s3_bucket_name>
ubuntu   3491453 3491291  0 23:04 pts/3    00:00:00  |                                   \_ python3 /home/ubuntu/google-cloud-sdk/platform/gsutil/gsutil -o GSUtil:default_project_id=<project_name> -o Credentials:use_client_certificate=False -m rsync -r -J gs://<bucket_name> s3://<s3_bucket_name>
ubuntu   3491457 3491291  0 23:04 pts/3    00:00:00  |                                   \_ python3 /home/ubuntu/google-cloud-sdk/platform/gsutil/gsutil -o GSUtil:default_project_id=<project_name> -o Credentials:use_client_certificate=False -m rsync -r -J gs://<bucket_name> s3://<s3_bucket_name>
ubuntu   3491462 3491291  0 23:04 pts/3    00:00:00  |                                   \_ python3 /home/ubuntu/google-cloud-sdk/platform/gsutil/gsutil -o GSUtil:default_project_id=<project_name> -o Credentials:use_client_certificate=False -m rsync -r -J gs://<bucket_name> s3://<s3_bucket_name>

(The PID is such a high number because the command is being run in a screen session).

When running strace on some of the pids, I don't have the ability to follow up.

Some of the child processes appear to be timing out on network calls repeatedly:

futex(0x2ac1390, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1618874669, tv_nsec=401310000}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)

Another process is stuck on a read call:

read(6,

This child process seems to be the most interesting, as using lsof I can see that it has TCP connections from the EC2 instance to both AWS S3 (ESTABLISHED) and with GCP GCS (CLOSE_WAIT).

// omitted for brevity
python3 3491419 ubuntu   21u  IPv4          169126341      0t0       TCP ip-x-x-x-x.us-west-2.compute.internal:59352->s3-us-west-2-r-w.amazonaws.com:https (ESTABLISHED)
python3 3491419 ubuntu   22w   REG              259,1        0      2363 /tmp/gsutil-rsync-dst-7snw22qf
python3 3491419 ubuntu   23u   REG              259,1  4962820     14023 /tmp/gsutil-rsync-dst-7snw22qf-000001
python3 3491419 ubuntu   24u  unix 0x0000000000000000      0t0 169114737 type=STREAM
python3 3491419 ubuntu   25u  IPv4          169102695      0t0       TCP ip-x-x-x-x.us-west-2.compute.internal:40466->sea15s11-in-f176.1e100.net:https (CLOSE_WAIT)
python3 3491419 ubuntu   26u   REG              259,1  5143215     18625 /tmp/gsutil-rsync-dst-7snw22qf-000002
python3 3491419 ubuntu   28u  IPv4          169114065      0t0       TCP ip-x-x-x-x.us-west-2.compute.internal:42018->sea09s29-in-f10.1e100.net:https (CLOSE_WAIT)
python3 3491419 ubuntu  120u   REG              259,1  5153606     18816 /tmp/gsutil-rsync-dst-7snw22qf-000000

What this suggests to me is that the process has finished interacting with the GCS bucket (for now!), and is only interacting with the S3 bucket. This makes sense given that the only output I see from the command is At destination listing x...

I have been notified by AWS support that I am not approaching any rate limits for S3, and the point at which this code is being slow for me is after it has stopped interacting with GCS.

Does anyone have an idea of why this is so slow? If the virtual machine was running out of resources that would be one thing, but it appears to just not be doing much at all when inspecting it on the command line.

adityaponnada commented 3 years ago

@thaniri any resolution to this? I have the same problem. Just that my destination is on a Discovery Cluster and it seems to take forever with the "At destination listing ..." check

vojkny commented 3 years ago

Same issue.

nullstream commented 2 years ago

I have the same problem destination enumeration takes forever. Source is local storage, destination is GCP. Folders are quite large (4-6 million files), yet local enumeration is no problem. Remote enumeration part takes FOREVER. I had it going for 36 hours before I gave up. Breaking the upload into smaller batches but not looking forward to the validation since enumeration will take an eon after I've uploaded everything from multiple data sets.

ag-TJNII commented 11 months ago

I've also noticed a slow listing of the destination bucket. In my case the rsync command appeared to be enumerating all files in the destination, despite the source only having a single file. If that's the case then only listing the files being transferred could have a significant performance increase. (Note -d was not in use.)

jooj123 commented 10 months ago

Same issue for me - in my cause im using a command like:

gsutil -m rsync -i -r -y "some-str.*\.gz$" gs://bucket/path s3://bucket/path

Its extremely slow and eventually stalls kube pods, there is ~84k files (they are really small files - usually no more than 2kb) I can see both high CPU and memory:

stuff