Open thaniri opened 3 years ago
@thaniri any resolution to this? I have the same problem. Just that my destination is on a Discovery Cluster and it seems to take forever with the "At destination listing ..." check
Same issue.
I have the same problem: destination enumeration takes forever. Source is local storage, destination is GCP. The folders are quite large (4-6 million files), yet local enumeration is no problem. The remote enumeration takes FOREVER; I had it going for 36 hours before I gave up. I'm breaking the upload into smaller batches, but I'm not looking forward to the validation, since enumeration will take an eon after I've uploaded everything from multiple data sets.
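One way to script the batching approach mentioned above is to run a separate rsync per source subdirectory, which keeps each destination listing small. A minimal sketch, assuming per-directory batching; the paths, bucket name, and helper function here are hypothetical, not anything gsutil provides:

```python
def build_rsync_commands(subdirs, dest="gs://my-bucket"):
    """Build one gsutil rsync invocation per source subdirectory.

    Syncing each subdirectory separately keeps the destination prefix
    being enumerated small, which is the step reported as slow in this
    issue. The subdirectory names and bucket are example placeholders.
    """
    return [
        ["gsutil", "-m", "rsync", "-r", d, f"{dest}/{d.rstrip('/')}"]
        for d in subdirs
    ]

cmds = build_rsync_commands(["data/2021/", "data/2022/"])
# Each command list can then be run with subprocess.run(cmd, check=True).
```

Whether this actually helps depends on how objects are distributed across prefixes; if most of the 4-6 million files share one prefix, the destination listing for that batch will still be large.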
I've also noticed slow listing of the destination bucket. In my case the rsync command appeared to be enumerating all files in the destination, despite the source only having a single file. If that's the case, then listing only the files being transferred could be a significant performance improvement. (Note: -d was not in use.)
Same issue for me. In my case I'm using a command like:
gsutil -m rsync -i -r -y "some-str.*\.gz$" gs://bucket/path s3://bucket/path
It's extremely slow and eventually stalls kube pods. There are ~84k files (really small files, usually no more than 2 KB each), and I can see both high CPU and memory usage:
I transferred 4.7 million files (about 2 TB in total) from a GCS bucket to an S3 bucket.
Then I ran the exact same command again to see how long it would take to sync diffs.
It took about 4 hours to list 700,000 files in the destination bucket.
The logs I am seeing look like this:
From the code, it looks like all that is being done here is iterating over the contents of the destination bucket and checking properties of the objects inside. https://github.com/GoogleCloudPlatform/gsutil/blob/master/gslib/commands/rsync.py#L763
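For context on why the destination phase dominates: rsync-style tools generally enumerate both sides into sorted listings and then merge-walk them, so the entire destination listing is consumed even when the source delta is tiny. A simplified, hypothetical sketch of that comparison step (not gsutil's actual implementation):

```python
def diff_sorted_listings(src, dst):
    """Merge-walk two *sorted* listings of (name, size) pairs.

    Returns source names that are missing or differ at the destination.
    Note the loop still has to consume destination entries to find
    matches -- which is why the 'At destination listing...' phase can
    dominate even for a near-empty source. Simplified sketch only.
    """
    to_copy = []
    si, di = iter(src), iter(dst)
    s, d = next(si, None), next(di, None)
    while s is not None:
        if d is None or s[0] < d[0]:
            to_copy.append(s[0])           # missing at destination
            s = next(si, None)
        elif s[0] > d[0]:
            d = next(di, None)             # extra at destination; skip it
        else:
            if s[1] != d[1]:               # same name, different size
                to_copy.append(s[0])
            s, d = next(si, None), next(di, None)
    return to_copy
```

In the real command, each page of the destination listing is a paginated remote API call, so millions of destination objects mean thousands of sequential list requests regardless of how small the source is.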
Listing the source bucket was quite quick, but listing the destination bucket is too slow for my purposes.
I was not running into any throttling issues on the AWS side, and when running
aws s3 ls <s3_bucket_name>
on the command line, the listing was significantly faster than what the gsutil rsync command was reporting. The command itself is being run on a rather large EC2 instance in the same AWS region as the S3 bucket, with a VPC endpoint configured, so the conditions for interacting with the S3 bucket probably cannot be appreciably improved.
The instance itself has a mostly idle CPU, minimal memory usage, zero disk usage, and is not even close to running out of bandwidth on the NIC.
In the process tree the command looks like this:
(The PID is such a high number because the command is being run in a screen session.)
When running strace on some of the PIDs, I don't have the ability to follow up. Some of the child processes appear to be timing out on network calls repeatedly:
Another process is stuck on a read call:
This child process seems to be the most interesting: using lsof, I can see that it has TCP connections from the EC2 instance to both AWS S3 (ESTABLISHED) and GCP GCS (CLOSE_WAIT). What this suggests to me is that the process has finished interacting with the GCS bucket (for now!) and is only interacting with the S3 bucket. That makes sense, given that the only output I see from the command is
At destination listing x...
I have been notified by AWS support that I am not approaching any rate limits for S3, and the point at which this code is being slow for me is after it has stopped interacting with GCS.
Does anyone have an idea of why this is so slow? If the virtual machine were running out of resources that would be one thing, but it appears to just not be doing much at all when I inspect it on the command line.