GoogleCloudPlatform / gsutil

A command line tool for interacting with cloud storage services.
Apache License 2.0

gsutil cp gs://source_file s3://destination_file does not work on compose-produced files #569

Open vic4hub opened 6 years ago

vic4hub commented 6 years ago

When copying file(s) from Google Cloud Storage directly to Amazon S3, the upload stalls indefinitely with no sign of life...

I am on gsutil version 4.34, both on a Mac and on a Google Debian GNU/Linux 9 (stretch) Compute Engine VM; this issue has been present since at least 4.31.

I am trying to upload files that are ~300 MB. This does not seem to work via rsync either.

Please advise.

houglum commented 6 years ago

Hm. I tried to reproduce this on a VM with the same image (Debian 9, with 4 cores and 15 GB of RAM), and I noticed that the progress spinner doesn't update itself (it stays at [0 MiB / <total> MiB]), but the copy eventually does finish (after ~10 sec). To verify that progress was actually being made, I ran the cp command with -D (e.g. gsutil -D cp ...) and saw the successive chunks being downloaded from the gs:// file. Try doing that and see whether you're getting retryable errors from either the gs:// download or the s3:// upload.

vic4hub commented 6 years ago

Hey Matt, the debug log gets stuck here:

DEBUG 1013 23:27:19.062789 https_connection.py] wrapping ssl socket; CA certificate file=/usr/lib/google-cloud-sdk/platform/gsutil/third_party/boto/boto/cacerts/cacerts.txt
DEBUG 1013 23:27:19.077381 https_connection.py] validating server certificate: hostname=krux-partners.s3.amazonaws.com, certificate hosts=['*.s3.amazonaws.com', 's3.amazonaws.com']

I can provide the full log if need be. What is odd is that uploading a local file to S3 goes through immediately (as opposed to copying directly from Google Storage).

houglum commented 6 years ago

Maybe a silly question, but have you also verified that downloading the GCS object to a local file succeeds?

Behind the scenes, we do a "daisy chain" for cross-provider transfers. That is, for a GCS -> S3 transfer, we start downloading the GCS object to the client machine and pipe those bytes through, using them as the source "file" for the S3 upload. So if both a GCS -> local-file download and a local-file -> S3 upload work, the GCS -> S3 transfer should succeed as well (in theory, anyway).
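
For illustration, each half of the transfer can be tested separately, and the daisy chain can be approximated by hand with a shell pipe. The bucket and object names below are placeholders, and the pipe is only conceptually equivalent to what gsutil does internally, not the literal implementation:

# Check each half of the transfer independently:
gsutil cp gs://<GCS_BUCKET_NAME>/deleteme-300mb.txt /tmp/deleteme-300mb.txt
gsutil cp /tmp/deleteme-300mb.txt s3://<S3_BUCKET_NAME>/deleteme-300mb.txt

# Roughly the daisy-chain idea as a pipe: stream the GCS object and feed it
# straight into the S3 upload without writing a local copy.
gsutil cat gs://<GCS_BUCKET_NAME>/deleteme-300mb.txt \
  | gsutil cp - s3://<S3_BUCKET_NAME>/deleteme-300mb.txt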

vic4hub commented 6 years ago

I have used exactly that as a workaround for a while, but alas, no luck without downloading the files locally. Also, thank you for looking into this.

houglum commented 6 years ago

That's an odd place for the output to stop. I would at least expect to see a log entry for the HTTP call corresponding to the first bucket listing operation. When I run this command (with the bucket names redacted), I see the point in the logs you're talking about, but it doesn't pause there for me... it lists the keys/objects in the bucket under the specified prefix and then performs the copy:

$ gsutil -D cp gs://<GCS_BUCKET_NAME>/deleteme-300mb.txt s3://<S3_BUCKET_NAME>/deleteme-300mb.txt
...
DEBUG 1015 23:54:11.255021 https_connection.py] wrapping ssl socket; CA certificate file=/usr/lib/google-cloud-sdk/platform/gsutil/third_party/boto/boto/cacerts/cacerts.txt
DEBUG 1015 23:54:11.332671 https_connection.py] validating server certificate: hostname=<S3_BUCKET_NAME>.s3.amazonaws.com, certificate hosts=['*.s3.amazonaws.com', 's3.amazonaws.com']
send: u'GET /?delimiter=/&prefix=deleteme-300mb.txt HTTP/1.1\r\nHost: S3_BUCKET_NAME.s3.amazonaws.com\r\nAccept-Encoding: identity\r\nDate: Mon, 15 Oct 2018 23:54:11 GMT\r\nContent-Length: 0\r\nAuthorization: AWS <SIGNATURE>=\r\nUser-Agent: Boto/2.48.0 Python/2.7.13 Linux/4.9.0-7-amd64 gsutil/4.34 (linux2) google-cloud-sdk/220.0.0 analytics/disabled\r\n\r\n'
...

I'm only trying to transfer one object here, and I've tried with and without the top-level -m flag -- both worked and showed progress in the debug logs. Are you trying to sync/copy an entire bucket or directory of objects? If so, could you let me know if this still fails when you try copying 1 individual object?

vic4hub commented 6 years ago

Never mind - it seems instead to be a GCS permissions-related issue, a pretty subtle one. Still investigating...

vic4hub commented 6 years ago

For anyone who runs into this issue, here is a bash workaround:

for f in $(gsutil ls gs://some_gs_bucket/folder); do gsutil cat ${f} | gsutil cp -n - "s3://some_s3_bucket/folder/${f##*/}"; done
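
The same loop, spelled out with comments (bucket and folder names are the placeholders from the one-liner above):

# List every object under the source folder, then stream each one through
# stdin into an S3 upload. -n ("no-clobber") skips objects that already
# exist at the destination. Note: this simple loop assumes object names
# contain no whitespace.
gsutil ls gs://some_gs_bucket/folder | while read -r f; do
  gsutil cat "${f}" | gsutil cp -n - "s3://some_s3_bucket/folder/${f##*/}"
done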

vic4hub commented 5 years ago

Another bit of detail: this permission issue seems to happen only on files that are the product of gsutil compose.
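
For context, a composite object is produced by server-side concatenation of existing objects, e.g. (a sketch with placeholder names):

# Concatenate existing objects server-side into a composite object.
# Composite objects carry a crc32c checksum but no MD5 hash, which may be
# relevant to how they behave during cross-provider copies.
gsutil compose gs://some_gs_bucket/part-1 gs://some_gs_bucket/part-2 \
  gs://some_gs_bucket/combined.txt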

mkalmanson commented 5 years ago

I believe I've tracked down the source of this issue. Just to confirm, can you please remove "use-sigv4 = True" from the [s3] section of ~/.boto and see if that fixes the problem?
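
For reference, if that setting is present it would appear in ~/.boto as below; deleting or commenting out the line is enough to test:

[s3]
# use-sigv4 = True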

vic4hub commented 5 years ago

Hey @mkalmanson, I do not have ~/.boto configured. I am using a standard Google VM with .aws/credentials containing only the variables below:

[default]
aws_access_key_id = BLAHBLAHBLAH
aws_secret_access_key = BLAHBLAHBLAHBLAHBLAHBLAH
region=us-east-1
output=table

tfpereira commented 3 years ago

Was this ever figured out? I'm having a similar issue. I'm trying to migrate some data between GCS and S3 (around 200k files): running a gsutil cp from GCS to my VM and then another gsutil cp from the VM to S3 works, but if I run gsutil cp gs:// s3:// directly, it just hangs. If I run it in debug mode I see a lot of

DEBUG 0510 09:52:01.297981 connection.py] encountered BrokenPipeError exception, reconnecting

during what looks like the upload to S3.

mshytikov commented 2 years ago

After trying almost everything, the only thing that helped me avoid

encountered BrokenPipeError exception, reconnecting

was to set the correct S3 host, for example:

gsutil -m -o 's3:host=s3.eu-west-1.amazonaws.com' rsync -r ...

and it also works without -m
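
The same override can also be made persistent in the boto config rather than passed with -o on each invocation (using the eu-west-1 endpoint from the example above; substitute your bucket's region):

# In ~/.boto (or the file pointed to by BOTO_CONFIG):
[s3]
host = s3.eu-west-1.amazonaws.com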

MartinUQ commented 1 year ago

OK, this was so annoying, but I used gsutil to copy to local storage and then went from local to AWS via the aws s3 command. Copying directly from Cloud Storage to S3 didn't work.