GoogleCloudPlatform / gsutil

A command line tool for interacting with cloud storage services.
Apache License 2.0

cannot mv non-ascii path #244

Open qrtt1 opened 9 years ago

qrtt1 commented 9 years ago
(pyenv)[gcp@instance ~]$ gsutil mv 'gs://foo-videos/最孤單的人 - 廖文強與壞神經樂團 (Official Music Video).mp4' gs://foo-videos/mv.mp4
Failure: 'ascii' codec can't encode characters in position 16-20: ordinal not in range(128).
jterrace commented 9 years ago

This should be working. Could you post the output of gsutil version -l and also re-run that gsutil command with gsutil -d to show the stack trace?

qrtt1 commented 9 years ago
gsutil version: 4.7
checksum: 72839382f796ff3865e757959eed802f (OK)
boto version: 2.30.0
python version: 2.7.4 (default, Apr 18 2013, 00:07:37) [GCC 4.6.2 20111027 (Red Hat 4.6.2-2)]
OS: Linux 3.14.19-17.43.amzn1.x86_64
multiprocessing available: True
using cloud sdk: True
config path: /home/gcp/.config/gcloud/legacy_credentials/chingyichan.tw@gmail.com/.boto
gsutil path: /home/gcp/google-cloud-sdk/platform/gsutil/gsutil
compiled crcmod: False
installed via package manager: False
editable install: False

debug message:

***************************** WARNING *****************************
*** You are running gsutil with debug output enabled.
*** Be aware that debug output includes authentication credentials.
*** Make sure to remove the value of the Authorization header for
*** each HTTP request printed to the console prior to posting to
*** a public medium such as a forum post or Stack Overflow.
***************************** WARNING *****************************
gsutil version: 4.7
checksum: 72839382f796ff3865e757959eed802f (OK)
boto version: 2.30.0
python version: 2.7.4 (default, Apr 18 2013, 00:07:37) [GCC 4.6.2 20111027 (Red Hat 4.6.2-2)]
OS: Linux 3.14.19-17.43.amzn1.x86_64
multiprocessing available: True
using cloud sdk: True
config path: /home/gcp/.config/gcloud/legacy_credentials/chingyichan.tw@gmail.com/.boto
gsutil path: /home/gcp/google-cloud-sdk/platform/gsutil/gsutil
compiled crcmod: False
installed via package manager: False
editable install: False
Command being run: /home/gcp/google-cloud-sdk/platform/gsutil/gsutil -o GSUtil:default_project_id=iddsdkvip -d mv gs://foo-videos/最孤單的人 - 廖文強與壞神經樂團 (Official Music Video).mp4 gs://foo-videos/mv.mp4
config_file_list: ['/home/gcp/.config/gcloud/legacy_credentials/chingyichan.tw@gmail.com/.boto']
config: [('debug', '0'), ('working_dir', '/mnt/pyami'), ('https_validate_certificates', 'true'), ('debug', '0'), ('working_dir', '/mnt/pyami'), ('default_project_id', 'iddsdkvip')]
DEBUG: Exception stack trace:
    Traceback (most recent call last):
      File "/home/gcp/google-cloud-sdk/platform/gsutil/gslib/__main__.py", line 469, in _RunNamedCommandAndHandleExceptions
        debug_level, parallel_operations)
      File "/home/gcp/google-cloud-sdk/platform/gsutil/gslib/command_runner.py", line 263, in RunNamedCommand
        return_code = command_inst.RunCommand()
      File "/home/gcp/google-cloud-sdk/platform/gsutil/gslib/commands/mv.py", line 149, in RunCommand
        self.debug, self.parallel_operations)
      File "/home/gcp/google-cloud-sdk/platform/gsutil/gslib/command_runner.py", line 256, in RunNamedCommand
        args = HandleArgCoding(args)
      File "/home/gcp/google-cloud-sdk/platform/gsutil/gslib/command_runner.py", line 77, in HandleArgCoding
        decoded = arg.decode(UTF8)
      File "/home/gcp/pyenv/lib/python2.7/encodings/utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 16-20: ordinal not in range(128)
jterrace commented 9 years ago

Hmm, so I can't reproduce your error. I copied a local file to your exact object name above, then copied it back to my local disk using your same command.

I wonder if the command line you're pasting above is being converted from a different character encoding to valid UTF8 when pasting to your browser. Could you provide the output of the locale command?

qrtt1 commented 9 years ago

Hello, I tried the copy command and it works correctly. However, the move command fails:

(pyenv)[gcp@deploy ~]$ gsutil mv gs://muzee-vips/最孤單的人\ -\ 廖文強與壞神經樂團\ \(Official\ Music\ Video\).2.mp4  gs://muzee-vips/最孤單的人\ -\ 廖文強與壞神經樂團\ \(Official\ Music\ Video\).23.mp4
Failure: 'ascii' codec can't encode characters in position 16-20: ordinal not in range(128).
(pyenv)[gcp@deploy ~]$ gsutil cp gs://muzee-vips/最孤單的人\ -\ 廖文強與壞神經樂團\ \(Official\ Music\ Video\).2.mp4  gs://muzee-vips/最孤單的人\ -\ 廖文強與壞神經樂團\ \(Official\ Music\ Video\).23.mp4
Copying gs://muzee-vips/最孤單的人 - 廖文強與壞神經樂團 (Official Music Video).2.mp4 [Content-Type=video/mp4]...
qrtt1 commented 9 years ago

I found that mv invokes cp internally, so the args get decoded to unicode more than once:

diff --git a/platform/gsutil/gslib/command_runner.py b/platform/gsutil/gslib/command_runner.py
index 5f62b1f..ae7f829 100755
--- a/platform/gsutil/gslib/command_runner.py
+++ b/platform/gsutil/gslib/command_runner.py
@@ -74,7 +74,13 @@ def HandleArgCoding(args):
   processing_header = False
   for i in range(len(args)):
     arg = args[i]
-    decoded = arg.decode(UTF8)
+
+    # Don't decode the unicode string twice
+    if not isinstance(arg, unicode):
+      decoded = arg.decode(UTF8)
+    else:
+      decoded = arg
+
     if processing_header:
       if arg.lower().startswith('x-goog-meta'):
         args[i] = decoded
diff --git a/platform/gsutil/gslib/command_run

I added the unicode check and it works :P
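
A note on why a decode call reports an encode error: in Python 2, calling .decode() on an object that is already unicode first converts it back to bytes with the default ASCII codec, and that implicit step is what fails on non-ASCII characters. The sketch below shows the same guard as the patch, written for Python 3 where the two types are bytes and str. The function name decode_once and the sample argument are made up for illustration and are not part of gsutil.

```python
def decode_once(arg, encoding="utf-8"):
    """Return text for arg, decoding only when arg is still raw bytes."""
    if isinstance(arg, bytes):
        return arg.decode(encoding)
    return arg  # already text, so a second decode would be the bug


# Hypothetical argument, shortened from the object name in the report.
raw = "gs://foo-videos/最孤單的人.mp4".encode("utf-8")
once = decode_once(raw)
twice = decode_once(once)  # second pass through the guard is a no-op

# Emulate the implicit ASCII encode that Python 2 ran before decoding again.
try:
    once.encode("ascii")
    ascii_error_start = None
except UnicodeEncodeError as err:
    ascii_error_start = err.start  # index of the first non-ASCII character
```

With this guard, passing the arguments through the decoding step a second time (as mv does when it calls cp) leaves them unchanged.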

jterrace commented 9 years ago

Ah, I missed that in the stack trace. Thanks for tracking it down! We'll get a fix out ASAP.

thobrla commented 9 years ago

Fixed for 4.8 with https://github.com/GoogleCloudPlatform/gsutil/commit/e324f162e48e089a075bc5d832df234acbfee59c.

paskal commented 9 years ago
[~]$ gsutil --version
gsutil version: 4.13

Command: gsutil -d -m rsync -r -x (log_to_sync)/ /destination_folder gs://bucket_name/destination_folder
Debug log: http://pastebin.com/VziY121J
Error raised on folder named düsseldorf
Error itself:

Caught non-retryable exception while listing file:///destination_folder: 'ascii' codec can't encode character u'\xfc' in position 56: ordinal not in range(128)
DEBUG: Exception stack trace:
    Traceback (most recent call last):
      File "/usr/lib/python2.7/dist-packages/gslib/__main__.py", line 524, in _RunNamedCommandAndHandleExceptions
        debug_level, parallel_operations)
      File "/usr/lib/python2.7/dist-packages/gslib/command_runner.py", line 277, in RunNamedCommand
        return_code = command_inst.RunCommand()
      File "/usr/lib/python2.7/dist-packages/gslib/commands/rsync.py", line 971, in RunCommand
        diff_iterator = _DiffIterator(self, src_url, dst_url)
      File "/usr/lib/python2.7/dist-packages/gslib/commands/rsync.py", line 674, in __init__
        raise CommandException('Caught non-retryable exception - aborting rsync')
    CommandException: CommandException: Caught non-retryable exception - aborting rsync   

Please tell me if any additional info is needed.

mfschwartz commented 9 years ago

@paskal - I'm unable to repro the problem you reported. I ran the exact same command you did (but using my own bucket) and it succeeded. Can you provide a listing of the objects in the source dir and destination bucket from before you run the gsutil rsync command? If you'd rather not post the list on the public forum, please email it to me at gs-team@google.com.

paskal commented 9 years ago

I sent an email with additional info to gs-team@google.com, with the subject "gsutil bug #244 - can't rsync non-ascii folder". Thanks for the rapid response. Also:

[~]#  gsutil version -l
gsutil version: 4.13
checksum: PACKAGED_GSUTIL_INSTALLS_DO_NOT_HAVE_CHECKSUMS (!= 141a3e09b42e1b0b6033108aa24c2286)
boto version: 2.38.0
python version: 2.7.3 (default, Feb 27 2014, 19:58:35) [GCC 4.6.3]
OS: Linux 3.8.0-35-generic
multiprocessing available: True
using cloud sdk: False
config path: /root/.boto
gsutil path: /usr/bin/gsutil
compiled crcmod: True
installed via package manager: True
editable install: False
paskal commented 9 years ago

Thanks a lot for the help. It turned out to be the locale settings, which were all set to C:

root@my_server:~# env|grep -E '(LC|LANG)'
LC_ALL=C
LANG=C
LANGUAGE=C
root@my_server:~# /usr/bin/gsutil -m rsync -r /hosted/aaa/images/ gs://bucket_name/hosted/aaa/images/
Building synchronization state...
Caught non-retryable exception while listing file:///hosted/aaa/images/: 'ascii' codec can't encode character u'\xfc' in position 56: ordinal not in range(128)
CommandException: Caught non-retryable exception - aborting rsync
Caught ^C - exiting
root@my_server:~# export LC_ALL=en_US.UTF-8
root@my_server:~# /usr/bin/gsutil -m rsync -r /hosted/aaa/images/ gs://bucket_name/hosted/aaa/images/
Building synchronization state...
Starting synchronization
Copying file:///hosted/aaa/images/athens/file_to_sync.jpg [Content-Type=image/jpeg]...

So, if you're getting "codec can't encode character * in position *: ordinal not in range(128)", check your locale settings, and if they're unset or set to C, add export LC_ALL=en_US.UTF-8 to your .bashrc file. Another way to do that from inside Python:

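from os import environ
from subprocess import check_call

# Placeholder values; substitute your own folder and bucket.
folder, bucket = '/hosted/aaa/images/', 'gs://bucket_name'
# Pass the arguments as a list so paths containing spaces stay intact.
command = ['gsutil', '-m', 'rsync', '-r', folder, bucket + folder]
check_call(command, env=dict(environ, LC_ALL='en_US.UTF-8'))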
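
The locale check described above can also be done programmatically before launching gsutil. This is a minimal sketch, assuming the usual POSIX precedence (LC_ALL, then LC_CTYPE, then LANG). The helper names effective_ctype_locale and is_utf8_locale are hypothetical and not part of gsutil.

```python
import os


def effective_ctype_locale(env):
    """Return the locale that governs character encoding, per POSIX precedence."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:  # unset or empty variables are skipped
            return value
    return "C"  # POSIX default when nothing is set


def is_utf8_locale(env):
    """True when the effective locale names a UTF-8 codeset."""
    codeset = effective_ctype_locale(env).partition(".")[2]
    codeset = codeset.partition("@")[0]  # drop a modifier such as @euro
    return codeset.lower().replace("-", "") == "utf8"


# Report on the current process environment.
print(effective_ctype_locale(os.environ), is_utf8_locale(os.environ))
```

For the environment shown in the failing run (LC_ALL=C, LANG=C), is_utf8_locale returns False. After export LC_ALL=en_US.UTF-8 it returns True.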

Feel free to close this one.

eyalfink commented 8 years ago

I'm still seeing this problem when using gsutil cp. I'm feeding gsutil its own output with -I:

gsutil ls gs://non-ascii-bug|gsutil cp -I gs://ywz-tmp

I'm getting

ValidationError: Field object encountered non-ASCII string 'File 3D_09 - MANOPOLE_09_06.1-PVZ_7317125_PVZ33x123_\xd0\xab20.stl': 'ascii' codec can't decode byte 0xd0 in position 52: ordinal not in range(128)

(You can try the above yourself.) I'm using gsutil version 4.15 and ran

export LC_ALL=en_US.UTF-8

I see this bug is still open - is it supposed to be fixed?
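
The codec error in that ValidationError can be reproduced in isolation. This is only a sketch of the decoding failure, not of gsutil's code path, and it assumes nothing beyond the byte string quoted in the message. The bytes \xd0\xab are valid UTF-8 for a single Cyrillic character, but the ASCII codec rejects the first of them, at the same position 52 that the message reports.

```python
# Object name bytes exactly as quoted in the error message above.
name = b"File 3D_09 - MANOPOLE_09_06.1-PVZ_7317125_PVZ33x123_\xd0\xab20.stl"

try:
    name.decode("ascii")
    failed_at = None
except UnicodeDecodeError as err:
    failed_at = err.start  # index of the first byte ASCII cannot decode

# The same bytes decode cleanly once UTF-8 is used instead of ASCII.
decoded = name.decode("utf-8")
print(failed_at, len(name), len(decoded))
```

The decode error (rather than encode error) suggests a byte string reaching a step that expects ASCII-only input, which is a different direction from the double decode fixed in 4.8.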