GoogleCloudPlatform / gsutil

A command line tool for interacting with cloud storage services.
Apache License 2.0

Invalid Unicode path encountered #221

Open lasselammi opened 10 years ago

lasselammi commented 10 years ago

I'm trying to upload a file with filename "kuormatraktorityön suunnittelun työmalli.pdf" and I get the following error message: "CommandException: Invalid Unicode path encountered". I'm using the most recent version of gsutil. I was able to upload this same file with a previous version of gsutil. Why is this a problem now?

mfschwartz commented 10 years ago

That error happens when the path contains invalid Unicode chars. That filename itself is valid (I just tried creating a file with that name and was able to upload it using gsutil) - what is the full path printed by the gsutil error message?

lasselammi commented 10 years ago

I'm trying to upload a directory. The full error message that I get is below:

CommandException: Invalid Unicode path encountered ('ebd52a6f-8c5a-4921-b8ee-1096ffbc7a15\DB\13caefcd-a135-4bb0-915a-a619afdab5e5_description_\kuormatraktority\xf6n suunnittelunty\xf6malli.pdf'). gsutil cannot proceed with such files present. Please remove or rename this file and try again.

I figured that the problem would be related to the filename, because 'ö' characters are being replaced with '\xf6' for some reason in the error message.

jterrace commented 10 years ago

It looks like your file name is not actually a valid UTF-8 string. Take a look at the chart here: http://en.wikipedia.org/wiki/%C3%96#Codes_for_computing

Your file name is in extended ASCII (Latin-1) format. Bytes of 128 and above, taken on their own, are not valid UTF-8. The '\xf6' is a single byte, byte 246. If the name were encoded as UTF-8, the 'ö' would be two bytes, '\xc3\xb6', bytes 195 and 182.
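
The byte values quoted above can be checked with a few lines of Python 3. This is only an illustrative sketch of the encoding difference, not gsutil's own code; the variable names are made up for the example.

```python
# 'ö' is code point U+00F6; how it is stored depends on the encoding.
name = "ö"

latin1_bytes = name.encode("latin-1")  # one byte: 0xF6 (246)
utf8_bytes = name.encode("utf-8")      # two bytes: 0xC3 0xB6 (195, 182)

print(list(latin1_bytes))
print(list(utf8_bytes))

# A lone 0xF6 byte is not a valid UTF-8 sequence, so decoding it fails.
try:
    latin1_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

print(decoded_ok)
```

A failed decode of this kind is consistent with the '\xf6' escape shown in the error message.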

jterrace commented 10 years ago

You also have this in there: \13, which is a vertical tab. That's valid unicode, but we generally advise not to put control characters in your object names. Any idea how a vertical tab character got in your file name?

michaleczky commented 10 years ago

I have the same problem with the current version: gsutil wasn't able to rsync a directory containing the file 'MEGHÍVÓ.docx', whose name contains valid Unicode characters. (I did not run into this problem with earlier versions.)

I used the following command: c:\Python27\python.exe c:\gsutil\gsutil.py rsync -C c:\Users\AUserFolder\Documents\ gs://a_bucket_name/

And received this message:

CommandException: Invalid Unicode path encountered ('c:\Users\AUserFolder\Documents\MEGH\xcdV\xd3.docx'). gsutil cannot proceed with such files present. Please remove or rename this file and try again.

jterrace commented 10 years ago

That is not a valid UTF-8 string. Your file name is stored as: MEGH\xcdV\xd3.docx

0xcd == 205, 0xd3 == 211

Those are ISO 8859-1-encoded strings (aka latin-1). See:
http://en.wikipedia.org/wiki/%C3%8D#Character_mappings
http://en.wikipedia.org/wiki/%C3%93#Character_mappings

On *nix-based systems, you can use convmv to rename them. Install it with sudo apt-get install convmv. Then rename them like so: convmv -r -f ISO-8859-1 -t UTF-8 . (By default convmv only prints what it would do; add --notest to actually perform the renames.)

I don't know of an equivalent command on Windows.
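
For readers who would rather script the convmv step, here is a rough Python 3 sketch of the same idea. It assumes a POSIX-style file system where names are raw bytes and that every name that is not valid UTF-8 is Latin-1; it is not a Windows solution. The function names latin1_name_to_utf8 and rename_tree are hypothetical, and the default is a dry run that only reports the planned renames.

```python
import os


def latin1_name_to_utf8(name: bytes) -> bytes:
    """Return the UTF-8 form of a name, assuming non-UTF-8 names are Latin-1."""
    try:
        name.decode("utf-8")
        return name  # already valid UTF-8, leave untouched
    except UnicodeDecodeError:
        return name.decode("latin-1").encode("utf-8")


def rename_tree(root: bytes, dry_run: bool = True) -> list:
    """Walk root bottom-up and rename entries whose names are not valid UTF-8.

    Returns the list of (old_path, new_path) pairs that were (or would be) renamed.
    """
    planned = []
    # Bottom-up order renames children before the directories that contain them.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames + dirnames:
            new_name = latin1_name_to_utf8(name)
            if new_name != name:
                src = os.path.join(dirpath, name)
                dst = os.path.join(dirpath, new_name)
                planned.append((src, dst))
                if not dry_run:
                    os.rename(src, dst)
    return planned
```

Calling rename_tree(b".") lists what would change; passing dry_run=False performs the renames, so review the dry-run output and keep a backup first.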

zelon commented 10 years ago

I have the same problem on Windows with Korean file names. I'm trying to modify the https://github.com/GoogleCloudPlatform/gsutil/blob/master/gslib/wildcard_iterator.py file like this: diff

jterrace commented 10 years ago

@zelon what error are you receiving and what types of Korean characters do you have in your filenames?

zelon commented 10 years ago

Sorry for the late reply; I was moving. I'm using Windows 8.1 Pro K with the Korean language, and the error message is:

C:\Users\JINWOOK>gsutil -m cp -n -R P:\MyPictures gs://zelon-backup/
CommandException: Invalid Unicode path encountered ('P:\MyPictures\120707-\xb1\xe8\xb3\xaa\xc0\xb1(\xc1\xdf\xb0\xa3).zip'). gsutil cannot proceed with such files present. Please remove or rename this file and try again.

The file name is : 120707-김나윤(중간).zip

jterrace commented 10 years ago

@zelon - could you paste the output of the following on your machine?

python -c "import locale; print locale.getdefaultlocale()"

zelon commented 10 years ago

Like this:

C:\Users\JINWOOK>python -c "import locale; print locale.getdefaultlocale()"
('ko_KR', 'cp949')

jterrace commented 10 years ago

@zelon - thank you for providing that information, that's great news. Since your locale is set properly, we can decode your file names with the cp949 encoding reported for your ko_KR locale before encoding them as UTF-8.
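
The decode-then-encode step described here can be sketched in Python 3 as follows. This is an illustration under assumptions, not the change that went into gsutil: reencode_name is a hypothetical helper, and the source encoding is passed in explicitly rather than read from the locale.

```python
def reencode_name(raw: bytes, source_encoding: str) -> bytes:
    """Decode raw file-name bytes with the locale's encoding, then encode as UTF-8."""
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


# Bytes of the file name as reported in the error message (no space inside the name).
raw_name = b"120707-\xb1\xe8\xb3\xaa\xc0\xb1(\xc1\xdf\xb0\xa3).zip"

utf8_name = reencode_name(raw_name, "cp949")
print(utf8_name.decode("utf-8"))
```

The same helper could be given any other codec name that Python knows, provided it matches how the bytes were actually written.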

SergeAx commented 8 years ago

@jterrace - I've got the same issue on my Windows machine while trying to use gsutil rsync as a cloud backup solution:

CommandException: Invalid Unicode path encountered
('Z:\\Backup\\neo.sergeax.ru\\IPub\\a.gu.ru\\media\\D__\xcb\xe5\xe2\xe8\xed_rcrc.jpg').

The real file name is "D__Левин_rcrc.jpg" (it's Russian). Here's the output of getdefaultlocale:

C:\gsutil>python -c "import locale; print locale.getdefaultlocale()"
('ru_RU', 'cp1251')

What can I do to rsync this file (and thousands like it) to Google Cloud Storage? Thank you!

jterrace commented 8 years ago

cp1251 is not a supported byte encoding. You'll need to convert your filenames to UTF-8 before uploading.

SergeAx commented 8 years ago

Is that possible at all on NTFS filesystem?

jterrace commented 8 years ago

Yes, NTFS supports unicode. You could try the Windows version of the iconv program: http://gnuwin32.sourceforge.net/packages/libiconv.htm

SergeAx commented 8 years ago

Sorry, this is not obvious to me. I believe it was Plato who said that to google a question one should know half of the answer :) After some research online I still know about 10%, clearly not enough. Can you point me to a manual or article on backing up my non-UTF-8-named files to Google Cloud Storage using gsutil in rsync mode?

Thank you very much anyway.

moander commented 8 years ago

I have the same problem on Windows, but if I mount the same disk on osx using smb it works just fine.

eblazer commented 8 years ago

I too experience this error, but ONLY when executed from crontab. When I manually run the same script from my shell, I don't get this error.

"Starting Google Cloud Storage RSync
Mon Dec 7 10:00:01 PST 2015
Building synchronization state...
Caught non-retryable exception while listing file:///volumeUSB1/usbshare/Test_1: 'ascii' codec can't encode character u'\u2019' in position 81: ordinal not in range(128)
At destination listing 10000..."

eblazer commented 8 years ago

Discovered the issue. Per gsutil help encoding:

Unicode errors for valid Unicode filepaths can be caused by lack of Python locale configuration on Linux and Mac OSes. If your file paths are Unicode and you get encoding errors, ensure the LANG environment variable is set correctly. Typically, the LANG variable should be set to something like "en_US.UTF-8" or "de_DE.UTF-8".

My script being called in the crontab context did not have a LANG env var.
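
A quick way to see what a cron job or container actually inherits is to print the relevant settings from inside that environment. The following Python 3 sketch is only a diagnostic aid written for this thread; lang_looks_utf8 is a hypothetical helper, and its check is a simple heuristic on the LANG value rather than anything gsutil does itself.

```python
import locale
import os
import sys


def lang_looks_utf8(env) -> bool:
    """Return True if the LANG value in env appears to name a UTF-8 locale."""
    lang = env.get("LANG", "")
    # Accept spellings such as "en_US.UTF-8" and "de_DE.utf8".
    return "utf8" in lang.lower().replace("-", "")


if __name__ == "__main__":
    print("LANG:", os.environ.get("LANG"))
    print("preferred encoding:", locale.getpreferredencoding(False))
    print("filesystem encoding:", sys.getfilesystemencoding())
    if not lang_looks_utf8(os.environ):
        print("warning: LANG is unset or does not name a UTF-8 locale", file=sys.stderr)
```

Running it once from an interactive shell and once from the crontab entry should make any difference between the two environments visible.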

fredrikaverpil commented 7 years ago

Wow, how odd. I'm running gsutil rsync from within a centos:7 Docker container.

For some reason, LANG is not set and I get this:

>>> python -c "import locale; print locale.getdefaultlocale()"
(None, None)

Thanks to @eblazer this is now solved, by setting LANG=en_US.UTF-8 👍