GoogleCloudPlatform / gsutil

A command line tool for interacting with cloud storage services.
Apache License 2.0

some directories with brackets don't upload #290

Open lihanli opened 9 years ago

lihanli commented 9 years ago

I'm getting a "CommandException: No URLs matched:" message when trying to upload a directory that has brackets in the name.

$ mkdir 'hello [foo]'
$ touch hello\ \[foo\]/bar
$ gsutil cp -r 'hello [foo]' gs://test5903495034959
CommandException: No URLs matched: hello [foo]
$ mv hello\ \[foo\] hello
$ gsutil cp -r 'hello' gs://test5903495034959
Copying file://hello/bar [Content-Type=application/octet-stream]...
mfschwartz commented 9 years ago

At present gsutil doesn't support object names that contain wildcard characters. A possible solution is discussed at https://github.com/GoogleCloudPlatform/gsutil/issues/291, but (a) note that that solution would provide only limited support (e.g., you wouldn't be able to use wildcards with such object names), and (b) we don't currently have plans to implement this support.

Is it possible for you to avoid using wildcard characters in object names? Even if we implemented the above proposed solution, I think you will find things are more complicated and difficult to use in raw mode, and that it would work better to avoid using wildcard characters in object names.

lihanli commented 9 years ago

I'll just use the web uploader if it can't be done, thanks

picklit commented 7 years ago

When using gsutil for backup it isn't always possible to limit the characters used in file names beyond what the filesystem itself allows. [ and ] are valid filename characters, yet gsutil rejects them as wildcards?

thobrla commented 7 years ago

Yes, this is a limitation/bug in gsutil's current implementation, which doesn't do a good job of distinguishing object/filenames retrieved as strings through enumeration from potentially-wildcarded strings provided as command-line arguments.

Even if we fixed this limitation, the only way to reference such names would be indirectly, via wildcards. This is because gsutil does not support a "no-wildcards" mode or any kind of escaping, so there would be no way to specify such a name directly on the command line. For that reason, we recommend renaming files that contain these characters if you are going to use gsutil.
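For anyone who needs to do that renaming in bulk, a rough sketch follows. This is not anything gsutil provides; the helper names and the choice of replacement character are my own assumptions:

```python
import os
import re

# The characters gsutil treats as wildcards: * ? [ ]
WILDCARDS = re.compile(r'[*?\[\]]')

def deglob(name, replacement='_'):
    """Return a copy of name with wildcard characters replaced."""
    return WILDCARDS.sub(replacement, name)

def deglob_tree(root):
    """Rename every file and directory under root so no name contains
    a wildcard character. Walks bottom-up so renaming a directory
    never invalidates the paths of entries still to be visited."""
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for entry in filenames + dirnames:
            clean = deglob(entry)
            if clean != entry:
                os.rename(os.path.join(dirpath, entry),
                          os.path.join(dirpath, clean))
```

After running `deglob_tree('.')`, the tree should upload cleanly with `gsutil cp -r`.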

I'm leaving this issue open to track the limitation in case we (or an open-source contributor) decide to fix it.

thobrla commented 7 years ago

Some additional discussion from https://github.com/GoogleCloudPlatform/gsutil/issues/405 - the fix for the limitation is described by the TODO in the code referenced there:

# TODO: Disambiguate user-supplied strings from iterated
# prefix and object names so that we can better reason
# about wildcards and handle this case without raising an error.

However, this fix still would not allow gsutil to accept such files/objects as command-line arguments without either a raw mode as described in https://github.com/GoogleCloudPlatform/gsutil/issues/220 or the addition of escape characters, which has backwards-compatibility implications.

houglum commented 7 years ago

Adding a Stack Overflow question where a user ran into this for tracking purposes: http://stackoverflow.com/questions/42087510/gsutil-ls-returns-error-contains-wildcard

fredrikaverpil commented 7 years ago

What other characters in a filename would trigger this issue, where a file/folder isn't being processed?

It's a bit unclear to me when or how gsutil determines that there's a wildcard in a filename. For example, "-" (dash) could be considered a wildcard character, yet filenames with dashes work fine with gsutil.

houglum commented 7 years ago

It looks like our wildcard regex is here: https://github.com/GoogleCloudPlatform/gsutil/blob/e76b3c113194f3e17a89e98360f02f43fa63641e/gslib/storage_url.py#L38

Which includes the question mark, asterisk, and left and right square bracket characters.
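For reference, a minimal sketch of that check in Python — the character class mirrors the regex linked above, but the helper name here is my own:

```python
import re

# Mirrors the character class in gslib/storage_url.py: question mark,
# asterisk, and the left/right square brackets. Dashes are not in the
# class, which is why filenames containing "-" work fine.
WILDCARD_REGEX = re.compile(r'[*?\[\]]')

def contains_wildcard(url_string):
    """Return True if gsutil would treat url_string as wildcarded."""
    return bool(WILDCARD_REGEX.search(url_string))
```

So `contains_wildcard('hello [foo]')` is true, while `contains_wildcard('hello-foo')` is false.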


fredrikaverpil commented 7 years ago

@houglum thanks.

This seems limited to folders only, right? I can clearly see I've been able to upload files with brackets using gsutil rsync. So I'm assuming this issue is limited to just some aspects of gsutil?

fredrikaverpil commented 7 years ago

So, I'm quite confused now. It's not clear to me exactly when I should expect bucket uploads to skip folders with brackets in their names. That's worrying, because I can't tell when or why files may fail to end up in the bucket (and it is very important that they do).

I just tried uploading a folder with brackets in its name:

gsutil -m rsync -r -e -c . gs://BUCKET/PATH_TO/src/
Building synchronization state...
Starting synchronization
Copying file://./[knob projVar.shot]_renderCam_masterLayer/foo.txt [Content-Type=text/plain]...
/ [1/1 files][    0.0 B/    0.0 B]
Operation completed over 1 objects.

...which worked without issues. Then I attempted to list the contents of the parent folder:

gsutil ls -lR gs://BUCKET/PATH_TO/src/
gs://BUCKET/PATH_TO/src/:
CommandException: Cloud folder gs://BUCKET/PATH_TO/src/[knob projVar.shot]_renderCam_masterLayer/ contains a wildcard; gsutil does not currently support objects with wildcards in their name.

So essentially, in this case, I am able to upload folders with brackets. But I can't list them?

I then wanted to make sure I could download the folder and its contents:

gsutil -m rsync -r -e -c gs://BUCKET/PATH_TO/src/ /home/fredrik/downloadtest/
Building synchronization state...
Starting synchronization
...
Copying gs://BUCKET/PATH_TO/src/[knob projVar.shot]_renderCam_masterLayer/foo.txt...
...

Downloading worked fine.

So does this mean that this issue is limited to just certain aspects of gsutil, such as cp and ls? It would be useful to be able to work out these limitations in detail.

o6uoq commented 7 years ago

@fredrikaverpil in my experience, I found that rsync would break all the things but commands like cp worked fine.

Agreed; an explanation of why some commands have an issue and others do not would be helpful.

houglum commented 7 years ago

Long-ish explanation:

This looks to be due to how we handle recursive listing via the _RecurseExpandUrlAndPrint method in the LsHelper class:

We use a WildcardIterator to iterate over all the objects at a given "level" (levels being delimited by /), and if an item's name ends with /, we recursively attempt the same expansion for it. In this case the item's URL is gs://BUCKET/PATH_TO/src/, and we create a URL to expand by simply appending an asterisk, yielding gs://BUCKET/PATH_TO/src/*. Repeating the process, we get gs://BUCKET/PATH_TO/src/[knob projVar.shot]_renderCam_masterLayer/.

When deciding whether to expand this URL the same way, we see that the prefix, src/[knob projVar.shot]_renderCam_masterLayer/, contains a wildcard. Herein lies the problem the TODO refers to: differentiating user-supplied URLs from programmatically/recursively generated ones.
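To make the failure mode concrete, here is a toy model of one step of that recursion. It is purely illustrative — the function and variable names are my own; only the wildcard character class is taken from gsutil's source:

```python
import re

# Character class from gslib/storage_url.py.
WILDCARD_RE = re.compile(r'[*?\[\]]')

def expand_level(base_url, prefixes):
    """Toy model of one recursive ls step: every prefix discovered by
    enumeration is re-checked as though it were a user-supplied wildcard
    URL, which is exactly where bracketed names blow up."""
    next_urls = []
    for prefix in prefixes:
        if WILDCARD_RE.search(prefix):
            raise ValueError(
                'Cloud folder %s%s contains a wildcard' % (base_url, prefix))
        # Append '*' to list the next level, as gsutil does.
        next_urls.append(base_url + prefix + '*')
    return next_urls
```

An ordinary prefix like `src/` expands fine, while `[knob projVar.shot]_renderCam_masterLayer/` raises at the re-check even though it was produced by enumeration, not typed by the user.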

Possible workaround: If you're okay with receiving a simple flat listing of gs://BUCKET using the prefix PATH_TO/src/, you can take advantage of gsutil's "recursive" glob support:

gsutil ls gs://BUCKET/PATH_TO/src/**

as long as the prefix given doesn't contain a wildcard; e.g. this wouldn't be valid:

gsutil ls gs://BUCKET/PATH_TO/src/[knob\ projVar.shot]_renderCam_masterLayer/**

fredrikaverpil commented 7 years ago

Thanks @houglum for this explanation. Your suggested approach to use gsutil ls gs://BUCKET/PATH_TO/src/* works for me.

But I still find it very hard to foresee the exact behavior of e.g. gsutil cp vs gsutil rsync now, since they don't seem to both follow the same rules.

fredrikaverpil commented 7 years ago

I do feel I have hijacked this issue from @lihanli (sorry), which was about gsutil cp. Hey @lihanli, perhaps you can use gsutil rsync instead of gsutil cp?

devsterj commented 6 years ago

Is there no way to use gsutil cp to copy a file with square brackets in its name to a GCS bucket?

The command:

gsutil cp 'filename [with brackets]' 'gs://bucket/filename [with brackets]'

...seems to fail with a "Destination ... must match exactly 1 URL" error as of 2018/09. The gsutil help documentation could at least have a note about the quirky behaviour.

mfschwartz commented 6 years ago

As comments earlier in this thread noted, gsutil currently doesn't have a way to specify an object name containing wildcard characters - it interprets those characters as wildcards instead of treating them as the literal object name. We have talked about implementing a 'raw mode' that would make gsutil treat characters literally, but at present there is no such support, and there are no plans to add such support.
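One workaround that sidesteps gsutil entirely: the underlying Cloud Storage JSON API treats object names as literal strings, so the google-cloud-storage Python client can upload such files as-is. This is a sketch under the assumption that the client library is installed and credentials are configured; the function name is my own:

```python
def upload_literal(bucket_name, local_path, object_name):
    """Upload local_path to bucket_name as object_name, taken literally.

    No wildcard interpretation happens anywhere: the object name is
    passed straight through to the API.
    """
    # Imported lazily so the sketch can be read without the library
    # installed (pip install google-cloud-storage).
    from google.cloud import storage
    client = storage.Client()
    client.bucket(bucket_name).blob(object_name).upload_from_filename(local_path)

# Hypothetical usage:
# upload_literal('my-bucket', 'filename [with brackets]',
#                'filename [with brackets]')
```

The same pattern (a `Blob` addressed by its exact name) also works for downloads and deletes, since no listing or wildcard expansion is involved.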

sharhalakis commented 6 years ago

FWIW, "gsutil du -s gs://bucket/" also fails to produce a sum if the bucket contains an object whose name contains a wildcard character.

mschristianjohansson commented 3 months ago

I think any command that requires listing the objects in a bucket will fail when even a single object name contains one of the symbols *, ?, [, or ]. It would be nice if unsupported objects could simply be ignored.

So my command:

gsutil cp -A "gs://MASKED/*/MASKED" .  

Fails with:

CommandException: Cloud folder gs://MASKED/MASKED/ contains a wildcard; gsutil does not currently support objects with wildcards in their name.

Not a single file downloads, even though there are thousands whose names contain no unsupported characters.
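Until such an ignore option exists, one workaround is to do the listing yourself (through an API client rather than gsutil) and hand only the safe names to a copy routine. The helper below is illustrative and the name is my own; only the character class comes from gsutil's source:

```python
import re

# The characters gsutil treats as wildcards.
WILDCARDS = re.compile(r'[*?\[\]]')

def safe_names(names):
    """Split an object listing into (copyable, skipped) by gsutil's
    wildcard rules, so the copyable set can be downloaded per-object
    without any listing-time CommandException."""
    ok = [n for n in names if not WILDCARDS.search(n)]
    skipped = [n for n in names if WILDCARDS.search(n)]
    return ok, skipped
```

The skipped list can then be handled separately, e.g. fetched by exact name through an API client, which does not interpret wildcards.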