GoogleCloudPlatform / gsutil

A command line tool for interacting with cloud storage services.

implement gsutil raw mode #220

Open mfschwartz opened 10 years ago

mfschwartz commented 10 years ago

gsutil parses URIs on the command line for wildcards and version numbers. Because GCS doesn't reserve the wildcard and version delimiter chars from the object name charset, it's possible to construct URIs that don't work well with gsutil. For example, if you named an object "bucket/object**27", you couldn't use gsutil with this object name because gsutil would interpret the URI gs://bucket/object**27 as a wildcarding request, and match other object names in addition to this one. Similarly, if you create an object called "bucket/object#1360101917269000.2", attempting to use the URI gs://bucket/object#1360101917269000.2 with gsutil would cause gsutil to interpret it as the object gs://bucket/object with generation 1360101917269000 and metageneration 2.
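To make that concrete, here is an illustration of how the current parsing behaves (the bucket name is hypothetical; these commands show today's interpretation, not a proposed raw mode):

gsutil ls gs://my-bucket/object**27
(interpreted as a wildcard listing: it matches any object whose name starts with "object" and ends with "27", not just the object literally named "object**27")

gsutil cat gs://my-bucket/object#1360101917269000.2
(interpreted as the object named "object" at generation 1360101917269000, metageneration 2, rather than an object literally named "object#1360101917269000.2")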

To work around cases where this causes a problem we could implement a gsutil -r (top-level) command line option, that tells gsutil to use "raw" URI processing -- i.e., not to interpret any of the wildcard and version delimiter chars, and instead just to build object names literally from the passed string. gsutil couldn't support wildcarding or version-specific handling features with this mode, but at least a user could copy such files individually to/from GCS.

fredrikaverpil commented 7 years ago

I've been storing all of our projects in this form: gs://bucket/proj1, incrementing them like so: proj2, proj3, and so on. We have hundreds of these projects and I've successfully operated on them using gsutil rsync as well as gsutil ls. In which cases should I expect issues to arise?

To me it sounds like rsyncing proj1 would also rsync proj100. Is this correct? If so, is there any kind of workaround available to avoid rsyncing proj100 if my intention is to only rsync proj1?

EDIT: I just tried this, and this command only downloads proj1 and not proj100 (which also exists on the bucket):

gsutil -m rsync -r gs://ir-projects/proj1 .

This makes me confused about what you wrote regarding the bucket/object**27 example.

Would it be possible to announce here when any such raw mode option becomes available in a gsutil beta? Or how else can I keep track of this?

houglum commented 7 years ago
fredrikaverpil commented 7 years ago

Great, many thanks.

ghost commented 7 years ago

Is there a workaround for this issue? I have hashtags in my filenames in S3, and I can't rsync to GS.

houglum commented 7 years ago

There isn't currently a workaround for objects with these problematic characters in their names. I have higher-priority issues to address before revisiting this, so I don't have an ETA at the moment; apologies.

sebastianbk commented 5 years ago

Is there any progress on this? I have about 1.5 million files with hashtags in their names which I would like to move using rsync.

jmreid commented 4 years ago

I've also run into this issue when trying to sync files from S3 into GCS. Filenames with # in them in GCS cause the list operation to fail.

For example, if a file is named /folder_name/Logo #4.jpg in both GCS and S3 and I try to run rsync, I get:

send: u'HEAD /folder_name/Logo%20?versionId=4.jpg HTTP/1.1\r\nHost: 
Caught non-retryable exception while listing s3://bucket/folder_name/: BadRequestException: 400 None
Traceback (most recent call last):
  File "/snap/google-cloud-sdk/129/platform/gsutil/gslib/commands/rsync.py", line 649, in _ListUrlRootFunc
    out_file)
  File "/snap/google-cloud-sdk/129/platform/gsutil/gslib/commands/rsync.py", line 899, in _BatchSort
    current_chunk = sorted(islice(in_iter, buffer_size))
  File "/snap/google-cloud-sdk/129/platform/gsutil/gslib/commands/rsync.py", line 730, in _FieldedListingIterator
    for blr in iterator:
  File "/snap/google-cloud-sdk/129/platform/gsutil/gslib/wildcard_iterator.py", line 512, in IterObjects
    expand_top_level_buckets=True):
  File "/snap/google-cloud-sdk/129/platform/gsutil/gslib/wildcard_iterator.py", line 237, in __iter__
    fields=listing_fields):
  File "/snap/google-cloud-sdk/129/platform/gsutil/gslib/boto_translation.py", line 466, in ListObjects
    generation=generation)
  File "/snap/google-cloud-sdk/129/platform/gsutil/gslib/boto_translation.py", line 1341, in _GetBotoKey
    generation=generation)
  File "/snap/google-cloud-sdk/129/platform/gsutil/gslib/boto_translation.py", line 1723, in _TranslateExceptionAndRaise
    raise translated_exception  # pylint: disable=raising-bad-type
BadRequestException: BadRequestException: 400 None

It seems like the default with rsync should be to use a "raw" mode and transfer files based on their exact filename. If a list operation happens in GCS and gets Logo #4.jpg, that should never trigger a list operation on S3 for Logo%20?versionId=4.jpg.
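Until something like raw handling exists, one per-object workaround is to bypass gsutil's URI parsing entirely and copy the affected objects with the storage client libraries, which take object keys as literal strings. A minimal sketch, assuming boto3 and google-cloud-storage are installed; the bucket names and key below are hypothetical:

import boto3
from google.cloud import storage

S3_BUCKET = "my-s3-bucket"        # hypothetical source bucket
GCS_BUCKET = "my-gcs-bucket"      # hypothetical destination bucket
KEY = "folder_name/Logo #4.jpg"   # object key containing '#', used literally

# Download the S3 object to a local temp file, then upload it to GCS
# under the same literal key; neither client interprets '#' specially.
s3 = boto3.client("s3")
s3.download_file(S3_BUCKET, KEY, "/tmp/transfer.tmp")
storage.Client().bucket(GCS_BUCKET).blob(KEY).upload_from_filename("/tmp/transfer.tmp")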

NotNoah commented 2 years ago

Has anyone figured out a solution or workaround to this problem? It doesn't seem like this will land anytime soon, as the issue has been open for close to 8 years now.

sourcenouveau commented 2 years ago

I am also interested in a "raw" feature. I want to use gsutil with my GCS objects but I can't because they contain square brackets in their names.

sourcenouveau commented 2 years ago

Until raw mode is implemented, I have found a workaround that allows some commands to operate on objects that have wildcard characters in their names.

Surround each wildcard character with square brackets. For example:
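(Hypothetical object name: suppose the bucket contains an object literally named backup*2021.tar.)

gsutil ls gs://my-bucket/backup[*]2021.tar
gsutil cp gs://my-bucket/backup[*]2021.tar .

Here [*] is a one-character class that matches only a literal '*', so the URI resolves to the exact object name.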

This trick uses gsutil's pattern matching to match the desired wildcard characters. It seems to work with basic gsutil commands like ls, cp, and rm.

minerba commented 1 year ago

I tried to use the blob.from_string(gs_path, conn) function to get file data, but I received a "Not found" error from the system. I think the reason is that the gs_path includes a '#' character. What should I do?
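One thing that may help (a sketch, assuming the google-cloud-storage Python client; the bucket and object names are hypothetical): build the blob from its bucket and object name directly instead of parsing a gs:// URI, so the '#' is never treated as a URI fragment delimiter.

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")                 # hypothetical bucket name
blob = bucket.blob("reports/summary#2021.csv")      # literal object name containing '#'
data = blob.download_as_bytes()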