Open mfschwartz opened 10 years ago
I've been storing all of our projects in this form: `gs://bucket/proj1`, `gs://bucket/proj2`, `gs://bucket/proj3`, and so on. We're in the hundreds of these projects, and I've successfully operated on them using `gsutil rsync` as well as `gsutil ls`. In which cases should I expect issues to arise?
To me it sounds like rsyncing `proj1` would also rsync `proj100`. Is this correct?
If so, is there any kind of workaround available to avoid rsyncing `proj100` if my intention is to only rsync `proj1`?
EDIT: I just tried this, and this command only downloads `proj1` and not `proj100` (which also exists in the bucket):

`gsutil -m rsync -r gs://ir-projects/proj1 .`

This makes me confused about what you wrote regarding the `bucket/object**27` example.
Would it be possible for you to announce here when any such raw-mode option becomes available in a gsutil beta? Or how else can I keep track of this?
That works because rsync effectively appends a `/` coming after "proj1", thus copying things with the prefix "proj1/" rather than "proj1".

Great, many thanks.
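The prefix behavior described above can be sketched in a few lines (an illustrative model of the matching, not gsutil's actual code; the object names are hypothetical):

```python
# Why rsyncing "proj1" doesn't also pull in "proj100": the source is
# matched as a directory-style prefix with a trailing "/", so only
# object names beginning with "proj1/" qualify.
def under_prefix(names, prefix):
    # Simplified model of directory-style prefix matching.
    return [n for n in names if n.startswith(prefix.rstrip("/") + "/")]

objects = ["proj1/a.txt", "proj1/sub/b.txt", "proj100/a.txt", "proj10/x.txt"]
print(under_prefix(objects, "proj1"))
# -> ['proj1/a.txt', 'proj1/sub/b.txt']
```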
Is there a workaround for this issue? I have hashtags in my filenames in S3, and I can't rsync to GS.
There isn't currently a workaround for objects with these problematic characters in their names. I've got some higher-priority issues to address before revisiting this, so I don't have an ETA at the moment; apologies.
Is there any progress on this? I have about 1.5 million files with hashtags in their names that I would like to move using `rsync`.
I've also run into this issue when trying to sync files from S3 into GCS. Filenames with `#` in them in GCS cause the list operation to fail. For example, if a file is named `/folder_name/Logo #4.jpg` in both GCS and S3 and I try to run rsync:
```
send: u'HEAD /folder_name/Logo%20?versionId=4.jpg HTTP/1.1\r\nHost:
Caught non-retryable exception while listing s3://bucket/folder_name/: BadRequestException: 400 None
Traceback (most recent call last):
  File "/snap/google-cloud-sdk/129/platform/gsutil/gslib/commands/rsync.py", line 649, in _ListUrlRootFunc
    out_file)
  File "/snap/google-cloud-sdk/129/platform/gsutil/gslib/commands/rsync.py", line 899, in _BatchSort
    current_chunk = sorted(islice(in_iter, buffer_size))
  File "/snap/google-cloud-sdk/129/platform/gsutil/gslib/commands/rsync.py", line 730, in _FieldedListingIterator
    for blr in iterator:
  File "/snap/google-cloud-sdk/129/platform/gsutil/gslib/wildcard_iterator.py", line 512, in IterObjects
    expand_top_level_buckets=True):
  File "/snap/google-cloud-sdk/129/platform/gsutil/gslib/wildcard_iterator.py", line 237, in __iter__
    fields=listing_fields):
  File "/snap/google-cloud-sdk/129/platform/gsutil/gslib/boto_translation.py", line 466, in ListObjects
    generation=generation)
  File "/snap/google-cloud-sdk/129/platform/gsutil/gslib/boto_translation.py", line 1341, in _GetBotoKey
    generation=generation)
  File "/snap/google-cloud-sdk/129/platform/gsutil/gslib/boto_translation.py", line 1723, in _TranslateExceptionAndRaise
    raise translated_exception  # pylint: disable=raising-bad-type
BadRequestException: BadRequestException: 400 None
```
It seems like the default for `rsync` should be a "raw" mode that transfers files based on their exact filenames. If a list operation in GCS returns `Logo #4.jpg`, that should never trigger a list operation on S3 for `Logo%20?versionId=4.jpg`.
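The failure mode above can be reproduced with the standard library alone: a `#` in an object name is treated as a URL delimiter, so everything after it is lost unless the literal name is percent-encoded first. This is a sketch of the general problem, not of gsutil's internals:

```python
from urllib.parse import urlsplit, quote

name = "folder_name/Logo #4.jpg"

# Parsed naively as a URL, the "#" starts a fragment and truncates the path.
parts = urlsplit("s3://bucket/" + name)
print(parts.path)      # '/folder_name/Logo '
print(parts.fragment)  # '4.jpg'

# A "raw" mode would percent-encode the literal name instead.
print(quote(name))     # 'folder_name/Logo%20%234.jpg'
```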
Has anyone figured out a solution or workaround to this problem? It doesn't seem like this will happen anytime soon, as the issue has been open for close to eight years now.
I am also interested in a "raw" feature. I want to use `gsutil` with my GCS objects, but I can't because their names contain square brackets.
Until raw mode is implemented, I have found a workaround that allows some commands to operate on files that have wildcard characters in their names: surround each wildcard character with square brackets. For example, instead of:

```
gsutil rm gs://bucket/x?*-file[1].jpg
```

use:

```
gsutil rm gs://bucket/x[?][*]-file[[]1[]].jpg
```

This trick uses gsutil's pattern matching to literally match the desired wildcard characters. It seems to work with basic `gsutil` commands like `ls`, `cp`, and `rm`.
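The bracket-escaping trick above is mechanical enough to script. Here is a small helper (hypothetical, not part of gsutil) that wraps every wildcard character in brackets:

```python
# gsutil's wildcard characters; "[" becomes "[[]" and "]" becomes "[]]",
# mirroring the workaround shown above.
GSUTIL_WILDCARDS = set("*?[]")

def escape_gsutil_wildcards(name: str) -> str:
    # Hypothetical helper: bracket each wildcard so it matches literally.
    return "".join(f"[{c}]" if c in GSUTIL_WILDCARDS else c for c in name)

print(escape_gsutil_wildcards("x?*-file[1].jpg"))
# -> x[?][*]-file[[]1[]].jpg
```

Note that this only helps with wildcard characters; it does not help with `#`, which gsutil treats as a version delimiter rather than a wildcard.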
I tried to use the `blob.from_string(gs_path, conn)` function to get file data, but I received a "Not found" error from the system. I think it is because `gs_path` includes a `#` character. What should I do?
gsutil parses URIs on the command line for wildcards and version numbers. Because GCS doesn't exclude the wildcard and version-delimiter characters from the object name charset, it's possible to construct URIs that don't work well with gsutil. For example, if you named an object "bucket/object**27", you couldn't use gsutil with this object name, because gsutil would interpret the URI `gs://bucket/object**27` as a wildcarding request and match other object names in addition to this one. Similarly, if you created an object called "bucket/object#1360101917269000.2", attempting to use the URI `gs://bucket/object#1360101917269000.2` with gsutil would cause gsutil to interpret it as the object `gs://bucket/object` with generation 1360101917269000 and metageneration 2.
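The version-delimiter ambiguity described above can be modeled in a few lines (a simplified sketch of the parsing rule, not gsutil's implementation):

```python
# Split "name#generation.metageneration" on the first "#", the way a
# version-aware URI parser would -- so an object whose literal name
# contains "#" is misinterpreted as a versioned reference.
def parse_versioned(object_path):
    name, sep, version = object_path.partition("#")
    if sep and version.replace(".", "", 1).isdigit():
        gen, _, metagen = version.partition(".")
        return (name, int(gen), int(metagen) if metagen else None)
    return (object_path, None, None)

print(parse_versioned("object#1360101917269000.2"))
# -> ('object', 1360101917269000, 2)
```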
To work around cases where this causes a problem we could implement a gsutil -r (top-level) command line option, that tells gsutil to use "raw" URI processing -- i.e., not to interpret any of the wildcard and version delimiter chars, and instead just to build object names literally from the passed string. gsutil couldn't support wildcarding or version-specific handling features with this mode, but at least a user could copy such files individually to/from GCS.