xmedeko opened this issue 8 years ago:

`gcsfuse` requires explicit directories by default, so the bucket holds a zero-byte placeholder object for each directory (e.g. `dir/`) alongside the file objects (e.g. `dir/file`). When I do `gsutil cp`, `mv`, or `rsync`, I need the explicit directory objects copied/moved too. Reason: gsutil is faster than `gcsfuse`, and `gcsfuse` cannot `mv` (rename) directories. So when I do `gsutil mv gs://bucket/dir gs://bucket/newdir`, the file objects end up under `newdir/`, but no `newdir/` placeholder is created - I would expect the `dir/` placeholder to be renamed along with the files. Similarly with `gsutil cp` and `rsync` - the `newdir/` explicit directory is not created. Note: as a workaround I have created a shell script to create explicit dirs, gsmkdirs.sh.
It's unlikely we'll add support for gsutil to create placeholder objects representing directories. gsutil tries to be compatible with these placeholder objects, but it does not create them because it complicates gsutil command semantics.
Well, maybe you are right about local disk - bucket operations. But I think for bucket - bucket operations the semantics are simple: treat a directory placeholder as a common file object. If it's in the source bucket, then mv/copy it to the destination bucket.
Changing the semantics based on whether the destination is local or a bucket seems confusing to me, because it's only going halfway to preserving the fiction. We can't preserve the fiction when we copy locally. If we decide to preserve it when copying in the cloud, then it would also make sense to create placeholder objects when we copy from local to cloud, which we've explicitly decided not to do.
As an example of the kind of complication that could arise with this semantic, imagine that you are copying with a customer-supplied encryption key. Should directory placeholder objects then be guarded by the encryption key and inaccessible without it? I think this is hard to reason about.
I do not know how gsutil works with customer-supplied encryption keys and probably other use cases.
Just from my novice point of view: the behaviour of gsutil is confusing to me now, since it does not mirror the source bucket when doing bucket - bucket operations. E.g. when I back up a bucket directory to another (or the same) bucket and then restore it, I end up with incomplete information - the explicit dirs are missing. I.e. I would expect the directory placeholder objects to behave as files with zero size.
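(As a hedged aside: such a placeholder really is just a zero-byte object whose name ends in `/`, so one can be created by hand by streaming an empty body from stdin; the bucket and path below are made up.)

```bash
# Create a zero-byte "directory" placeholder by streaming empty stdin;
# the trailing slash in the object name is the whole trick.
printf '' | gsutil cp - gs://my-bucket/backup/dir/
```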
IMO it's acceptable if local disk - bucket operations do not create the explicit dirs. The bucket and the local disk are different kinds of storage, and other info is not preserved either (contentType, ACLs, ...). (Although some option, like `-E` for creating explicit dirs, would be nice, too.)
Although I do not know how customer-supplied encryption works in gsutil, I would guess that explicit dirs should be treated like other files, i.e. encrypted.
I agree with your point of view that the existing semantics cause confusion, in particular when you are using the browser UI and gsutil interchangeably. What I'm trying to point out is that I think it would be difficult to remedy this confusion completely; instead we'd trade for another potentially confusing set of semantics.
That being said, when you "back up" bucket to bucket and gsutil skips these placeholders, wouldn't restoring from the bucket to local with gsutil work fine even without the placeholder directories? Are you concerned that there are empty directories on your local filesystem that have important meaning?
Our particular reason for explicit directories is `gcsfuse` (see the first post), not copying to a local fs. `gcsfuse` cannot access the files without these explicit dirs. (It has the switch `--implicit-dirs`, but that `gcsfuse` mode is problematic.) In particular, `gcsfuse` cannot move/rename directories. So when I `gsutil mv` a directory, I cannot see it via `gcsfuse`.
Note: I know `gcsfuse` is not and never will be production ready, but it greatly speeds up and simplifies admin and maintenance tasks, e.g. using the Linux `find` tool. So we are trying to maintain explicit dirs in our bucket for occasional `gcsfuse` access.
I'm still not in favor of making gsutil treat placeholder objects for bucket-to-bucket operations differently than for local-to-bucket or bucket-to-local operations, because giving these objects a different meaning depending on the destination only preserves the fiction halfway and trades one confusing set of semantics for another.
That said, I also think your desire to safely interoperate with gcsfuse is a reasonable one.
What would you think about a command (maybe in gsutil, but it may make more sense elsewhere) that takes a bucket as input and creates placeholder objects for each "directory" that it finds? This wouldn't allow you to preserve empty directories, but presumably those are not of high importance.
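For illustration only, here is a minimal sketch of what such a post-hoc pass could look like (call it populate_placeholders.sh), built from existing gsutil commands (`ls`, `stat`, and `cp` from stdin). The script is hypothetical, unoptimized (one `stat` round-trip per prefix), and not part of gsutil:

```bash
#!/usr/bin/env bash
# Hypothetical post-hoc pass: derive every "directory" prefix from the
# bucket's object listing and create a zero-byte placeholder object for
# each prefix that does not already exist.
set -euo pipefail

BUCKET="$1"   # e.g. gs://my-bucket (made-up name)

gsutil ls "${BUCKET}/**" |
  while IFS= read -r obj; do
    # Emit every ancestor "directory" of this object.
    path="${obj#"${BUCKET}/"}"
    while [[ "$path" == */* ]]; do
      path="${path%/*}"                     # drop the last path component
      printf '%s/%s/\n' "$BUCKET" "$path"
    done
  done |
  sort -u |
  while IFS= read -r dir; do
    # Create the placeholder only if it is missing.
    gsutil -q stat "$dir" 2>/dev/null ||
      printf '' | gsutil -q cp - "$dir"
  done
```

As noted above, empty directories on the source side are necessarily lost, since the pass can only infer prefixes from existing object names.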
Yep, a tool to create explicit dirs would be enough for local-to-bucket operations. I have already created such a script, gsmkdirs.sh, but it's not optimal; it requires `gcsfuse`. It would be perfect if it were part of gsutil (something like `gsutil mkdir`).
Thanks - leaving this issue open to consider post-hoc population of placeholder directory objects (and potentially `gsutil mkdir` to create an empty directory placeholder).
Not that it will be implemented, but this really is a missing feature, because anything copied to GCS from a local filesystem using `gsutil` cannot actually be seen using `gcsfuse`. We can set the `--implicit-dirs` option for gcsfuse, but this results in unacceptable performance for us. If only there were a mode in gsutil that honored and maintained these directory placeholders for recursive `cp` and `rsync` (an after-the-fact pass is still useful, but less so)...
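In the meantime, a hedged sketch of that after-the-fact approach, reusing the hypothetical populate_placeholders.sh from the earlier comment (paths and bucket name are made up):

```bash
# Hypothetical after-the-fact pass: copy as usual, then re-create the
# directory placeholders from the object listing (script sketched above).
gsutil -m rsync -r /data/src gs://my-bucket/dest
./populate_placeholders.sh gs://my-bucket
```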
The weirdness of the current semantics is also noticeable when using gsutil together with the Hadoop connector. While the directory structure is preserved and fully understood by the Hadoop connector, gsutil is still the better choice for gcs-to-gcs copy operations. Despite being the only (?) out-of-the-box way to do metadata-level copies, it becomes virtually unusable when the directory structure must be preserved.
@thobrla So would you think it acceptable to add an explicit flag like `-E`/`--preserve-explicit-dirs` for bucket-to-bucket operations?
I think adding `--preserve-explicit-dirs` would need to be supported bidirectionally. This is complicated because what it means for a GCS object to be a "dir" is not clear. What is the expected behavior if I store data in an object ending with `/` and then try to copy it locally? There are other edge cases as well.
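To make that edge case concrete: GCS does not stop an object whose name ends in `/` from carrying data, so "placeholder" is only a convention. A small hedged demonstration (bucket name is made up):

```bash
# A non-empty object ending in "/" is perfectly valid - it merely looks
# like a directory placeholder, so downloading it locally is ambiguous.
echo "surprise" | gsutil cp - gs://my-bucket/looks-like-a-dir/
gsutil cat gs://my-bucket/looks-like-a-dir/   # prints: surprise
```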
I'm open to suggestions for resolving this, but without a cohesive design I think it's more appropriate to add the ability to populate placeholder dir objects post-hoc based on an object listing.
Just a quick note: I'm still open to listening to potential solutions, but even if we were able to agree on an approach to reduce some of the confusion around pseudo-directory semantics, it's unlikely that we (the gsutil team) would be able to implement that solution in the near future. Unfortunately, we've had quite a few high-priority work items come up in recent months, and we're having to deprioritize and put other items on hold as a result. (My intention isn't to put a damper on the conversation, but to set expectations on when we might see this fixed if we get to an agreeable solution.)
+1 on something built into gsutil. I can do it from the web console; I wonder why the CLI doesn't support it.
Very confusing: I just followed the documentation at https://cloud.google.com/storage/docs/gsutil/commands/mv, and `gsutil mv gs://my_bucket/olddir gs://my_bucket/newdir` won't rename olddir to newdir - to say nothing of moving a folder into another folder.
Use this simple function to mimic the src directory structure in the dest folder:

```bash
mimicDirectories() {
  local srcD="$1"
  local dstD="$2"
  local p=""
  local file=""

  # Nothing to do if the source is not a directory.
  if [[ ! -d "$srcD" ]]; then
    return
  fi

  mkdir -p "$dstD"

  # Immediate subdirectories of the source.
  local arr=($(find "$srcD" -maxdepth 1 -mindepth 1 -type d))

  for ((p = 0; p < ${#arr[*]}; p++)); do
    file=$(basename "${arr[p]}")
    mkdir -p "$dstD/$file"
    if [ $? == 0 ]; then
      # Recurse into each subdirectory that was created successfully.
      mimicDirectories "${arr[p]}" "$dstD/$file"
    fi
  done
}
```
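For example, a hypothetical invocation (paths and bucket name are made up) that mirrors a local tree into a gcsfuse mount, so each `mkdir` becomes an explicit `dir/` placeholder, before letting gsutil copy the files:

```bash
# 1. Re-create the directory tree through the gcsfuse mount, which
#    turns every mkdir into an explicit "dir/" placeholder object.
mimicDirectories "/data/src" "/mnt/my-bucket/dest"
# 2. Let the faster gsutil copy the actual file contents.
gsutil -m rsync -r /data/src gs://my-bucket/dest
```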
Any update on this? I am facing the same issue, and enabling `--implicit-dirs` comes with a performance tradeoff I cannot afford.
@vishivish18 just use the mimicDirectories() shell function before running gsutil (https://github.com/GoogleCloudPlatform/gsutil/issues/388#issuecomment-387357551). It is a recursive function which duplicates the src directory structure to dst.
gcsfuse is not an option if you're running a Mac. The cask install fails with:

```
Error: gcsfuse has been disabled because it requires closed-source macFUSE!
```
And following the macOS installation instructions fails with:

```
installer: Error - The FUSE for macOS installation package is not compatible with this version of macOS.
```
It fails because I have an ARM chipset, so I'm stuck. This should really be core functionality in gsutil. This functionality exists on Azure and AWS, so why not GCS?
Any plans on adding a fix for this? Adding a `--include-explicit-dirs` flag would probably solve 95% of the issues.
In my case, I am trying to rsync a bucket in my production env to a bucket in my test env:
```bash
export PRD="gs://jake-prd/dataset"
export TEST="gs://jake-test/dataset"
gsutil -m -d rsync -r $PRD $TEST
```
The codebase that I am using has some (overly rigid) sanity checks that fail when the number of files is not as expected.
So this difference causes a big problem:
```console
$ gsutil ls $TEST | head
gs://jake-test/dataset/f1.txt
gs://jake-test/dataset/f2.txt
$ gsutil ls $PRD | head
gs://jake-prd/dataset/
gs://jake-prd/dataset/f1.txt
gs://jake-prd/dataset/f2.txt
```
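One hedged workaround for this kind of count check is to filter the placeholder objects (names ending in `/`) out of both listings before comparing; the pipeline below is an assumption, not a gsutil feature:

```bash
# Compare the two listings while ignoring zero-byte directory
# placeholders, which rsync does not copy between buckets.
diff <(gsutil ls "$PRD/**" | grep -v '/$' | sed "s|^$PRD|$TEST|" | sort) \
     <(gsutil ls "$TEST/**" | grep -v '/$' | sort)
```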