GoogleCloudPlatform / gsutil

A command line tool for interacting with cloud storage services.
Apache License 2.0

gsutil work with explicit directories #388

Open xmedeko opened 8 years ago

xmedeko commented 8 years ago

gcsfuse requires explicit directories by default. So the bucket has object:

dir/
dir/a.txt

When I do gsutil cp, mv, rsync, I need the explicit directory to be copied/moved too. Reason: gsutil is faster than gcsfuse, and gcsfuse cannot mv (rename) directories. So when I do gsutil mv gs://bucket/dir gs://bucket/newdir, I get the objects:

dir/
newdir/a.txt

but I would expect

newdir/
newdir/a.txt

Similarly with gsutil cp, rsync - the newdir/ explicit directory is not created.

Note: as a workaround I have created a shell script to create explicit dirs gsmkdirs.sh.
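For context, the core of such a workaround can be sketched as a small function that, given a flat object listing, computes which explicit directory placeholders (zero-byte objects whose names end in "/") are missing. This is an illustrative Python sketch, not the actual gsmkdirs.sh script:

```python
def missing_placeholders(object_names):
    """Given a flat listing of object names, return the explicit directory
    placeholder names (ending in "/") that would need to be created so
    every parent "directory" has one. Does not talk to GCS itself."""
    existing = {n for n in object_names if n.endswith("/")}
    needed = set()
    for name in object_names:
        # every ancestor prefix of the object implies a directory
        parts = name.rstrip("/").split("/")[:-1]
        for i in range(1, len(parts) + 1):
            needed.add("/".join(parts[:i]) + "/")
    return sorted(needed - existing)
```

For example, `missing_placeholders(["newdir/a.txt"])` reports that `newdir/` would need to be created, while a bucket that already contains `dir/` alongside `dir/a.txt` needs nothing.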

thobrla commented 8 years ago

It's unlikely we'll add support for gsutil to create placeholder objects representing directories. gsutil tries to be compatible with these placeholder objects, but it does not create them because it complicates gsutil command semantics.

xmedeko commented 8 years ago

Well, maybe for the local disk - bucket operations you are right. But I think for bucket - bucket operations the semantics are simple: treat a directory placeholder as a common file object. If it's in the source bucket, then mv/copy it to the destination bucket.

thobrla commented 8 years ago

Changing the semantics based on whether the destination is local or a bucket seems confusing to me, because it's only going halfway to preserving the fiction. We can't preserve the fiction when we copy locally. If we decide to preserve it when copying in the cloud, then it would also make sense to create placeholder objects when we copy from local to cloud, which we've explicitly decided not to do.

As an example of the kind of complication that could arise with this semantic, imagine that you are copying with a customer-supplied encryption key. Should directory placeholder objects then be guarded by the encryption key and inaccessible without it? I think this is hard to reason about.

xmedeko commented 8 years ago

I do not know how gsutil works with customer-supplied encryption keys and probably other use cases.

Just from my novice point of view: the behaviour of gsutil is confusing for me now, since it does not mirror the source bucket when doing bucket - bucket operations. E.g. when I make a backup of a bucket directory to a bucket (another or the same), and then restore the bucket, I have incomplete information - without the explicit dirs. I.e. I would expect the directory placeholder objects to behave as files with zero size.

IMO it's acceptable if local disk - bucket operations do not create the explicit dirs. The bucket and the local disk are different storages; other info is also not preserved (contentType, ACLs, ...). (Although some option, like -E for creating explicit dirs, would be nice, too.)

Although I do not know the customer-supplied encryption in gsutil, I would guess that the explicit dirs should be treated like other files, i.e. encrypted.

thobrla commented 8 years ago

I agree with your point of view that the existing semantics cause confusion, in particular when you are using the browser UI and gsutil interchangeably. What I'm trying to point out is that I think it would be difficult to remedy this confusion completely; instead we'd trade for another potentially confusing set of semantics.

That being said, when you "back up" bucket to bucket and gsutil skips these placeholders, wouldn't restoring from the bucket to local with gsutil work fine even without the placeholder directories? Are you concerned that there are empty directories on your local filesystem that have important meaning?

xmedeko commented 8 years ago

Our particular reason for explicit directories is gcsfuse (see the first post), not a copy on a local fs. gcsfuse cannot access the files without these explicit dirs. (It has the switch --implicit-dirs, but this gcsfuse mode is problematic.) In particular, gcsfuse cannot move/rename directories. So when I gsutil mv a directory, I then cannot see it via gcsfuse.
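The difference between the two gcsfuse modes described above can be modeled roughly: with explicit dirs (the default), only prefixes backed by a "name/" placeholder object are visible as directories; with --implicit-dirs, every prefix of any object name is. A simplified Python model for illustration only, not gcsfuse's actual implementation:

```python
def visible_dirs(object_names, implicit_dirs=False):
    """Return the directory paths a gcsfuse-like view would show.

    Simplified model: without implicit_dirs, a directory is visible only
    if a placeholder object "path/" exists; with implicit_dirs, every
    ancestor prefix of any object name counts as a directory."""
    explicit = {n.rstrip("/") for n in object_names if n.endswith("/")}
    if not implicit_dirs:
        return explicit
    implied = set()
    for n in object_names:
        parts = n.rstrip("/").split("/")[:-1]
        for i in range(1, len(parts) + 1):
            implied.add("/".join(parts[:i]))
    return explicit | implied
```

So a bucket holding only `dir/a.txt` (no `dir/` placeholder) shows no directories at all in the default mode, which is why `a.txt` becomes unreachable after a placeholder-dropping move.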

Note: I know gcsfuse is not and never will be production ready, but it greatly speeds up and simplifies admin and maintenance tasks, e.g. using the Linux find tool. So we are trying to maintain explicit dirs in our bucket for occasional gcsfuse access.

thobrla commented 8 years ago

I'm still not in favor of making gsutil treat placeholder objects for bucket-to-bucket operations differently than for local-to-bucket or bucket-to-local operations, because causing these objects to have different meaning depending on the destination:

  1. is a confusing semantic that is hard to explain
  2. adds further complexity to the already difficult task of honoring placeholder folders in gsutil

That said, I also think your desire to safely interoperate with gcsfuse is a reasonable one.

What would you think about a command (maybe in gsutil, but it may make more sense elsewhere) that takes a bucket as input and creates placeholder objects for each "directory" that it finds? This wouldn't allow you to preserve empty directories, but presumably those are not of high importance.

xmedeko commented 8 years ago

Yep, a tool to create explicit dirs would be enough for local-to-bucket operations. I have already created such a script, gsmkdirs.sh, but it's not optimal - it requires gcsfuse. It would be perfect if it were part of gsutil (something like gsutil mkdir).

thobrla commented 8 years ago

Thanks - leaving this issue open to consider post-hoc population of placeholder directory objects (and potentially gsutil mkdir to create an empty directory placeholder).

smemsh commented 7 years ago

Not that it will be implemented, but this really is a missing feature, because anything copied to GCS from a local filesystem using gsutil cannot actually be seen using gcsfuse. We can set the --implicit-dirs option for gcsfuse, but this results in unacceptable performance for us. If only there were a mode in gsutil that honored and maintained these directory placeholders for recursive cp and rsync (an after-the-fact pass is still useful, but less so)...

chemikadze commented 6 years ago

The weirdness of the current semantics is also noticeable when used together with the Hadoop connector. While the directory structure is preserved and fully understood by the Hadoop connector, gsutil is still the better choice for gcs-to-gcs copy operations. While being the only (?) out-of-the-box way to do metadata-level copies, it becomes virtually unusable when the directory structure must be preserved.

@thobrla So would you think it would be acceptable to add explicit flag like -E/--preserve-explicit-dirs during bucket-to-bucket operations?

thobrla commented 6 years ago

I think adding --preserve-explicit-dirs would need to be supported bidirectionally. This is complicated because what it means for a GCS object to be a "dir" is not clear. What is the expected behavior if I store data in an object ending with / and then try to copy it locally? There are other edge cases as well.
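The edge case mentioned - a data-bearing object whose name ends in "/" - can be made concrete. Assuming a listing that maps object names to sizes, a hypothetical helper (for illustration only) could separate zero-byte placeholders, which are safe to treat as directories, from slash-named objects that actually carry data and have no clean local-filesystem equivalent:

```python
def classify_slash_objects(objects):
    """objects: mapping of object name -> size in bytes.

    Returns (placeholders, data_objects): slash-terminated names that are
    zero-byte (plausible directory placeholders) vs slash-terminated names
    that carry data, which cannot be copied to a local path cleanly."""
    placeholders, data_objects = [], []
    for name, size in objects.items():
        if name.endswith("/"):
            (placeholders if size == 0 else data_objects).append(name)
    return sorted(placeholders), sorted(data_objects)
```

Any tool honoring placeholders would have to pick a policy for the second group, which is part of why the semantics are hard to pin down.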

I'm open to suggestions for resolving this, but without a cohesive design I think it's more appropriate to add the ability to populate placeholder dir objects post-hoc based on an object listing.

houglum commented 6 years ago

Just a quick note: I'm still open to listening to potential solutions, but even if we were able to agree on an approach to reduce some of the confusion around pseudo-directory semantics, it's unlikely that we (the gsutil team) would be able to implement that solution in the near future. Unfortunately, we've had quite a few high-priority work items come up in recent months, and we're having to deprioritize and put other items on hold as a result. (My intention isn't to put a damper on the conversation, but to set expectations on when we might see this fixed if we get to an agreeable solution.)

rjain15 commented 6 years ago

+1 on something built into gsutil. I can do it from the web console; I wonder why the CLI doesn't support it?

LawrenceMok commented 6 years ago

Very confusing - I just followed the documentation at https://cloud.google.com/storage/docs/gsutil/commands/mv and gsutil mv gs://my_bucket/olddir gs://my_bucket/newdir won't rename olddir to newdir,

not to mention moving a folder into another folder.

rishin458 commented 6 years ago

Use this simple code to mirror a src directory structure to a dest folder:

mimicDirectories() {
    local srcD="$1"
    local dstD="$2"
    local p=""
    local file=""

    if [[ ! -d "$srcD" ]]; then
        return
    fi

    mkdir -p "$dstD"

    local arr=($(find "$srcD" -maxdepth 1 -mindepth 1 -type d))

    for ((p=0; p<${#arr[*]}; p++)); do
        file=$(basename "${arr[p]}")
        mkdir -p "$dstD/$file"
        if [ $? == 0 ]; then
            mimicDirectories "${arr[p]}" "$dstD/$file"
        fi
    done
}

vishivish18 commented 6 years ago

Any update on this? I am facing the same issue, and by enabling --implicit-dirs there is a tradeoff in performance which I cannot afford.

rishin458 commented 6 years ago

@vishivish18 just use the mimicDirectories() shell function before using gsutil (https://github.com/GoogleCloudPlatform/gsutil/issues/388#issuecomment-387357551). It is a recursive function which duplicates the src directory structure to dst.

sonic1981 commented 2 years ago

gcsfuse is not an option if you're running a Mac. The cask install fails with:

Error: gcsfuse has been disabled because it requires closed-source macFUSE!

And following the macOS installation instructions fails with:

installer: Error - The FUSE for macOS installation package is not compatible with this version of macOS.

It fails because I have an ARM chipset. So I'm stuck. This should really be core functionality in gsutil. This functionality exists on Azure and AWS, so why not GCS?

JakeSummers commented 2 years ago

Any plans on adding a fix for this? Adding a --include-explicit-dirs flag would probably solve 95% of issues.

In my case, I am trying to rsync a bucket in my production env to a bucket in my test env. Aka:

export PRD="gs://jake-prd/dataset"
export TEST="gs://jake-test/dataset"
gsutil -m -d rsync -r $PRD $TEST

The codebase that I am using has some (overly rigid) sanity checks that fail when the number of files is not as expected.

So this difference causes a big problem:

$ gsutil ls $TEST | head

gs://jake-test/dataset/f1.txt
gs://jake-test/dataset/f2.txt

$ gsutil ls $PRD | head

gs://jake-prd/dataset/
gs://jake-prd/dataset/f1.txt
gs://jake-prd/dataset/f2.txt
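For what it's worth, the count mismatch above can be reconciled by comparing the two listings relative to their prefixes while ignoring placeholder entries (names ending in "/"), which rsync skips. A hypothetical Python sketch; the name lists would come from `gsutil ls` output:

```python
def missing_after_sync(src_names, dst_names, src_prefix, dst_prefix):
    """Return real objects (placeholders excluded) present under
    src_prefix but absent under dst_prefix, as prefix-relative keys."""
    def rel(names, prefix):
        # strip the bucket/path prefix so the two sides are comparable
        return {n[len(prefix):] for n in names if n.startswith(prefix)}
    def real(keys):
        # drop empty keys and directory placeholder entries
        return {k for k in keys if k and not k.endswith("/")}
    return sorted(real(rel(src_names, src_prefix)) - real(rel(dst_names, dst_prefix)))
```

Applied to the listings above, the only difference between the buckets is the `dataset/` placeholder itself - every real object was copied, so a sanity check on "real" objects would pass.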