Closed jacobsa closed 9 years ago
I currently lean toward treating this in the same way as the fast vs. consistent tradeoffs elsewhere in the product: make it configurable at mount time. (I know this is a cliche, but I don't have a better answer.)
Users with existing buckets with objects containing slash delimiters but no placeholder directory objects can have those objects show up, at the cost of flaky name lookups and weird directory existence behavior. Careful users who want to write software against the file system (including me when writing tests) can have sane and non-flaky behavior at the cost of potentially stranded objects.
More concretely: when serving a lookup, follow the current behavior. If the answer is "not found", optionally fall back to the listing path discussed above.
Decision with Jay: make it configurable, defaulting to the safe/non-flaky behavior, but with prominent advertising for the option.
Implementation notes:
Now that we've decided that conflicting file/directory names shouldn't cause one of the two to be hidden (cf. #28), I think we can't use the "fall back to listing only if the usual method fails" algorithm in LookUpInode. If we do, implicit directories will be hidden by files with conflicting names.
I think the logic should be:
ENOENT
Later, when #28 is worked on, the last part of the logic can be updated to deal with the \n sentinel thing discussed in that issue.
I wasn't happy with the performance of --implicit-dirs, so I wrote a bash oneliner to create missing dirs. You need to set up gsutil with your account first, and substitute the variables for your bucket name and mount point:
BUCKET=my-bucket MOUNTED_AT=/path/to/mount; gsutil ls "gs://$BUCKET/*/*/**" | xargs dirname | sort | uniq | sed "s/gs:\/\/$BUCKET\///" | xargs -I % mkdir -p "$MOUNTED_AT/%"
It doesn't print any progress info, but it worked like a charm for me. Some unix flavors might need the arguments adjusted though.
Update: Changed "gs://$BUCKET/**" to "gs://$BUCKET/*/*/**" to fix a harmless issue detailed below.
Just:
Create bucket: gsutil mb gs://myBucket/
Mount bucket: sudo gcsfuse myBucket /path-to-mount/folder
See google link for the create bucket part and other commands using gsutil: https://cloud.google.com/storage/docs/how-to
Seems to work on macOS with bash 4. Wrapping it in a function is optional.
function fix-material() {
BUCKET=mybucket
MOUNTED_AT=/Users/mount
gsutil ls "gs://${BUCKET}/**" | xargs -E '\n' -n1 dirname | sort | uniq | sed "s/gs:\/\/${BUCKET}\///" | xargs -I % mkdir -p "${MOUNTED_AT}/%"
}
Edit:
I got an extraneous gs: folder made at the root of the bucket for some reason. Be sure to run find . -type d | wc -l with and without implicit dirs to verify. When I did it the numbers were equal after deleting the random gs directory, which may have been placed there by my debugging attempts.
@rayfoss I updated my comment, changing the globbing pattern to skip the root level. dirname removes the last part, i.e. "fileorfolder" in "gs://my-bucket/subdirectory/fileorfolder", so if you had files or folders in the root level (like if you run this twice) you would get a "gs:" folder. Pretty harmless issue though.
@rayfoss @friday I had folders at root level that needed to be created as well, so I used the "gs://$BUCKET/**" pattern and instead used tail -n +2 to remove the root folder entry from the directory list.
BUCKET=my-bucket MOUNTED_AT=/path/to/mount; gsutil ls "gs://$BUCKET/**" | xargs dirname | sort | uniq | tail -n +2 | sed "s/gs:\/\/$BUCKET\///" | xargs -I % mkdir -p "$MOUNTED_AT/%"
Thanks a lot for the oneliner!
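One thing to watch with the one-liners above: xargs splits its input on whitespace, so object names containing spaces get broken apart before dirname sees them. Below is a minimal sketch of an alternative that reads the listing line by line. The function name implied_dirs is made up for illustration; it only assumes that the listing arrives on stdin as one gs://BUCKET/... URL per line, as in the gsutil ls output used above.

```shell
#!/usr/bin/env bash
# implied_dirs BUCKET (hypothetical helper, not part of gsutil or gcsfuse):
# read "gs://BUCKET/<object>" lines on stdin and print every directory
# implied by those object names, one per line, sorted and de-duplicated.
implied_dirs() {
  local bucket="$1" line path
  while IFS= read -r line; do
    case "$line" in
      "gs://$bucket/"*) path="${line#"gs://$bucket/"}" ;;  # strip the URL prefix
      *) continue ;;                                       # skip unrelated lines
    esac
    # Walk up the path, printing each ancestor directory.
    while [ "${path#*/}" != "$path" ]; do
      path="${path%/*}"
      if [ -n "$path" ]; then printf '%s\n' "$path"; fi
    done
  done | LC_ALL=C sort -u
}

# Possible usage, with BUCKET and MOUNTED_AT set as in the one-liners above:
#   gsutil ls "gs://$BUCKET/**" | implied_dirs "$BUCKET" |
#     while IFS= read -r d; do mkdir -p "$MOUNTED_AT/$d"; done
```

Objects at the root of the bucket contribute no directory, so no stray "gs:" folder is produced and no tail -n +2 step is needed.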
As discussed in the semantics doc, we require objects to exist for directories as well as files; there is no such thing as an implicit directory. If an object named foo/bar exists but no object named foo/ exists, then the file system behaves as if foo/bar does not exist. So if the user mounts a bucket containing only an object named foo/bar and then does cat foo/bar, they will get a "file not found" error.
Issue
When the user does cat foo/bar, fuse sends the following requests to gcsfuse:
The fundamental issue is that at the point of the call in (1), gcsfuse can see that an object named foo doesn't exist and therefore can say "foo" doesn't refer to a file, but needs to decide between telling the kernel that it doesn't exist at all or telling the kernel that it refers to a directory.
Current behavior
The current behavior is that in (1) gcsfuse asks GCS to do a consistent read of the metadata for two objects, foo and foo/. If it finds the first it calls "foo" a file, if it finds the second it calls it a directory, and if it finds neither it says it doesn't exist. That's why we require the object foo/ to exist for the directory to appear to exist.
This method works because unlike listing objects by prefix, a read of the metadata for a single object is guaranteed to be fresh.
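The two-object check described above can be sketched as follows. This is only an illustration, not gcsfuse's code: the function names are made up, and a plain text file with one object name per line stands in for the bucket, whereas gcsfuse would read object metadata from GCS.

```shell
#!/usr/bin/env bash
# Hypothetical helpers. OBJECTS_FILE stands in for the bucket and holds one
# object name per line.

# object_exists NAME OBJECTS_FILE: succeed if an object with exactly this
# name is present (whole-line, fixed-string match).
object_exists() {
  grep -Fxq -- "$1" "$2"
}

# classify_name NAME OBJECTS_FILE: print "file", "directory", or "not found",
# checking the object NAME first and the placeholder object NAME/ second.
classify_name() {
  local name="$1" objects="$2"
  if object_exists "$name" "$objects"; then
    echo "file"
  elif object_exists "$name/" "$objects"; then
    echo "directory"
  else
    echo "not found"
  fi
}
```

With a bucket holding only foo/bar, classify_name foo reports "not found", which matches the cat foo/bar failure described earlier.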
Alternatives
Listing-based lookup
One alternative is that we implement (1) by asking whether foo exists (as today) and by scanning objects with the prefix foo/, saying that "foo" is a directory if the scan is non-empty. But there are drawbacks here:
- If the user does rm foo/bar, suddenly it will appear as if the file system is completely empty. This is contrary to expectations, since the user hasn't done rmdir foo.
- If the user does rm foo/bar then touch foo/baz, the second command will fail with a surprising "no such file or directory" error.
- A scan of the prefix foo/ maps down to an unbounded number of requests to GCS, since each response contains a continuation token that must be redeemed to continue scanning and GCS does only a limited amount of work before bailing out and returning this. This means a single simple path resolution may result in enormous expense.
- Suppose the user does rm foo/bar. As discussed above, the directory "foo" now no longer exists because it was only implicitly defined, so the user gets the surprising behavior of touch foo/baz failing. Except they only get this behavior once the listing catches up. Worse, if they try the experiment several times then it may fail, succeed, fail, succeed, and fail again.
Even if GCS eventually offers list-your-own-writes consistency, negating the last point, the other issues remain.
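To make the disappearing-directory drawback concrete, here is a rough sketch of listing-based lookup. As an assumption for illustration, a text file with one object name per line plays the role of the bucket listing; the function names are hypothetical, and the sketch ignores pagination and listing staleness entirely.

```shell
#!/usr/bin/env bash
# Hypothetical helpers. OBJECTS_FILE stands in for the bucket listing and
# holds one object name per line.

# has_prefix PREFIX OBJECTS_FILE: succeed if any object name starts with PREFIX.
has_prefix() {
  local prefix="$1" name
  while IFS= read -r name; do
    case "$name" in
      "$prefix"*) return 0 ;;
    esac
  done < "$2"
  return 1
}

# classify_by_listing NAME OBJECTS_FILE: NAME is a file if that exact object
# exists, a directory if the scan of NAME/ is non-empty, otherwise not found.
classify_by_listing() {
  local name="$1" objects="$2"
  if grep -Fxq -- "$name" "$objects"; then
    echo "file"
  elif has_prefix "$name/" "$objects"; then
    echo "directory"
  else
    echo "not found"
  fi
}
```

If the file contains only foo/bar, foo is reported as a directory. Remove that one line (the equivalent of rm foo/bar) and foo becomes "not found", even though nothing resembling rmdir foo ever happened.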
Fixup tool
If users want to mount buckets where they've created object names assuming that implicit directories will work, we can create a "fixup" tool that lists the buckets and creates the appropriate objects for the implicit directories.
The main caveat here is that the tool would itself depend upon listing, so it may miss some objects in the bucket for the same reasons discussed above. Another caveat is the need to run such a tool, but the behavior could be built into the gcsfuse binary itself (either as the default when mounting or on an opt-in basis).
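As a sketch of the first half of such a tool, the snippet below works out which placeholder objects (names ending in /) are missing from a listing. It is an illustration under stated assumptions, not a proposed implementation: the function names are hypothetical, the listing is assumed to be a file with one object name per line and no gs:// prefix, and actually creating the placeholder objects is left out.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for a fixup pass over a bucket listing.

# implied_placeholders: read object names on stdin and print the placeholder
# name ("dir/") of every ancestor directory, sorted and de-duplicated.
implied_placeholders() {
  local name
  while IFS= read -r name; do
    name="${name%/}"   # an existing "a/b/" only implies its parent "a/"
    while [ "${name#*/}" != "$name" ]; do
      name="${name%/*}"
      if [ -n "$name" ]; then printf '%s/\n' "$name"; fi
    done
  done | LC_ALL=C sort -u
}

# missing_placeholders LISTING_FILE: print the implied placeholder names that
# do not already exist as objects in the listing.
missing_placeholders() {
  local listing="$1" dir
  implied_placeholders < "$listing" | while IFS= read -r dir; do
    grep -Fxq -- "$dir" "$listing" || printf '%s\n' "$dir"
  done
}
```

For a listing containing foo/bar, x/ and x/y/z, this prints foo/ and x/y/, since x/ already exists. Because the input is itself a listing, the caveat above still applies.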