GoogleCloudPlatform / gcsfuse

A user-space file system for interacting with Google Cloud Storage
https://cloud.google.com/storage/docs/gcs-fuse
Apache License 2.0

Add optional implicit directory behavior #7

Closed jacobsa closed 9 years ago

jacobsa commented 9 years ago

As discussed in the semantics doc, we require objects to exist for directories as well as files; there is no such thing as an implicit directory. If an object named foo/bar exists but no object named foo/ exists, then the file system behaves as if foo/bar does not exist. So if the user mounts a bucket containing only an object named foo/bar and then does cat foo/bar, they will get a "file not found" error.

Issue

When the user does cat foo/bar, fuse sends the following requests to gcsfuse:

  1. Look up the name "foo" within the root inode. Return its inode ID and whether it's a file or a directory, or fail if it's non-existent. Call the returned inode F.
  2. Look up the name "bar" within F. Return its inode ID and whether it's a file or a directory. Call the inode B.
  3. Read the contents of B.

The fundamental issue is that at the point of the call in (1), gcsfuse can see that an object named foo doesn't exist and therefore can say "foo" doesn't refer to a file, but needs to decide between telling the kernel that it doesn't exist at all or telling the kernel that it refers to a directory.

Current behavior

The current behavior is that in (1) gcsfuse asks GCS to do a consistent read of the metadata for two objects, foo and foo/. If it finds the first it calls "foo" a file, if it finds the second it calls it a directory, and if it finds neither it says it doesn't exist. That's why we require the object foo/ to exist for the directory to appear to exist.

This method works because unlike listing objects by prefix, a read of the metadata for a single object is guaranteed to be fresh.
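A minimal sketch of that two-object lookup, for illustration only. The bucket is faked as a newline-separated list in a shell variable, and object_exists and lookup are hypothetical names, not gcsfuse functions; a real implementation would issue consistent single-object metadata reads against GCS instead.

```shell
# Hypothetical in-memory stand-in for a bucket: one object name per line.
OBJECTS='foo/
foo/bar
baz/qux
hello.txt'

# object_exists NAME: succeeds if an object with exactly this name exists.
# Stands in for a consistent metadata read of a single object.
object_exists() {
  printf '%s\n' "$OBJECTS" | grep -Fxq -- "$1"
}

# lookup NAME: classify a name using only the two exact-name checks.
lookup() {
  if object_exists "$1"; then
    echo "file"
  elif object_exists "$1/"; then
    echo "directory"
  else
    echo "not found"
  fi
}

lookup hello.txt
lookup foo
lookup baz
```

In this sketch "baz" is reported as not found even though baz/qux exists, because no placeholder object named baz/ is present. That is the stranded-object case this issue is about.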

Alternatives

Listing-based lookup

One alternative is that we implement (1) by asking whether foo exists (as today) and by scanning objects with the prefix foo/, saying that "foo" is a directory if the scan is non-empty. But there are drawbacks here:

Even if GCS eventually offers list-your-own-writes consistency, negating the last point, the other issues remain.

Fixup tool

If users want to mount buckets where they've created object names assuming that implicit directories will work, we can create a "fixup" tool that lists the buckets and creates the appropriate objects for the implicit directories.

The main caveat here is that the tool would itself depend upon listing, so it may miss some objects in the bucket for the same reasons discussed above. Another caveat is the need to run such a tool, but the behavior could be built into the gcsfuse binary itself (either as the default when mounting or on an opt-in basis).
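As a rough sketch of what such a tool would need to compute, the snippet below derives the placeholder names that are implied by a set of object names but missing from it. The object list is a hypothetical in-memory stand-in for a bucket listing, missing_placeholders is an invented name, and actually creating the objects is left out.

```shell
# Hypothetical listing of a bucket: one object name per line.
OBJECTS='foo/bar
a/b/c.txt
a/
top.txt'

# missing_placeholders: print every "dir/" name that some object name implies
# but that is not itself present in the listing.
missing_placeholders() {
  printf '%s\n' "$OBJECTS" | awk -F/ '
    { seen[$0] = 1 }                      # remember every existing object name
    {
      prefix = ""
      for (i = 1; i < NF; i++) {          # every proper "dir/" prefix of the name
        prefix = prefix $i "/"
        implied[prefix] = 1
      }
    }
    END { for (d in implied) if (!(d in seen)) print d }
  ' | sort
}

missing_placeholders
```

For the sample listing this prints a/b/ and foo/, since a/ already exists. Because the input comes from a listing, the sketch inherits the caveat above: anything the listing misses is not fixed up.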

jacobsa commented 9 years ago

I currently lean toward treating this in the same way as the fast vs. consistent tradeoffs elsewhere in the product: make it configurable at mount time. (I know this is a cliche, but I don't have a better answer.)

Users with existing buckets with objects containing slash delimiters but no placeholder directory objects can have those objects show up, at the cost of flaky name lookups and weird directory existence behavior. Careful users who want to write software against the file system (including me when writing tests) can have sane and non-flaky behavior at the cost of potentially stranded objects.

More concretely: when serving a lookup, follow the current behavior. If the answer is "not found", optionally fall back to the listing path discussed above.
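A sketch of that proposal, under the same kind of assumptions as a toy model: the bucket is a newline-separated list in a shell variable, IMPLICIT_DIRS is a hypothetical switch standing in for the mount-time option, and the function names are invented for illustration.

```shell
# Hypothetical in-memory stand-in for a bucket: one object name per line.
OBJECTS='foo/
foo/bar
baz/qux
hello.txt'

# Hypothetical mount-time switch: 1 enables the listing fallback.
IMPLICIT_DIRS=${IMPLICIT_DIRS:-0}

# Exact-name check; stands in for a consistent single-object metadata read.
object_exists() {
  printf '%s\n' "$OBJECTS" | grep -Fxq -- "$1"
}

# Prefix check; stands in for listing objects by prefix, which in real GCS
# may be stale (the source of the flakiness discussed above).
prefix_nonempty() {
  printf '%s\n' "$OBJECTS" |
    awk -v p="$1" 'index($0, p) == 1 { found = 1 } END { exit !found }'
}

lookup() {
  if object_exists "$1"; then
    echo "file"
  elif object_exists "$1/"; then
    echo "directory"
  elif [ "$IMPLICIT_DIRS" = 1 ] && prefix_nonempty "$1/"; then
    echo "directory"      # implicit directory, found only via listing
  else
    echo "not found"
  fi
}

lookup baz
```

With the switch off, "baz" is not found; with IMPLICIT_DIRS=1 it is reported as a directory because baz/qux matches the prefix. The listing is consulted only after both exact-name checks fail.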

jacobsa commented 9 years ago

Decision with Jay: make it configurable, defaulting to the safe/non-flaky behavior, but with prominent advertising for the option.

jacobsa commented 9 years ago

Implementation notes:

jacobsa commented 9 years ago

Now that we've decided that conflicting file/directory names shouldn't cause one of the two to be hidden (cf. #28), I think we can't use the "fall back to listing only if the usual method fails" algorithm in LookUpInode. If we do, implicit directories will be hidden by files with conflicting names.

I think the logic should be:

Later, when #28 is worked on, the last part of the logic can be updated to deal with the \n sentinel thing discussed in that issue.

friday commented 7 years ago

I wasn't happy with the performance of --implicit-dirs, so I wrote a bash one-liner to create the missing dirs. You need to set up gsutil with your account first, and substitute the variables for your bucket name and mount point:

BUCKET=my-bucket MOUNTED_AT=/path/to/mount; gsutil ls "gs://$BUCKET/*/*/**" | xargs dirname | sort | uniq | sed "s/gs:\/\/$BUCKET\///" | xargs -I % mkdir -p "$MOUNTED_AT/%"

It doesn't print any progress info, but it worked like a charm for me. Some unix flavors might need the arguments adjusted though.

Update: Changed "gs://$BUCKET/**" to "gs://$BUCKET/*/*/**" to fix a harmless issue detailed below.

MongoExpUser commented 6 years ago

Just:

Create bucket
gsutil mb gs://myBucket/

Mount bucket
sudo gcsfuse myBucket /path-to-mount/folder

See google link for the create bucket part and other commands using gsutil: https://cloud.google.com/storage/docs/how-to

FossPrime commented 6 years ago

Seems to work on macOS with bash 4. The function wrapper is optional.

function fix-material() {
  BUCKET=mybucket
  MOUNTED_AT=/Users/mount
  gsutil ls "gs://${BUCKET}/**" | xargs -E '\n' -n1 dirname | sort | uniq | sed "s/gs:\/\/${BUCKET}\///" | xargs -I % mkdir -p "${MOUNTED_AT}/%"
}

Edit: I got an extraneous gs: folder made at the root of the bucket for some reason. Be sure to run find . -type d | wc -l with and without implicit dirs to verify. When I did it the numbers were equal, after deleting the random gs directory, which may have been placed there by my debugging attempts.

friday commented 6 years ago

@rayfoss I updated my comment, changing the globbing pattern to skip the root level. dirname removes the last part, i.e. "fileorfolder" in "gs://my-bucket/subdirectory/fileorfolder", so if you had files or folders at the root level (like if you run this twice) you would get a "gs:" folder. Pretty harmless issue though.
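To make the root-level case concrete, here is a small reproduction that needs no bucket. The listing is a hypothetical stand-in for gsutil ls output, and relative_dirs is just an invented wrapper around the middle of the pipeline.

```shell
# Hypothetical listing output; no gsutil call is made here.
BUCKET=my-bucket
LISTING='gs://my-bucket/top.txt
gs://my-bucket/sub/file.txt'

# Same dirname | sort | sed steps as the one-liner, minus the final mkdir.
relative_dirs() {
  xargs -n1 dirname | sort -u | sed "s/gs:\/\/$BUCKET\///"
}

printf '%s\n' "$LISTING" | relative_dirs
```

The root-level object top.txt yields "gs://my-bucket" from dirname. That string has no trailing slash, so the sed pattern does not match it and it passes through unchanged, while the nested object yields "sub". Feeding the unchanged line to mkdir -p is what creates the stray "gs:" directory.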

pkdetlefsen commented 6 years ago

@rayfoss @friday I had folders at the root level that needed to be created as well, so I used the "gs://$BUCKET/**" pattern and instead used tail -n +2 to remove the root folder entry from the directory list.

BUCKET=my-bucket MOUNTED_AT=/path/to/mount; gsutil ls "gs://$BUCKET/**" | xargs dirname | sort | uniq | tail -n +2 | sed "s/gs:\/\/$BUCKET\///" | xargs -I % mkdir -p "$MOUNTED_AT/%"

Thanks a lot for the oneliner!