fsspec / s3fs

S3 Filesystem
http://s3fs.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
869 stars 271 forks source link

Unable to create a directory #401

Open srib opened 3 years ago

srib commented 3 years ago

What happened: Trying to create a directory in a bucket and it silently fails.

What you expected to happen: Expect a key created in a bucket.

Minimal Complete Verifiable Example:

from s3fs import S3FileSystem
s3 = S3FileSystem(anon=False)
s3.makedirs('s3://my-bucket-name/testdir', exist_ok=False) #fails to create 
s3.makedirs('my-bucket-name/testdir', exist_ok=False) #fails to create

How do I create a directory in a bucket? Am I doing something wrong or is it a bug? I see conversation in #245 and I was wondering if you could please help explain what the intended behavior is.

Anything else we need to know?: s3fs version == 0.5.0 Relevant issue here: #245

Environment:

martindurant commented 3 years ago

This is a long-standing misunderstanding, and your help, @srib , would be greatly appreciated in clarifying the situation for other users.

s3 does not support directories, except the top-level buckets, which are fairly similar in their behaviour. If you expect s3.makedirs('my-bucket-name/testdir') to enable you to write files with paths that start with that string then: congratulations! There is in fact nothing for s3fs to do, the location was already writable.

Once you have a file like "my-bucket-name/testdir/testfile", the directory will appear to exist, without that file it will not. This is similar to trying to commit an empty directory to git.

Unfortunately, the situation is complicated by the behaviour of the s3 console, which has a "make directory" button, whose action is actually to create an empty file with the name of the directory you wanted. This is not part of the s3 API, just a convention, and the AWS CLI does not do this. If we were to do this, we end up making extra requests, and have to distinguish between real common preixes (directories implied by having contents) and empty placeholder files.

srib commented 3 years ago

Thank you for the response, @martindurant!

Unfortunately, the situation is complicated by the behaviour of the s3 console, which has a "make directory" button, whose action is actually to create an empty file with the name of the directory you wanted. This is not part of the s3 API, just a convention, and the AWS CLI does not do this. If we were to do this, we end up making extra requests, and have to distinguish between real common preixes (directories implied by having contents) and empty placeholder files.

I get it now. If that is the case, I understand the intended behavior. If that is the case, in case where s3.makedirs('my-bucket-name/testdir') does nothing, do you think it is acceptable to show a warning message?

spsoni commented 3 years ago

@martindurant AWS have a concept of S3 folders. Check https://docs.aws.amazon.com/AmazonS3/latest/user-guide/using-folders.html

I am working on a comprehensive Pull request to have it as a feature in s3fs.

Check https://github.com/spsoni/s3fs/tree/feature/issue_401_support_for_s3_directory for my WIP

martindurant commented 3 years ago

The document you reference makes very clear that this concept of folder is relevant only to the AWS console view of S3, and not inherent to S3 and not implemented/supported by the AWS CLI, AWS SDKs, or REST API.

I think that, for the sake of this repo, we want to maintain adherence to the the true nature of S3 rather than whatever the console does. However, this thing has come up frequently, and s3fs already does make some assumptions around the meaning of "/" - I agree that we should think clearly about how best to handle and describe folder operations.

spsoni commented 3 years ago

@martindurant AWS doc (important section) says following

The Amazon S3 console treats all objects that have a forward slash "/" character as the last (trailing) character in the key name as a folder, for example examplekeyname/. You can't upload an object that has a key name with a trailing "/" character using the Amazon S3 console. However, you can upload objects that are named with a trailing "/" with the Amazon S3 API by using the AWS CLI, AWS SDKs, or REST API.

So it's not only a feature for AWS console but also a S3 core feature accessible via S3 API (CLI, SDK and REST API)

I agree we should not create pseudo-directories (intermediate ones), but if asked/expected to create an explicit folder object (without file objects), s3fs should honour s3.mkdir and offer the mechanism.

E.g. s3.mkdir("bucket/intermediate-dir/dir1")

Should create "intermediate-dir/dir1/" key in "bucket", but not the "intermediate-dir/" key. But s3.exists("intermediate-dir") returns True with or without trailing "/"

Two reasons why we should implement and adhere to S3 API for folders: 1) So when AWS says that it accepts keys ending with "/" as a folder, then s3fs should comply with the expectation. 2) s3. mkdir followed by s3.exists should return True from the API design point of view. And S3 API supports a mechanism to create and validate folder. You don't need to have objects inside this folder for its validation of existence. 3) Explicit folder key ending with "/" may not be an efficient use of S3, but it exists as a feature to create a place holder real folder object instead of common virtual folder objects

Expected behaviour:

s3.mkdir("bucket/dir") # creates key "dir/" in "bucket" s3.exists("bucket/dir") # True, with or without "/" s3.touch("bucket/dir/inner/afile") # this is allowed s3.exists("bucket/dir") # Should be True, even though there is no such key with or without "/"

martindurant commented 3 years ago

The Amazon S3 console treats all objects that have a forward slash "/" character as the last (trailing) character in the key name as a folder, for example examplekeyname/. You can't upload an object that has a key name with a trailing "/" character using the Amazon S3 console. However, you can upload objects that are named with a trailing "/" with the Amazon S3 API by using the AWS CLI, AWS SDKs, or REST API.

So it's not only a feature for AWS console but also a S3 core feature accessible via S3 API (CLI, SDK and REST API)

I'm sorry, but the quoted text says the exact opposite of your interpretation: the special behaviour applies to the console only, and from CLI, SDK, REST, the "/" character is not special.

I also invite you to consider what you think the right behaviour ought to be in cases where:

s3. mkdir followed by s3.exists should return True

This point I agree is a problem, and there is an issue for this; but I think it's OK to say in documentation that mkdir is a no-op except for creating buckets. You could solve the issue by keeping tack of mkdirs within the instance, not touching the remote storage backend.

spsoni commented 3 years ago

Very thought-provoking scenarios.

That raises a fundamental question. Is the objective of the s3fs project is to wrap what AWS S3 API offers or we keep the API interface for s3fs compliant with a general-purpose file system like Linux/POSIX fs.

My understanding is that s3fs is just a wrapper around the aiobotocore (or eventually botocore) library and should offer what botocore S3 API offers and capable of.

I will try to answer with that assumption in mind. Remove "/bucket" from strings where I refer to the boto3 key below.

keys named "/bucket/inner" and "/bucket/inner/" both exist

"/bucket/inner" should be able to represent both file and directory from the s3fs API design point of view.

But if two keys exist "/bucket/inner" (file) and "/bucket/inner/" (directory) then

s3.isdir("/bucket/inner")  # and 
s3.isfile("/bucket/inner") # both returns True

And, explicit check for directory key

s3.isdir("/bucket/inner/")  # returns True 
s3.isfile("/bucket/inner/") # returns False

Deletion scenarios

s3.rm("/bucket/inner")  # deletes file key

s3.rmdir("/bucket/inner")   # and,
s3.rmdir("/bucket/inner/")  # deletes the directory

Creation scenarios

s3.touch("/bucket/inner")   # creates a file

s3.mkdir("/bucket/inner")   # and
s3.mkdir("/bucket/inner/")  # creates a directory with key: "/bucket/inner/"

keys named "/bucket/inner" and "/bucket/inner/stuff" both exist

"/bucket/inner" could be anything again (file and/or directory) And, "/bucket/inner/stuff" should be okay to co-exist in all of the scenarios for "/bucket/inner" (as a file and/or directory)

s3.touch("/bucker/inner")  # creates a file
s3.mkdir("/bucket/inner")  # creates a directory key as well (key: "/bucker/inner/")
s3.touch("/buckker/inner/stuff")  # creates a file (can try to create the directory ("stuff") as well

Now expected outcome for validation

s3.isfile("/bucket/inner")  # returns True
s3.isdir("/bucket/inner")   # return True
s3.isfile("/bucket/inner/") # returns False
s3.isdir("/bucket/inner/")  # returns True
s3.isfile("/bucker/inner/stuff")  # returns True

The most challenging part is the deletion

s3.rm("/bucker/inner")     # should delete the file key
s3.rmdir("/bucket/inner")  # should raise an exception that directory is not empty
s3.rmdir("/bucker/inner", recursive=True)  # should delete both keys "/bucker/inner/" and "/bucker/inner/stuff"

a key exists at "/bucket/inner/" which contains data and/or metadata

boto3 allows storing data for the "/bucket/inner/" key. Check the following output (highlighted directory content size)

aws cli long listing files:

% aws s3 ls s3://bucket/
                           PRE inner/
2021-01-07 19:00:01          0 inner
% aws s3 ls s3://bucket/inner/
2021-01-07 18:58:05          0 **(zero byte by default for directory content)**
2021-01-07 18:57:38          0 stuff
% aws s3 ls s3://bucket/inner/

boto3 object creation:

boto3_s3_client.put_object(Bucket='bucket', Key='inner1/', Body="sdf")
boto3_s3_client.put_object(Bucket='bucket', Key='inner1/stuff', Body="sdf")

aws cli long listing again:

% aws s3 ls s3://bucket/
                           PRE inner1/
% aws s3 ls s3://bucket/inner1/
2021-01-07 19:28:11          3 **(non-zero bytes for directory content)**
2021-01-07 19:28:47          3 stuff

You could solve the issue by keeping track of mkdirs within the instance, not touching the remote storage backend.

I would not make an assumption for end-user of the s3fs library. s3fs is just a wrapper library end of the day. And documenting that mkdir is a no-op breaks that rule of the wrapper library responsibility

martindurant commented 3 years ago

Thank you for thinking through some of the scenarios, and I hope you realise that there are more! The point is, whether you still to posix-on-s3 (requiring directories) or pure-s3 (the current), you expose holes in the mismatch and have to hope that users see things the same way you do.

boto3 allows storing data for the "/bucket/inner/" key

Yes, but is this a file, directory, or both?

The final and perhaps most compelling case for leaving things how they are, is that if mkdir is taken to mean "make the namespace below this writable" (as opposed to "create a directly-like object"), then the current implementation does this without a call to s3, which may be important. Libraries like parquet IO might call mkdir a lot, and this can hinder performance.

spsoni commented 3 years ago

So if the s3fs project's prime goal is to cater for high performance for libraries like parquet IO. In that case, it's better to keep it the way it is.

Ideally, there should be two versions of this project or distinguished by the constructor parameter (normal vs light version). That way for small projects where the expectation is to have a directory as the real object in s3, then normal version code can be used. Or for namesake call it anything distinguishing it from the current high-performance version.

martindurant commented 3 years ago

I agree with this, it would be nice to have the option of "console-style" behaviour with directory objects, even if it's not the default. I'm not sure it's easier to implement, though, since you still have to care about buckets that don't follow that convention.

e13h commented 3 years ago

s3 does not support directories, except the top-level buckets, which are fairly similar in their behaviour. If you expect s3.makedirs('my-bucket-name/testdir') to enable you to write files with paths that start with that string then: congratulations! There is in fact nothing for s3fs to do, the location was already writable.

@martindurant Would it make sense to either disable or raise a warning for mkdir, mkdirs, makedir, and makedirs? The takeaway for me from this issue is that these don't really do anything because of S3's object-store paradigm.

martindurant commented 3 years ago

That would make sense if mkdir were only called by end users, but other libraries call it internally. Since the intent there is "make sure that files can be written as children of this path", the co-op does exactly the right thing - and errors and warnings would just get in the way.

tasansal commented 3 years ago

@martindurant The silent "creation" of the key without an apparent directory causes fsspec to fail in certain cases as well. Not sure where would be the best way to fix this? Maybe the check, create kwargs are redundant in fsspec or, we need to change something here in s3fs. The check kwarg (to see if we can touch the path) for fsspec.FSMap expects the "directory" key to exist. There is a create kwarg to make it "create" that folder, so check can run in cases where it doesn't exist.

However, below is the current behavior. For root = 'my_bucket/dir_i_want/', if it doesn't exists we can pass create=True and it runs a s3fs.core.S3FileSystem.mkdir(root) call. However, when queried by s3fs.core.S3FileSystem.exists(root), just couple lines later, it still returns False. Hence it fails and tells us to pass create=True. check works as expected if we create the folder via the console.

This is all related to trying to debug weird behavior I've seen at zarr-developers/zarr-python/#814

martindurant commented 3 years ago

Possibly it's the exists call in FSMap that's not required. We nest test whether we can write something, that should be enough; or for read-only (which is not explicitly labeled anywhere) that we can list the resultant map.

tasansal commented 3 years ago

I agree, it sounds like the check and create in fsspec are redundant. I will open an issue there to discuss further.

joaoe commented 1 year ago

Howdy ! In my project I have a somewhat complex application that must transparently work over a local folder or a S3 bucket. One of the wayt ot acheive this was to use s3fs-fuse to mount a local file through the OS. However, I had some limitations running s3fs-fuse in an OpenShift container so I'm been looking at a pure python layer. Yet, S3FileSystem is quite limited to perfectly emulate the real file system.

  1. S3FileSystem inherits from fsspec.AsyncFileSystemand therefore is it advertising itself as a file system, which means folder support must be there. Yet, it is not and as such anyone working with S3FileSystem needs a lot of boilerplate to fix folder support.
  2. s3fs-fuse creates keys in the bucket ending with a slash / to signify it is a folder https://github.com/s3fs-fuse/s3fs-fuse/blob/f8a825e9b969a2e4cae89f86a4a47d7458ea7e81/src/s3fs.cpp#L1078
  3. I tried to fake that with S3FileSystem by appending a "/." or "/.folder" to create a stray file. However, I tried to monkeypatch a bit the code in S3FileSystem not to strip trailing slashes and the key was created and then recognized as s3fs-fuse as a folder.

So, I suggest S3Filesystem finally uses "path/" as a format to keep folders and thus improve compatibility with regular file systems which implement fsspec.AsyncFileSystem. This would apply to everything folder related.

martindurant commented 1 year ago

The problem you are facing is not really with s3fs, but with the fuse integration. Indeed, s3 does not have directories. You can do things to try to get around that (such as the zero-length files), but these are conventions circumventing the actual storage model and API. When using S3FileSystem directly, mkdir() always succeeds, in that you can write to a path within a directory (you always can).

The problem arise in trying to cd to an empty directory. But you can write to a path within a directory without first creating or changing to that directory.

joaoe commented 1 year ago

The problem you are facing is not really with s3fs, but with the fuse integration. Indeed, s3 does not have directories. You can do things to try to get around that (such as the zero-length files), but these are conventions circumventing the actual storage model and API. When using S3FileSystem directly, mkdir() always succeeds, in that you can write to a path within a directory (you always can).

Well, I think we all know that. But the problem is that S3FileSystem tells the world it is a file system and implements the fsspec interface, while actually not implementing file system operations with folders. So, this breaks the contract.

martindurant commented 1 year ago

It does the best that it can, given the limitations of the actual backend storage. mkdir() does work in the most important sense of making a path in which you can write files.Having a side effect of creating files seems like a bad idea and even less filesystem-like.

Danferno commented 6 hours ago

An issue I encountered with this is when combining simplecache and a blob-style backend (in my case, abfs) with a pandas to_parquet. I can't create a directory because mkdir just passes (no directories in blobs), but the to_parquet fails because the directory does not exist in the case. Any suggestions on how to solve this?