srib opened this issue 3 years ago
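What happened: Trying to create a directory in a bucket and it silently fails.

What you expected to happen: Expected a key to be created in the bucket.

Minimal Complete Verifiable Example:

How do I create a directory in a bucket? Am I doing something wrong, or is it a bug? I see the conversation in #245 and I was wondering if you could please help explain what the intended behavior is.

Anything else we need to know?: s3fs version == 0.5.0. Relevant issue here: #245

Environment: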
This is a long-standing misunderstanding, and your help, @srib, would be greatly appreciated in clarifying the situation for other users.

S3 does not support directories, except for the top-level buckets, which are fairly similar in their behaviour. If you expect `s3.makedirs('my-bucket-name/testdir')` to enable you to write files with paths that start with that string, then: congratulations! There is in fact nothing for s3fs to do; the location was already writable.

Once you have a file like "my-bucket-name/testdir/testfile", the directory will appear to exist; without that file, it will not. This is similar to trying to commit an empty directory to git.

Unfortunately, the situation is complicated by the behaviour of the S3 console, which has a "make directory" button, whose action is actually to create an empty file with the name of the directory you wanted. This is not part of the S3 API, just a convention, and the AWS CLI does not do this. If we were to do this, we would end up making extra requests, and would have to distinguish between real common prefixes (directories implied by having contents) and empty placeholder files.
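For illustration, a minimal boto3 sketch of both points (the bucket name is hypothetical): the console's "folder" is just a zero-byte object, and listing with a delimiter is how directories get implied.

```python
import boto3

s3 = boto3.client("s3")

# What the console's "make directory" button effectively does:
# create a zero-byte object whose key ends in "/".
s3.put_object(Bucket="my-bucket-name", Key="testdir/")

# How "directories" are actually inferred: any key sharing a prefix
# shows up as a CommonPrefix when listing with a delimiter.
resp = s3.list_objects_v2(Bucket="my-bucket-name", Delimiter="/")
print(resp.get("CommonPrefixes", []))  # e.g. [{'Prefix': 'testdir/'}]
```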
Thank you for the response, @martindurant!

> Unfortunately, the situation is complicated by the behaviour of the S3 console, which has a "make directory" button, whose action is actually to create an empty file with the name of the directory you wanted. This is not part of the S3 API, just a convention, and the AWS CLI does not do this. If we were to do this, we would end up making extra requests, and would have to distinguish between real common prefixes (directories implied by having contents) and empty placeholder files.

I get it now, and I understand the intended behavior. In the case where `s3.makedirs('my-bucket-name/testdir')` does nothing, do you think it is acceptable to show a warning message?
@martindurant AWS has a concept of S3 folders. Check https://docs.aws.amazon.com/AmazonS3/latest/user-guide/using-folders.html

I am working on a comprehensive pull request to have it as a feature in s3fs. Check https://github.com/spsoni/s3fs/tree/feature/issue_401_support_for_s3_directory for my WIP.
The document you reference makes very clear that this concept of folder is relevant only to the AWS console view of S3; it is not inherent to S3, and it is not implemented/supported by the AWS CLI, AWS SDKs, or REST API. The current behaviour:

```python
s3.exists("bucket/dir")             # False, with or without "/"
s3.touch("bucket/dir/inner/afile")  # this is allowed
s3.exists("bucket/dir")             # now True, even though there is no such key, with or without "/"
```
I think that, for the sake of this repo, we want to maintain adherence to the true nature of S3 rather than whatever the console does. However, this has come up frequently, and s3fs already makes some assumptions around the meaning of "/", so I agree that we should think clearly about how best to handle and describe folder operations.
@martindurant The AWS doc (important section) says the following:

> The Amazon S3 console treats all objects that have a forward slash "/" character as the last (trailing) character in the key name as a folder, for example examplekeyname/. You can't upload an object that has a key name with a trailing "/" character using the Amazon S3 console. However, you can upload objects that are named with a trailing "/" with the Amazon S3 API by using the AWS CLI, AWS SDKs, or REST API.

So it's not only a feature of the AWS console but also an S3 core feature accessible via the S3 API (CLI, SDK and REST API).
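For instance, such a trailing-slash key can be created directly with the CLI (the bucket name here is hypothetical):

```
% aws s3api put-object --bucket my-bucket --key examplekeyname/
```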
I agree we should not create pseudo-directories (intermediate ones), but if asked/expected to create an explicit folder object (without file objects), s3fs should honour `s3.mkdir` and offer the mechanism. E.g. `s3.mkdir("bucket/intermediate-dir/dir1")` should create the key "intermediate-dir/dir1/" in "bucket", but not the "intermediate-dir/" key. But `s3.exists("bucket/intermediate-dir")` returns True, with or without a trailing "/".
Three reasons why we should implement and adhere to the S3 API for folders:

1. When AWS says that it accepts keys ending with "/" as a folder, s3fs should comply with that expectation.
2. `s3.mkdir` followed by `s3.exists` should return True from the API design point of view. And the S3 API supports a mechanism to create and validate folders; you don't need to have objects inside a folder to validate its existence.
3. An explicit folder key ending with "/" may not be an efficient use of S3, but it exists as a feature to create a placeholder real folder object instead of the common virtual folder objects.
Expected behaviour:

```python
s3.mkdir("bucket/dir")              # creates key "dir/" in "bucket"
s3.exists("bucket/dir")             # True, with or without "/"
s3.touch("bucket/dir/inner/afile")  # this is allowed
s3.exists("bucket/dir")             # still True, with or without "/"
```
> The Amazon S3 console treats all objects that have a forward slash "/" character as the last (trailing) character in the key name as a folder, for example examplekeyname/. You can't upload an object that has a key name with a trailing "/" character using the Amazon S3 console. However, you can upload objects that are named with a trailing "/" with the Amazon S3 API by using the AWS CLI, AWS SDKs, or REST API.
>
> So it's not only a feature of the AWS console but also an S3 core feature accessible via the S3 API (CLI, SDK and REST API)
I'm sorry, but the quoted text says the exact opposite of your interpretation: the special behaviour applies to the console only, and from the CLI, SDK, and REST API, the "/" character is not special.

I also invite you to consider what you think the right behaviour ought to be in cases where:

- keys named "/bucket/inner" and "/bucket/inner/" both exist
- keys named "/bucket/inner" and "/bucket/inner/stuff" both exist
- a key exists at "/bucket/inner/" which contains data and/or metadata

> `s3.mkdir` followed by `s3.exists` should return True

This point I agree is a problem, and there is an issue for this; but I think it's OK to say in documentation that `mkdir` is a no-op except for creating buckets. You could solve the issue by keeping track of mkdirs within the instance, not touching the remote storage backend.
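A rough sketch of that last idea, purely hypothetical (this subclass and its attributes are not part of s3fs): remember mkdir calls on the instance and consult that record in exists, without touching S3.

```python
import s3fs

class DirTrackingS3FileSystem(s3fs.S3FileSystem):
    """Hypothetical: record mkdir calls locally instead of creating
    anything remotely (bucket creation is omitted for brevity)."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._pseudo_dirs = set()  # paths "created" by mkdir

    def mkdir(self, path, **kwargs):
        # No remote call: just remember that this path was requested.
        self._pseudo_dirs.add(path.rstrip("/"))

    def exists(self, path, **kwargs):
        # A locally recorded mkdir makes the path "exist".
        if path.rstrip("/") in self._pseudo_dirs:
            return True
        return super().exists(path, **kwargs)
```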
Very thought-provoking scenarios.

That raises a fundamental question: is the objective of the s3fs project to wrap what the AWS S3 API offers, or to keep the s3fs API interface compliant with a general-purpose file system like Linux/POSIX? My understanding is that s3fs is just a wrapper around the aiobotocore (or eventually botocore) library and should offer what the botocore S3 API offers and is capable of. I will try to answer with that assumption in mind. (Remove "/bucket" from the strings where I refer to the boto3 key below.)
> keys named "/bucket/inner" and "/bucket/inner/" both exist

"/bucket/inner" should be able to represent both a file and a directory from the s3fs API design point of view. But if the two keys "/bucket/inner" (file) and "/bucket/inner/" (directory) both exist, then:

```python
s3.isdir("/bucket/inner")    # and
s3.isfile("/bucket/inner")   # both return True
```

And, explicitly checking the directory key:

```python
s3.isdir("/bucket/inner/")   # returns True
s3.isfile("/bucket/inner/")  # returns False
```

Deletion scenarios:

```python
s3.rm("/bucket/inner")       # deletes the file key
s3.rmdir("/bucket/inner")    # and,
s3.rmdir("/bucket/inner/")   # deletes the directory
```

Creation scenarios:

```python
s3.touch("/bucket/inner")    # creates a file
s3.mkdir("/bucket/inner")    # and
s3.mkdir("/bucket/inner/")   # creates a directory with key "/bucket/inner/"
```
> keys named "/bucket/inner" and "/bucket/inner/stuff" both exist

"/bucket/inner" could be anything again (a file and/or a directory), and "/bucket/inner/stuff" should be okay to co-exist with all of the scenarios for "/bucket/inner" (as a file and/or a directory).

```python
s3.touch("/bucket/inner")        # creates a file
s3.mkdir("/bucket/inner")        # creates a directory key as well (key: "/bucket/inner/")
s3.touch("/bucket/inner/stuff")  # creates a file (could create the parent directory key as well)
```

Now, the expected outcomes for validation:

```python
s3.isfile("/bucket/inner")        # returns True
s3.isdir("/bucket/inner")         # returns True
s3.isfile("/bucket/inner/")       # returns False
s3.isdir("/bucket/inner/")        # returns True
s3.isfile("/bucket/inner/stuff")  # returns True
```

The most challenging part is the deletion:

```python
s3.rm("/bucket/inner")                     # should delete the file key
s3.rmdir("/bucket/inner")                  # should raise an exception that the directory is not empty
s3.rmdir("/bucket/inner", recursive=True)  # should delete both keys "/bucket/inner/" and "/bucket/inner/stuff"
```
> a key exists at "/bucket/inner/" which contains data and/or metadata

boto3 allows storing data for the "/bucket/inner/" key. Check the following output (note the directory content size).

aws cli long listing of files:

```
% aws s3 ls s3://bucket/
                           PRE inner/
2021-01-07 19:00:01          0 inner
% aws s3 ls s3://bucket/inner/
2021-01-07 18:58:05          0    (zero bytes by default for the directory content)
2021-01-07 18:57:38          0 stuff
```

boto3 object creation:

```python
boto3_s3_client.put_object(Bucket='bucket', Key='inner1/', Body="sdf")
boto3_s3_client.put_object(Bucket='bucket', Key='inner1/stuff', Body="sdf")
```

aws cli long listing again:

```
% aws s3 ls s3://bucket/
                           PRE inner1/
% aws s3 ls s3://bucket/inner1/
2021-01-07 19:28:11          3    (non-zero bytes for the directory content)
2021-01-07 19:28:47          3 stuff
```
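To underline how ambiguous that is, the "directory" key can be read back like any other object; a short boto3 sketch using the same hypothetical bucket and keys as above:

```python
import boto3

s3 = boto3.client("s3")

# "inner1/" is a real object with a body, despite looking like a folder.
obj = s3.get_object(Bucket="bucket", Key="inner1/")
print(obj["Body"].read())  # b"sdf"
```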
> You could solve the issue by keeping track of mkdirs within the instance, not touching the remote storage backend.

I would not make assumptions on behalf of the end user of the s3fs library. s3fs is just a wrapper library at the end of the day, and documenting that mkdir is a no-op breaks that rule of wrapper-library responsibility.
Thank you for thinking through some of the scenarios, and I hope you realise that there are more! The point is, whether you stick to posix-on-s3 (requiring directories) or pure-s3 (the current behaviour), the mismatch exposes holes, and you have to hope that users see things the same way you do.

> boto3 allows storing data for the "/bucket/inner/" key

Yes, but is this a file, a directory, or both?

The final and perhaps most compelling case for leaving things as they are is that, if mkdir is taken to mean "make the namespace below this writable" (as opposed to "create a directory-like object"), then the current implementation does this without a call to S3, which may be important. Libraries like parquet IO might call mkdir a lot, and extra requests would hinder performance.
So if the s3fs project's prime goal is to cater for high performance for libraries like parquet IO, then it's better to keep it the way it is.

Ideally, there would be two versions of this project, or behaviour distinguished by a constructor parameter (a normal vs a light version). That way, for small projects where the expectation is to have a directory as a real object in S3, the normal version can be used; or, for the sake of a name, call it anything that distinguishes it from the current high-performance version.

I agree with this; it would be nice to have the option of "console-style" behaviour with directory objects, even if it's not the default. I'm not sure it's easier to implement, though, since you still have to care about buckets that don't follow that convention.
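For the sake of discussion, a minimal sketch of what such an opt-in "console-style" mode might look like; this subclass is entirely hypothetical and not part of s3fs:

```python
import boto3
import s3fs

class MarkerS3FileSystem(s3fs.S3FileSystem):
    """Hypothetical opt-in mode: mkdir creates a zero-byte "dir/"
    marker object, mimicking the console convention."""

    def mkdir(self, path, **kwargs):
        bucket, key, _ = self.split_path(path)
        if not key:
            # Top-level path: fall back to real bucket creation.
            return super().mkdir(path, **kwargs)
        # Create the marker with plain boto3; fsspec path handling
        # strips trailing slashes, so we add the "/" back explicitly.
        boto3.client("s3").put_object(Bucket=bucket, Key=key + "/")
```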
> s3 does not support directories, except the top-level buckets, which are fairly similar in their behaviour. If you expect s3.makedirs('my-bucket-name/testdir') to enable you to write files with paths that start with that string then: congratulations! There is in fact nothing for s3fs to do, the location was already writable.
@martindurant Would it make sense to either disable or raise a warning for `mkdir`, `mkdirs`, `makedir`, and `makedirs`? The takeaway for me from this issue is that these don't really do anything because of S3's object-store paradigm.
That would make sense if mkdir were only called by end users, but other libraries call it internally. Since the intent there is "make sure that files can be written as children of this path", the no-op does exactly the right thing, and errors and warnings would just get in the way.
@martindurant The silent "creation" of the key without an apparent directory causes `fsspec` to fail in certain cases as well. I'm not sure what the best way to fix this would be. Maybe the `check` and `create` kwargs are redundant in `fsspec`, or we need to change something here in `s3fs`.

The `check` kwarg (to see if we can touch the path) for `fsspec.FSMap` expects the "directory" key to exist. There is a `create` kwarg to make it "create" that folder, so `check` can run in cases where it doesn't exist. However, the current behavior is this: for `root = 'my_bucket/dir_i_want/'`, if it doesn't exist we can pass `create=True`, and it runs an `s3fs.core.S3FileSystem.mkdir(root)` call. However, when queried by `s3fs.core.S3FileSystem.exists(root)` just a couple of lines later, it still returns `False`. Hence it fails and tells us to pass `create=True`.
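A minimal reproduction of that behavior (the bucket name is hypothetical):

```python
import fsspec

# create=True triggers a mkdir(root), which is a no-op on S3 for
# anything below a bucket; the subsequent check=True exists() test
# therefore still fails.
mapper = fsspec.get_mapper(
    "s3://my_bucket/dir_i_want/", check=True, create=True
)
# -> raises ValueError, telling us to pass create=True (which we did)
```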
`check` works as expected if we create the folder via the console.
This is all related to trying to debug weird behavior I've seen at zarr-developers/zarr-python#814.
Possibly it's the `exists` call in FSMap that's not required. We next test whether we can write something; that should be enough. Or, for read-only (which is not explicitly labelled anywhere), we test that we can list the resultant map.
I agree, it sounds like `check` and `create` in `fsspec` are redundant. I will open an issue there to discuss further.
Howdy! In my project I have a somewhat complex application that must transparently work over a local folder or an S3 bucket. One of the ways to achieve this was to use s3fs-fuse to mount a bucket as a local folder through the OS. However, I had some limitations running s3fs-fuse in an OpenShift container, so I've been looking at a pure Python layer. Yet S3FileSystem is too limited to perfectly emulate the real file system.

So, I suggest S3FileSystem finally use "path/" as a format to keep folders and thus improve compatibility with the regular file systems which implement fsspec.AsyncFileSystem. This would apply to everything folder-related.
The problem you are facing is not really with s3fs, but with the fuse integration. Indeed, S3 does not have directories. You can do things to try to get around that (such as the zero-length files), but these are conventions circumventing the actual storage model and API. When using S3FileSystem directly, mkdir() always succeeds, in that you can write to a path within a directory (you always can). The problem arises in trying to `cd` to an empty directory; but you can write to a path within a directory without first creating or changing to that directory.
> The problem you are facing is not really with s3fs, but with the fuse integration. Indeed, s3 does not have directories. You can do things to try to get around that (such as the zero-length files), but these are conventions circumventing the actual storage model and API. When using S3FileSystem directly, mkdir() always succeeds, in that you can write to a path within a directory (you always can).
Well, I think we all know that. But the problem is that S3FileSystem tells the world it is a file system and implements the fsspec interface, while not actually implementing file-system operations with folders. So this breaks the contract.

It does the best that it can, given the limitations of the actual backend storage. mkdir() does work in the most important sense of making a path in which you can write files. Having a side effect of creating files seems like a bad idea, and even less filesystem-like.
An issue I encountered with this is when combining `simplecache` and a blob-style backend (in my case, abfs) with a pandas `to_parquet` call. I can't create a directory because `mkdir` just passes (no directories in blobs), but the `to_parquet` fails because the directory does not exist in the cache. Any suggestions on how to solve this?
If you really want to, you can

```python
fs.touch("bucket/path/")
# or
fs.touch("bucket/path/content")
```

This is not a directory, but depending on what's doing the checking, may be good enough for you.