Open drorata opened 5 years ago
Related, and somehow confusing as well, compare s3.mkdir('my-bucket/foo')
and `s3.mkdir('my-bucket/bar/'). The former creates an empty file and the latter a directory.
s3 doesn't actually have directories, only files. mkdir probably shouldn't do anything, ever. I'm not sure what making a directory in the console really does.
It is not 100% correct.
Consider:
s3.mkdir('my-bucket/testing/with_trailing_slash/')
s3.mkdir('my-bucket/testing/without_trailing_slash')
Yields the following:
One is treated as a folder/directory and the other as a file.
I understand that there is also a bug in s3fs, and I'll look into it (or you are welcome to!).
My point is, perhaps mkdir should simply error when given anything longer than a bucket, since the real way to make a folder is to make a file within it.
I tried to replace mkdir
with:
def mkdir(self, path, acl="", **kwargs):
""" Make new bucket or empty key
Parameters
----------
acl: str
ACL to set when creating
region_name : str
region in which the bucket should be created
"""
acl = acl or self.s3_additional_kwargs.get('ACL', '')
path = path.rstrip('/') + '/'
self.touch(path, acl=acl, **kwargs)
But some tests failed and I am not sure why...
BTW, @martindurant , could it be it is related also to https://stackoverflow.com/q/53558319/671013 ?
I am not a fan of using touch
as you suggest, because this actually creates an empty file on the server.
It would be interesting to know what information is returned on a "directory" that you create via the web interface.
If I create foo
in the bucket's root (my-bucket
), then S3FileSystem().ls('my-bucket')
includes foo
in the returned list. However, both S3FileSystem().ls('my-bucket/foo')
and S3FileSystem().ls('my-bucket/foo/')
raise a `FileNotFoundError.
I'm not sure what you don't like about touch
. It is used by mkdir
and it is an expected behavior that touching a non-existing file creates an empty one. In the S3 case, I guess depending on trailing /
it is determined whether a new empty file or directory is created.
Indeed, mkdir
currently calls touch
, the question is whether it should. It would surprise most people that, when asking for the creation of a directory, you actually get a (empty) file. Of course, the remote system doesn't really have directories.
I do not seem to find the same as your test case on real S3:
>>> s3.mkdir('MDtemp/foo')
>>> s32.ls('MDtemp/foo') # other instance
['MDtemp/foo']
>>> s32.ls('MDtemp/foo/')
['MDtemp/foo']
>>> s32.exists('MDtemp/foo')
True
Question: why do you find you want to make directories as all?
As far as the S3 console looks like, it seems like there is a difference between directories and files. In particular, you can have an empty directory. As far as I understand, this is the case when an object's name ends with a /
.
I implement a logic where first I want to have a directory (and verify its name, location etc.) and only then add content to it using different methods.
I didn't understand the example you shared. Can you elaborate?
The example was, I think , what you were trying to do.
From your description, I finally understand what you mean, and I agree that your solution of simply ensuring the path ends with '/' is reasonable. If you submit the PR, we can figure out why things are failing.
See PR I started.
I am experiencing several issues, some of which may be related to the new code. Please forgive me if do not understand how s3fs operates, what is expected and what is not — the result is at times confusing. Here is a test case:
import s3fs
s3 = s3fs.S3FileSystem()
bucket = 'test-bucket'
s3.mkdir(bucket)
assert bucket in s3.ls('') # ok
s3.ls(bucket) # raises a FileNotFoundError
s3.mkdir(f'{bucket}/dir1')
s3.ls(bucket) # still raises a FileNotFoundError
s3.touch(f'{bucket}/file')
assert f'{bucket}/file' in s3.ls(bucket) # ok
assert f'{bucket}/dir1' in s3.ls(bucket) # ok
s3.mkdir(f'{bucket}/dir2')
assert f'{bucket}/dir2' in s3.ls(bucket) # raises AssertionError
assert f'{bucket}/dir2' in s3.ls(bucket, refresh=True) # ok
s3.ls(f'{bucket}', detail=True) # ok
# directories have StorageClass 'DIRECTORY' and names without a trailing slash
s3.ls(f'{bucket}/dir1') # seemingly not ok. should it return an empty list instead?
# returns ['test-bucket/dir1/']
s3.ls(f'{bucket}/dir1', detail=True) # seemingly not ok
# the directory has StorageClass 'STANDARD' and a name with a trailing slash
s3.mkdir(f'{bucket}/dir1/dir3')
assert f'{bucket}/dir1/dir3' in s3.ls(f'{bucket}/dir1') # raises AssertionError
assert f'{bucket}/dir1/dir3' in s3.ls(f'{bucket}/dir1', refresh=True) # ok
s3.ls(f'{bucket}/dir1') # seemingly not ok. should it return just dir3?
# returns ['test-bucket/dir1/', 'test-bucket/dir1/dir3']
Again, it's not entirely clear what the behaviour should be, since S3 simulates folders but it really a key-value store. Perhaps s3fs simply shouldn't attempt mkdir
... Similarly, it is not clear whether ls(f'{bucket}/dir1')
should return 'test-bucket/dir1/'
or not, since now a key with exactly that value does exist. Note that you can perfectly well have a file and a directory-key and also a prefix all with identical keys.
From your code above, it does appear that the current implementation is not calling invalidate_cache
at the right moment.
Martin, might it make sense to mimic AWS CLI behaviour? In case of listing an empty directory, it returns nothing.
And listing a directory with contents, returns only the contents, excluding the directory itself.
@dvukolov , if you can specify exactly the behaviour you expect in every case, and (even better) phrase it in API calls and make a test case, please do! Getting the logic right here, including when we decide that a directory is missing as opposed to empty and when to clear the cache is very tricky!
I was wrong thinking that examples from AWS CLI would make things easier. Its behaviour is non-trivial, e.g. for prefixes it displays different output based on whether the path ends with a trailing slash or not:
$ aws s3 ls bucket/directory
PRE directory/
$ aws s3 ls bucket/directory/
2019-01-06 04:30:47 0
2019-01-06 04:31:23 35146 file
If the goal is to create a unified interface, then the filesystem spec that you authored makes a lot more sense. It is better to implement a familiar standard library-like interface rather than tailor too much to the specifics of each storage. In case of ls()
I would expect the following, similar to os.listdir(), os.scandir() and pathlib.Path.iterdir():
@martindurant , if you think that might be reasonable, I'll try my best to implement it in code together with test cases.
There is a related outstanding issue apache/arrow#3286 with isfile() and isdir() in pyarrow.filesystem.S3FSWrapper. Is there a reason why these methods are implemented in the wrapper class and not in s3fs.S3FileSystem?
Is there a reason why these methods are implemented in the wrapper class and not in s3fs.S3FileSystem?
I suppose they were not interested in pushing a PR upstream?
Note that fsspec has both, and s3fs will derive from that spec at some point. Also, since an s3 path can be simultaneously a key and a prefix, it is possible for isdir(path)
and isfile(path)
to be simultaneously true.
Late me take a moment to elucidate the two incompatible models that we are dealing with here.
1) the underlying key/value model, as surfaced by the prefix/delimiter API. Here, there are no directory objects at all below bucket level, and "common prefixes" are only listed and interpreted as directories if they contain files. In other words, mkdir/rmdir should do nothing, and empty directories are impossible. This is what s3fs started with, because of the API it uses, and so exists(mkdir(path)) may be False.
2) a convention is used by the s3 console and CLI, that empty keys with names ending in "/" are directories. However, such a key is not necessary for there to be a directory, you can also see one just by having a key exist at a deeper level. If now deleting a directory and it's contents, it would be necessary to differentiate between those directories that have associated (empty) keys on the underlying system, and those that don't.
In both cases, you may or may not a path which is both a key (perhaps with content) and a prefix/directory. What should info(path) do? Since ls(path) == [info(path)] for a path that points to a file, should the key in such cases be included in ls or not?
Also in both cases, a new file at a deep path (many '/' characters) may require invalidating directory listings at every level up to the root, if those directories had been previously listed, and now would contain another entry. I don't think we do this - but the situation is quite rare.
On another note, https://github.com/dask/s3fs/pull/161 shows that coercing s3fs to the fsspec model may work quite nicely and reduce the lines of code. It does not answer the questions here (and the directory-related tests are most of the ones that fail).
Consider the following minimal example:
For which the output is:
Note that in the first case, as long as the folder
foo
is empty it can be listed but not detected bys3.exists
. In contrast, in the second case, when the trailing/
is removed when creating the directory,.exists
detects the folder even when it is empty.I also created empty folders directly from the S3 console, and still
.exists
couldn't find them.If find it an un-expected behavior. Is it a bug or a feature?