fsspec / s3fs

S3 Filesystem
http://s3fs.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
877 stars 272 forks source link

In consistent behavior of existence check for empty folders #156

Open drorata opened 5 years ago

drorata commented 5 years ago

Consider the following minimal example:

import moto
import boto3
import s3fs

with moto.mock_s3():
    print("Case 1")
    conn = boto3.client('s3', 'us-west-2')
    conn.create_bucket(Bucket='test-bucket')

    s3 = s3fs.S3FileSystem()

    s3.mkdir('test-bucket/foo/')
    print(s3.ls('test-bucket'))
    print(s3.exists('test-bucket/foo/'))
    s3.touch('test-bucket/foo/bar')
    print(s3.exists('test-bucket/foo/'))

with moto.mock_s3():
    print("Case 2")
    conn = boto3.client('s3', 'us-west-2')
    conn.create_bucket(Bucket='test-bucket')

    s3 = s3fs.S3FileSystem()

    s3.mkdir('test-bucket/foo')  # A missing trailing / is the only difference
    print(s3.ls('test-bucket'))
    print(s3.exists('test-bucket/foo/'))
    s3.touch('test-bucket/foo/bar')
    print(s3.exists('test-bucket/foo/'))

For which the output is:

Case 1
['test-bucket/foo']
False
True
Case 2
['test-bucket/foo']
True
True

Note that in the first case, as long as the folder foo is empty it can be listed but not detected by s3.exists. In contrast, in the second case, when the trailing / is removed when creating the directory, .exists detects the folder even when it is empty.

I also created empty folders directly from the S3 console, and still .exists couldn't find them.

If find it an un-expected behavior. Is it a bug or a feature?

drorata commented 5 years ago

Related, and somehow confusing as well, compare s3.mkdir('my-bucket/foo') and `s3.mkdir('my-bucket/bar/'). The former creates an empty file and the latter a directory.

martindurant commented 5 years ago

s3 doesn't actually have directories, only files. mkdir probably shouldn't do anything, ever. I'm not sure what making a directory in the console really does.

drorata commented 5 years ago

It is not 100% correct.

Consider:

s3.mkdir('my-bucket/testing/with_trailing_slash/')
s3.mkdir('my-bucket/testing/without_trailing_slash')

Yields the following:

screenshot 2018-11-30 at 14 38 20

One is treated as a folder/directory and the other as a file.

martindurant commented 5 years ago

I understand that there is also a bug in s3fs, and I'll look into it (or you are welcome to!).

My point is, perhaps mkdir should simply error when given anything longer than a bucket, since the real way to make a folder is to make a file within it.

drorata commented 5 years ago

I tried to replace mkdir with:

def mkdir(self, path, acl="", **kwargs):
        """ Make new bucket or empty key

        Parameters
        ----------
        acl: str
            ACL to set when creating
        region_name : str
            region in which the bucket should be created
        """
        acl = acl or self.s3_additional_kwargs.get('ACL', '')
        path = path.rstrip('/') + '/'
        self.touch(path, acl=acl, **kwargs)

But some tests failed and I am not sure why...

drorata commented 5 years ago

BTW, @martindurant , could it be it is related also to https://stackoverflow.com/q/53558319/671013 ?

martindurant commented 5 years ago

I am not a fan of using touch as you suggest, because this actually creates an empty file on the server. It would be interesting to know what information is returned on a "directory" that you create via the web interface.

drorata commented 5 years ago

If I create foo in the bucket's root (my-bucket), then S3FileSystem().ls('my-bucket') includes foo in the returned list. However, both S3FileSystem().ls('my-bucket/foo') and S3FileSystem().ls('my-bucket/foo/') raise a `FileNotFoundError.

drorata commented 5 years ago

I'm not sure what you don't like about touch. It is used by mkdir and it is an expected behavior that touching a non-existing file creates an empty one. In the S3 case, I guess depending on trailing / it is determined whether a new empty file or directory is created.

martindurant commented 5 years ago

Indeed, mkdir currently calls touch, the question is whether it should. It would surprise most people that, when asking for the creation of a directory, you actually get a (empty) file. Of course, the remote system doesn't really have directories.

I do not seem to find the same as your test case on real S3:

>>> s3.mkdir('MDtemp/foo')
>>> s32.ls('MDtemp/foo')  # other instance
['MDtemp/foo']
>>> s32.ls('MDtemp/foo/')
['MDtemp/foo']
>>> s32.exists('MDtemp/foo')
True

Question: why do you find you want to make directories as all?

drorata commented 5 years ago

As far as the S3 console looks like, it seems like there is a difference between directories and files. In particular, you can have an empty directory. As far as I understand, this is the case when an object's name ends with a /.

I implement a logic where first I want to have a directory (and verify its name, location etc.) and only then add content to it using different methods.

I didn't understand the example you shared. Can you elaborate?

martindurant commented 5 years ago

The example was, I think , what you were trying to do.

From your description, I finally understand what you mean, and I agree that your solution of simply ensuring the path ends with '/' is reasonable. If you submit the PR, we can figure out why things are failing.

drorata commented 5 years ago

See PR I started.

dvukolov commented 5 years ago

I am experiencing several issues, some of which may be related to the new code. Please forgive me if do not understand how s3fs operates, what is expected and what is not — the result is at times confusing. Here is a test case:

import s3fs
s3 = s3fs.S3FileSystem()
bucket = 'test-bucket'

s3.mkdir(bucket)
assert bucket in s3.ls('')  # ok
s3.ls(bucket)  # raises a FileNotFoundError

s3.mkdir(f'{bucket}/dir1')
s3.ls(bucket)    # still raises a FileNotFoundError

s3.touch(f'{bucket}/file')
assert f'{bucket}/file' in s3.ls(bucket)  # ok
assert f'{bucket}/dir1' in s3.ls(bucket)  # ok

s3.mkdir(f'{bucket}/dir2')
assert f'{bucket}/dir2' in s3.ls(bucket)  # raises AssertionError
assert f'{bucket}/dir2' in s3.ls(bucket, refresh=True)  # ok

s3.ls(f'{bucket}', detail=True)  # ok
# directories have StorageClass 'DIRECTORY' and names without a trailing slash

s3.ls(f'{bucket}/dir1')  # seemingly not ok. should it return an empty list instead?
# returns ['test-bucket/dir1/']

s3.ls(f'{bucket}/dir1', detail=True)  # seemingly not ok
# the directory has StorageClass 'STANDARD' and a name with a trailing slash

s3.mkdir(f'{bucket}/dir1/dir3')
assert f'{bucket}/dir1/dir3' in s3.ls(f'{bucket}/dir1')  # raises AssertionError
assert f'{bucket}/dir1/dir3' in s3.ls(f'{bucket}/dir1', refresh=True)  # ok

s3.ls(f'{bucket}/dir1')  # seemingly not ok. should it return just dir3?
# returns ['test-bucket/dir1/', 'test-bucket/dir1/dir3']
martindurant commented 5 years ago

Again, it's not entirely clear what the behaviour should be, since S3 simulates folders but it really a key-value store. Perhaps s3fs simply shouldn't attempt mkdir... Similarly, it is not clear whether ls(f'{bucket}/dir1') should return 'test-bucket/dir1/' or not, since now a key with exactly that value does exist. Note that you can perfectly well have a file and a directory-key and also a prefix all with identical keys.

From your code above, it does appear that the current implementation is not calling invalidate_cache at the right moment.

dvukolov commented 5 years ago

Martin, might it make sense to mimic AWS CLI behaviour? In case of listing an empty directory, it returns nothing.

dvukolov commented 5 years ago

And listing a directory with contents, returns only the contents, excluding the directory itself.

martindurant commented 5 years ago

@dvukolov , if you can specify exactly the behaviour you expect in every case, and (even better) phrase it in API calls and make a test case, please do! Getting the logic right here, including when we decide that a directory is missing as opposed to empty and when to clear the cache is very tricky!

dvukolov commented 5 years ago

I was wrong thinking that examples from AWS CLI would make things easier. Its behaviour is non-trivial, e.g. for prefixes it displays different output based on whether the path ends with a trailing slash or not:

$ aws s3 ls bucket/directory
                           PRE directory/
$ aws s3 ls bucket/directory/
2019-01-06 04:30:47          0
2019-01-06 04:31:23      35146 file

If the goal is to create a unified interface, then the filesystem spec that you authored makes a lot more sense. It is better to implement a familiar standard library-like interface rather than tailor too much to the specifics of each storage. In case of ls() I would expect the following, similar to os.listdir(), os.scandir() and pathlib.Path.iterdir():

@martindurant , if you think that might be reasonable, I'll try my best to implement it in code together with test cases.

dvukolov commented 5 years ago

There is a related outstanding issue apache/arrow#3286 with isfile() and isdir() in pyarrow.filesystem.S3FSWrapper. Is there a reason why these methods are implemented in the wrapper class and not in s3fs.S3FileSystem?

martindurant commented 5 years ago

Is there a reason why these methods are implemented in the wrapper class and not in s3fs.S3FileSystem?

I suppose they were not interested in pushing a PR upstream?

Note that fsspec has both, and s3fs will derive from that spec at some point. Also, since an s3 path can be simultaneously a key and a prefix, it is possible for isdir(path) and isfile(path) to be simultaneously true.

martindurant commented 5 years ago

Late me take a moment to elucidate the two incompatible models that we are dealing with here.

1) the underlying key/value model, as surfaced by the prefix/delimiter API. Here, there are no directory objects at all below bucket level, and "common prefixes" are only listed and interpreted as directories if they contain files. In other words, mkdir/rmdir should do nothing, and empty directories are impossible. This is what s3fs started with, because of the API it uses, and so exists(mkdir(path)) may be False.

2) a convention is used by the s3 console and CLI, that empty keys with names ending in "/" are directories. However, such a key is not necessary for there to be a directory, you can also see one just by having a key exist at a deeper level. If now deleting a directory and it's contents, it would be necessary to differentiate between those directories that have associated (empty) keys on the underlying system, and those that don't.

In both cases, you may or may not a path which is both a key (perhaps with content) and a prefix/directory. What should info(path) do? Since ls(path) == [info(path)] for a path that points to a file, should the key in such cases be included in ls or not?

Also in both cases, a new file at a deep path (many '/' characters) may require invalidating directory listings at every level up to the root, if those directories had been previously listed, and now would contain another entry. I don't think we do this - but the situation is quite rare.

On another note, https://github.com/dask/s3fs/pull/161 shows that coercing s3fs to the fsspec model may work quite nicely and reduce the lines of code. It does not answer the questions here (and the directory-related tests are most of the ones that fail).