fsspec / s3fs

S3 Filesystem
http://s3fs.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Empty files with trailing slash are sometimes treated as directories and sometimes treated as regular files #439

Open isidentical opened 3 years ago

isidentical commented 3 years ago
import boto3
from s3fs import S3FileSystem
from pprint import pprint
TEST_AWS_S3_PORT = 5555
TEST_AWS_ENDPOINT_URL = f'http://127.0.0.1:{TEST_AWS_S3_PORT}/'

boto_client = boto3.client('s3', endpoint_url=TEST_AWS_ENDPOINT_URL)
fs = S3FileSystem(client_kwargs={'endpoint_url': TEST_AWS_ENDPOINT_URL})

boto_client.create_bucket(Bucket='test-bucket')
boto_client.put_object(
    Bucket='test-bucket', Key='empty-dir/', Body='',
)

pprint(fs.ls('test-bucket', detail=True))
pprint(fs.info('test-bucket/empty-dir/'))
print(fs.isdir('test-bucket/empty-dir/'))
print(fs.ls('test-bucket/empty-dir/'))

The code above first creates an empty object whose key ends with a trailing slash. It then runs s3fs's ls on the parent bucket, which identifies that object as a directory:

[{'Key': 'test-bucket/empty-dir',
  'Size': 0,
  'StorageClass': 'DIRECTORY',
  'name': 'test-bucket/empty-dir',
  'size': 0,
  'type': 'directory'}]

The second and third calls (info() and isdir()) also claim it is a directory:

{'Key': 'test-bucket/empty-dir',
 'Size': 0,
 'StorageClass': 'DIRECTORY',
 'name': 'test-bucket/empty-dir',
 'size': 0,
 'type': 'directory'}
True

However, ls, walk, etc. treat it as a file. This is the result of fs.ls('test-bucket/empty-dir/'):

['test-bucket/empty-dir/']

I would have expected it to return an empty list instead.
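
As a caller-side stopgap, the marker entry can be filtered out of the listing. This is only a minimal sketch, assuming ls returns plain name strings as in the output above; children_without_marker is a hypothetical helper, not part of s3fs:

```python
def children_without_marker(names, path):
    """Drop the entry that is the listed directory itself (hypothetical helper)."""
    # Compare with trailing slashes stripped, so 'a/b/' and 'a/b' count as the same path.
    target = path.rstrip("/")
    return [name for name in names if name.rstrip("/") != target]


# The listing reported above for 'test-bucket/empty-dir/'
listing = ["test-bucket/empty-dir/"]
result = children_without_marker(listing, "test-bucket/empty-dir/")
print(result)  # []
```

This does not change what info() or isdir() report. It only makes the listing of an empty "directory" come back empty.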

mvashishtha commented 1 year ago

@martindurant today I was bitten by a similar issue in s3fs.core.S3FileSystem.isfile. I had an S3 bucket like the (currently existing) bucket modin-datasets, and it had an empty file testing/ in it, i.e. an object at s3://modin-datasets/testing/. There were also objects like modin-datasets/testing/test_data.parquet.

When I list the contents of 'modin-datasets/testing/', I see my object at 'modin-datasets/testing/':

from fsspec.core import url_to_fs

fs, path = url_to_fs("s3://modin-datasets/testing/")
# this prints a list including 'modin-datasets/testing/',  'modin-datasets/testing/test_data.parquet', ...
fs.ls('modin-datasets/testing/')

but my filesystem doesn't recognize modin-datasets/testing/ as a file!

assert not fs.isfile('modin-datasets/testing/')

The consequence was that I spent a long time trying to debug why s3fs was trying to treat my directory as a file, until I finally realized it was just trying to open a file it correctly found, but then could no longer recognize as a file! Indeed, fs.open('modin-datasets/testing/').read() gives me valid contents, b''.

Is this a bug in s3fs? Is it a separate issue? How does it relate to #562?

martindurant commented 1 year ago

There are a few ideas in conflict with this kind of thing, where a file and a directory have exactly the same name, including the trailing "/". This situation could not, of course, happen on a POSIX filesystem.

The ls method is designed to return a list of entries, so the same name can appear twice with different details. However, info only fetches one of these, and isfile/isdir rely on info.
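
To make the conflict concrete, here is a rough sketch of how one name can carry two types in a detailed listing. It assumes ls(detail=True)-style dicts with 'name' and 'type' keys, as in the output earlier in this thread. The entries are made up for illustration rather than captured from s3fs, and types_for is a hypothetical helper:

```python
def types_for(entries, path):
    """Collect every type reported for `path` in an ls-style detailed listing."""
    # Ignore trailing slashes so 'bucket/testing/' and 'bucket/testing' match.
    target = path.rstrip("/")
    return {e["type"] for e in entries if e["name"].rstrip("/") == target}


# Illustrative listing (not real s3fs output): the same name appears twice.
entries = [
    {"name": "bucket/testing/", "size": 0, "type": "file"},
    {"name": "bucket/testing", "size": 0, "type": "directory"},
    {"name": "bucket/testing/test_data.parquet", "size": 123, "type": "file"},
]

kinds = types_for(entries, "bucket/testing/")
print(sorted(kinds))  # ['directory', 'file']
```

A caller that needs to know whether an object really exists at the key could check for "file" in the returned set rather than relying on a single info() result.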