fsspec / s3fs

S3 Filesystem
http://s3fs.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Pagination problem when listing directories (1000 file limit) #279

Open rabernat opened 4 years ago

rabernat commented 4 years ago

I am prototyping an S3-compatible storage service called the Open Storage Network (OSN).

I have encountered a problem with how s3fs is listing directories which appears to be related to pagination. Basically, s3fs thinks there are only 1000 objects in the directory and refuses to even try to read objects that don't show up in this initial list.

import boto3
import s3fs
assert s3fs.__version__ == '0.4.0'

# read-only credentials to bucket, okay to share publicly 
access_key = "EL456I5ZRYB44RB6J7Q4"
secret_key = "QydNAjMWBTOLRjHiA36uMvhBvI4WeTxWYNJ5oaiP"
endpoint_url = "https://ncsa.osn.xsede.org"

# create boto client
s3 = boto3.client('s3',
                  aws_access_key_id=access_key,
                  aws_secret_access_key=secret_key,
                  endpoint_url=endpoint_url)
# verify credentials
assert s3.list_buckets()['Buckets'][0]['Name'] == 'Pangeo'

# list the bucket using the recommended boto pagination technique
# https://boto3.amazonaws.com/v1/documentation/api/latest/guide/paginators.html#filtering-results
paginator = s3.get_paginator('list_objects')
operation_parameters = {'Bucket': 'Pangeo',
                        'Prefix': 'cm26_control_temp.zarray'}
page_iterator = paginator.paginate(**operation_parameters)
# the directory should have 2402 objects in it
for page in page_iterator:
    print(len(page['Contents']))
# > 1000
# > 1000
# > 402 
# Correctly finds all 2402 objects

print(page['Contents'][-1]['Key'])
# > 'cm26_control_temp.zarray/99.9.0.0'

# now try with s3fs
fs = s3fs.S3FileSystem(key=access_key, secret=secret_key,
                       client_kwargs={'endpoint_url': endpoint_url})

listing = fs.listdir('Pangeo/cm26_control_temp.zarray')
print(len(listing))
# > 1000

# try to read a file that did not make it into the list
with fs.open('Pangeo/cm26_control_temp.zarray/99.9.0.0') as f:
    pass
# > FileNotFoundError: Pangeo/cm26_control_temp.zarray/99.9.0.0

This feels very much like a bug in s3fs. (A somewhat similar issue was noted in https://github.com/dask/s3fs/issues/253#issuecomment-557516952, including the 1000 file limit.) In fact, I would identify two distinct bugs:

1. s3fs only lists the first 1000 objects in a directory, i.e. it does not paginate the listing.
2. s3fs raises FileNotFoundError when asked to open an existing object that did not appear in that truncated listing (likely because it trusts its cached listing).

For the first issue, one possible hint is that the AWS CLI makes the same mistake:

aws s3 --profile osn-rw ls --recursive s3://Pangeo/cm26_control_temp.zarray/ | wc -l
# > 1000

So perhaps there is something in the metadata of the OSN service that is tricking the paginators in some circumstances.

This issue is rather important to Pangeo, as we are keen to get some accurate benchmarks on this new storage service. Help would be sincerely appreciated.

martindurant commented 4 years ago

According to the boto docs, list_objects_v2 (the method s3fs uses) and the list_objects variant are both documented to return at most 1000 objects per call, although the paginator docs suggest that the latter returns 1000 "at a time" (which I thought was the point). Both methods take a MaxKeys parameter with no stated default.
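For what it's worth, going past 1000 keys requires feeding the continuation token from each response back into the next request, roughly like this (a sketch, reusing the client, bucket, and prefix from the example above):

# Sketch: manual pagination with list_objects_v2. Each call returns at most
# 1000 keys (or MaxKeys, if set); the NextContinuationToken from one response
# must be passed as ContinuationToken on the next request.
kwargs = {'Bucket': 'Pangeo', 'Prefix': 'cm26_control_temp.zarray'}
keys = []
while True:
    resp = s3.list_objects_v2(**kwargs)
    keys.extend(obj['Key'] for obj in resp.get('Contents', []))
    if not resp.get('IsTruncated'):
        break
    # without NextContinuationToken there is no way to request the next page
    kwargs['ContinuationToken'] = resp['NextContinuationToken']
print(len(keys))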

rabernat commented 4 years ago

Thanks for your reply. But I don't understand what to conclude from it. Do you think this is something that needs to be fixed in s3fs or not? The bottom line is that, using boto paginators, I am able to correctly list the objects, but with s3fs I am not. I'd be happy to make a PR if you can recommend a course of action.

martindurant commented 4 years ago

I would try swapping in list_objects_v2 for list_objects (your example uses the latter).

rabernat commented 4 years ago

Your hunch was correct. If I do paginator = s3.get_paginator('list_objects_v2'), it only gets the first 1000 results.

So this is somehow a problem with the API service?

martindurant commented 4 years ago

So this is somehow a problem with the API service?

I have no idea! I don't know why there are two versions in the first place. If the structure of what is returned by list_objects is the same, it should be simple to change the code. @jacobtomlinson , since you looked recently at the botocore API, do you have extra information?
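In the meantime, checking whether the two calls return the same structure should just be a matter of comparing the responses (an untested sketch against the OSN client defined above):

# Sketch: compare the fields returned by the v1 and v2 listing calls.
v1 = s3.list_objects(Bucket='Pangeo', Prefix='cm26_control_temp.zarray')
v2 = s3.list_objects_v2(Bucket='Pangeo', Prefix='cm26_control_temp.zarray')
print(sorted(v1.keys()))
print(sorted(v2.keys()))
# Note that list_objects paginates with Marker rather than a continuation
# token, so the token handling in s3fs would have to change as well.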

rabernat commented 4 years ago

Ok, so I think it's a problem with the API service in OSN. Compare regular s3:

import boto3
from botocore import UNSIGNED
from botocore.client import Config

s3pub = boto3.client('s3', config=Config(signature_version=UNSIGNED))
resp = s3pub.list_objects_v2(Bucket='mur-sst', Prefix='zarr/analysed_sst')
print(list(resp.keys()))

gives

['ResponseMetadata',
 'IsTruncated',
 'Contents',
 'Name',
 'Prefix',
 'MaxKeys',
 'EncodingType',
 'KeyCount',
 'NextContinuationToken']

Now for OSN:

access_key = "EL456I5ZRYB44RB6J7Q4"
secret_key = "QydNAjMWBTOLRjHiA36uMvhBvI4WeTxWYNJ5oaiP"
endpoint_url = "https://ncsa.osn.xsede.org"
s3 = boto3.client('s3',
                  aws_access_key_id=access_key,
                  aws_secret_access_key=secret_key,
                  endpoint_url=endpoint_url)
resp = s3.list_objects_v2(Bucket='Pangeo', Prefix='cm26_control_temp.zarray',
                          ContinuationToken='string')
print(list(resp.keys()))

gives

['ResponseMetadata',
 'IsTruncated',
 'Contents',
 'Name',
 'Prefix',
 'MaxKeys',
 'EncodingType']

OSN does not return KeyCount and NextContinuationToken. In particular, the absence of NextContinuationToken makes it impossible to paginate the results.
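This is easy to confirm directly from the response above (IsTruncated is returned, but there is no token to continue with):

# The listing may claim more results exist, yet provides nothing to paginate with.
print(resp.get('IsTruncated'))
print('NextContinuationToken' in resp)  # False on the OSN endpoint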

jacobtomlinson commented 4 years ago

Hmm, it seems a shame that OSN doesn't return those keys; they will certainly be needed for pagination. Is this an issue you can raise with them?

jdmaloney commented 4 years ago

@rabernat This issue got brought to my attention today (I'm one of the folks working on OSN). Ceph is being used as the backing store for the project and this seems to be an outstanding issue with the Rados Gateway implementation. I'm still looking to see if there are updates newer than 11 months ago, but the most recent information I've found so far is here

rabernat commented 4 years ago

Thanks @jdmaloney for your reply! While perhaps we could manage to work around this in s3fs (say, by creating an option to use list_objects rather than list_objects_v2), my strong preference would be for an upstream fix in ceph. But the timeline you referenced above is not encouraging. 😬

There is, however, a separate issue that I raised above that has nothing to do with OSN:

s3fs is incorrectly raising a FileNotFoundError when I try to open an existing object (likely related to caching)

Since we are using consolidated metadata for the zarr store, a directory listing should never be necessary. All the keys we need are known a priori from the metadata. @martindurant -- is there a way to bypass the automatic listing / caching that s3fs is performing? The objects are there: s3fs just needs to let me read them, rather than believing its (incorrect) cache of the directory listing.
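For context, the access pattern is roughly the following (a sketch; the store root shown here is illustrative, and it assumes a consolidated .zmetadata object exists at that root):

import s3fs
import zarr

fs = s3fs.S3FileSystem(key=access_key, secret=secret_key,
                       client_kwargs={'endpoint_url': endpoint_url})
# S3Map exposes the prefix as a key-value store. zarr.open_consolidated reads
# all array/group metadata from the single ".zmetadata" object, so the chunk
# keys are known up front and no directory listing should be required.
store = s3fs.S3Map('Pangeo/cm26_control', s3=fs, check=False)  # illustrative root
group = zarr.open_consolidated(store, mode='r')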

martindurant commented 4 years ago

I thought that was indeed the model - if the file is not already in the cache, a HEAD request is made. That might only be on master. Should be compared with what happens in gcsfs too.

rabernat commented 4 years ago

Works:

# download file with boto client
s3.download_file('Pangeo', 'cm26_control_temp.zarray/99.9.0.0', '/dev/null')

Fails:

# download file with s3fs
fs.download('Pangeo/cm26_control_temp.zarray/99.9.0.0', '/dev/null')
# > FileNotFoundError: Pangeo/cm26_control_temp.zarray/99.9.0.0

s3fs version is '0.4.0'.

smishra commented 4 years ago

I too face the same problem, as my directories have more than 3000 files. Is there any workaround?

martindurant commented 4 years ago

@smishra, are you using the latest s3fs release or master? Perhaps we need a release.

smishra commented 4 years ago

I installed it yesterday (pip install s3fs) on my CentOS image. The version it shows: 0.4.2.

Name: s3fs
Version: 0.4.2
Summary: Convenient Filesystem interface over S3
Home-page: http://github.com/dask/s3fs/

martindurant commented 4 years ago

Would you be willing to try with master?

smishra commented 4 years ago

Let me try. Thanks

smishra commented 4 years ago

I tried again after recreating my environment with version 0.4.2, and it seems to work in the Python REPL. I will integrate it into my PySpark job and see if it works. Looks like I have to invalidate the cache though.
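Something along these lines appears to do the trick (a sketch; the bucket/key here is hypothetical):

import s3fs

fs = s3fs.S3FileSystem()
# Clear s3fs's cached directory listings so the next access goes back to S3
# instead of trusting a stale (or truncated) listing.
fs.invalidate_cache()
with fs.open('my-bucket/path/to/new-object') as f:  # hypothetical key
    data = f.read()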

mmgaggle commented 4 years ago

Ceph merged ListObjectsV2 support over a year ago in this PR. The reporter of the Ceph tracker issue linked above worked around it by using ListObjects, once they noticed they had made a mistake:

Then tried to use the listobjects() function but I've made a mistake that I've used the Marker instead of the given NextMarker

If for whatever reason the Ceph cluster cannot be upgraded to a release that supports ListObjectsV2, then using ListObjects would be the workaround. If there are problems with ListObjects, then a Ceph tracker issue should be filed.
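For anyone going the ListObjects route, the marker handling looks roughly like this (a sketch, reusing the boto client and bucket from the examples above; note that NextMarker is only returned when a Delimiter is specified, otherwise the last key seen serves as the marker for the next page):

# Sketch: manual pagination with the v1 ListObjects call.
kwargs = {'Bucket': 'Pangeo', 'Prefix': 'cm26_control_temp.zarray'}
keys = []
while True:
    resp = s3.list_objects(**kwargs)
    contents = resp.get('Contents', [])
    keys.extend(obj['Key'] for obj in contents)
    if not resp.get('IsTruncated'):
        break
    # Use NextMarker when present; otherwise continue from the last key seen.
    kwargs['Marker'] = resp.get('NextMarker', contents[-1]['Key'])
print(len(keys))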

jdmaloney commented 4 years ago

@mmgaggle Sorry we didn't update this thread: the cluster was updated back in February, and we confirmed with @rabernat that everything worked and was resolved. The cluster was one dot release behind where that patch got merged in; just our luck :)

mmgaggle commented 4 years ago

No worries, I had some colleagues bump into this same issue and there was confusion about what was the right thing to do. My comment was just as much about making sure other folks who stumble across this know what their options are. Glad to hear the cluster you were talking to got updated, and that you're in the clear! :)