fsspec / s3fs

S3 Filesystem
http://s3fs.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

BUG: Explicit parsing of AWS S3 endpoint results in inconsistent behavior for GET #400

Open MatinF opened 4 years ago

MatinF commented 4 years ago

Outline We're using s3fs to connect to S3 servers, e.g. to download data. In some cases the S3 server is a MinIO deployment and in other cases AWS S3. To handle this distinction, we typically pass the endpoint URL explicitly, as below:

import s3fs

def main_func():
    fs = s3fs.S3FileSystem(
        key="XX",
        secret="XX",
        client_kwargs={"endpoint_url": "http://s3.eu-central-1.amazonaws.com"},
    )

    test = fs.open("tf5test/4823E77F/00000048/00000001-5FB65005.MF4")
    print(len(test.read()))

if __name__ == "__main__":
    main_func()

Expected behavior When running the above, we would expect the length of the object we're trying to open to be printed consistently.

Actual behavior On most systems the above does produce the expected result, but on some systems the GET request fails. Specifically, below are partial debug outputs from a working vs. a non-working system running the same code:

Working system

DEBUG botocore.endpoint _do_get_response:140 - Sending http request: <AWSPreparedRequest stream_output=False, method=HEAD, url=http://s3.eu-central-1.amazonaws.com/tf5test/4823E77F/00000048/00000001-5FB65005.MF4, headers={'User-Agent': b'Botocore/1.17.43 Python/3.7.9 Windows/10', 'X-Amz-Date': b'20201120T103416Z', 'X-Amz-Content-SHA256': b'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855', 'Authorization': b'AWS4-HMAC-SHA256 Credential=AKIATDKWKZJD2AME4UXZ/20201120/eu-central-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=feb137f57fefc8743cf73b4790a9d3baef570922442d892cc4951dc6b3f4c83d'}>
DEBUG botocore.parsers parse:234 - Response headers: HTTPHeaderDict({'x-amz-id-2': 'qWolpseo/roMz33jonHXPgPxyjtGV5g4soxxWDHUNsJvZtN4GEza1BIJHvk7DSB7vEXVXE1Q+lU=', 'x-amz-request-id': '664EA4153E00EF73', 'date': 'Fri, 20 Nov 2020 10:34:17 GMT', 'last-modified': 'Thu, 19 Nov 2020 11:00:23 GMT', 'etag': '"0b7b628c7d29660d59a4aea328639e6c"', 'x-amz-meta-hw': '00.00', 'x-amz-meta-put-index': '2', 'x-amz-meta-fw': '01.03.01', 'x-amz-meta-ssid': 'WLAN-geschaeftlich', 'x-amz-meta-timestamp': '20201118T051030', 'accept-ranges': 'bytes', 'content-type': 'application/octet-stream', 'content-length': '40233854', 'server': 'AmazonS3'})

Non-working system

DEBUG botocore.endpoint _do_get_response:187 - Sending http request: <AWSPreparedRequest stream_output=True, method=GET, url=http://s3.eu-central-1.amazonaws.com/tf5test/4823E77F/00000048/00000001-5FB65005.MF4, headers={'Range': b'bytes=0-40233853', 'User-Agent': b'Botocore/1.17.43 Python/3.7.9 Windows/10', 'X-Amz-Date': b'20201119T133342Z', 'X-Amz-Content-SHA256': b'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855', 'Authorization': b'AWS4-HMAC-SHA256 Credential=AKIATDKWKZJD2AME4UXZ/20201119/eu-central-1/s3/aws4_request, SignedHeaders=host;range;x-amz-content-sha256;x-amz-date, Signature=b73e8ebb93927169b13565a8cbacfde483d6b8ab73903b32bd16826f05135547'}>
DEBUG botocore.parsers parse:234 - Response headers: {'x-amz-id-2': 'Q8qx6bLnj4+Z+D770bfAG7C1c+lNfWxkUvGj0FLKTFzNqIHNLeSqf1LUVXvw1nd0899/qQBC4dw=', 'x-amz-request-id': 'E5D8281755AA613E', 'Date': 'Thu, 19 Nov 2020 13:33:45 GMT', 'Last-Modified': 'Thu, 19 Nov 2020 11:00:23 GMT', 'ETag': '"0b7b628c7d29660d59a4aea328639e6c"', 'x-amz-meta-hw': '00.00', 'x-amz-meta-put-index': '2', 'x-amz-meta-fw': '01.03.01', 'x-amz-meta-ssid': 'WLAN-geschaeftlich', 'x-amz-meta-timestamp': '20201118T051030', 'Accept-Ranges': 'none', 'Content-Type': 'application/octet-stream', 'Content-Length': '0', 'Server': 'AmazonS3', 'Connection': 'keep-alive'}

As is evident, the non-working system produces a different debug output, including a Content-Length of '0', resulting in an error so that the file cannot be downloaded. Both systems run Windows 10 and Python 3.7.9 in virtual environments, with the pip freeze output below.
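One way to narrow down whether the failing ranged GET comes from s3fs or from botocore/the endpoint itself would be to issue the same request directly with botocore. The sketch below is illustrative only: the bucket, key, and "XX" credentials are the placeholders from the report, and the live client construction is commented out since it requires real credentials.

```python
def ranged_get(client, bucket, key, start, end):
    """Fetch bytes [start, end] of an S3 object via a single ranged GET,
    mirroring the request seen in the non-working debug log."""
    resp = client.get_object(Bucket=bucket, Key=key, Range=f"bytes={start}-{end}")
    return resp["Body"].read()

# Live usage against the same explicit endpoint (placeholders, not runnable as-is):
# import botocore.session
# session = botocore.session.get_session()
# client = session.create_client(
#     "s3",
#     endpoint_url="http://s3.eu-central-1.amazonaws.com",
#     aws_access_key_id="XX",
#     aws_secret_access_key="XX",
# )
# print(len(ranged_get(client, "tf5test",
#                      "4823E77F/00000048/00000001-5FB65005.MF4", 0, 40233853)))
```

If this direct call also returns Content-Length: 0 on the affected machine, the problem sits below s3fs.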

If we remove the explicit endpoint_url, however, it seems to work - in the sense that both systems are able to download the file correctly. However, we struggle to understand the logic behind this and are hence concerned that the error may reoccur. Any help is appreciated!
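The workaround described above (omitting the endpoint when the target is AWS) can be sketched as a small helper. This is one possible approach under the assumption that botocore's own endpoint resolution handles AWS correctly; the function name and the "amazonaws.com" substring check are ours, not part of s3fs.

```python
def client_kwargs_for(endpoint):
    """Build client_kwargs for s3fs.S3FileSystem: pass endpoint_url only for
    non-AWS servers (e.g. MinIO); return an empty dict for AWS endpoints so
    botocore resolves the endpoint itself."""
    if endpoint and "amazonaws.com" not in endpoint:
        return {"endpoint_url": endpoint}
    return {}

# Usage (kept as a comment so the helper itself stays dependency-free):
# fs = s3fs.S3FileSystem(key="XX", secret="XX",
#                        client_kwargs=client_kwargs_for(endpoint))
```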


pip freeze

aiobotocore==1.1.2
aiohttp==3.7.3
aioitertools==0.7.1
asammdf==5.23.1
async-timeout==3.0.1
attrs==19.3.0
bitstruct==8.11.0
botocore==1.17.44
can-decoder==0.0.2
canedge-browser==0.0.4
canmatrix==0.9.2
cchardet==2.1.5
certifi==2020.6.20
chardet==3.0.4
click==7.1.2
docutils==0.15.2
fsspec==0.8.4
future==0.18.2
idna==2.10
influxdb-client==1.10.0
jmespath==0.10.0
lxml==4.6.1
lz4==3.1.1
mdf-iter==0.0.2
multidict==5.0.2
natsort==7.0.1
numexpr==2.7.1
numpy==1.19.1
pandas==1.1.0
pathlib2==2.3.5
python-dateutil==2.8.1
pytz==2020.1
PyYAML==5.3.1
Rx==3.1.1
s3fs==0.5.1
six==1.15.0
typing-extensions==3.7.4.3
urllib3==1.25.10
wrapt==1.12.1
xlrd==1.2.0
XlsxWriter==1.3.7
xlwt==1.3.0
yarl==1.6.3
martindurant commented 4 years ago

s3fs certainly is known to work for some non-AWS S3 implementations. Can you also turn on s3fs logging, so we know which call this is happening in? I see that the first block is a HEAD call and the second a GET call with a bytes range. It may be a botocore error, or perhaps some setting is needed (perhaps following a redirect?)
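For reference, a minimal way to enable the requested logging, assuming only that s3fs and botocore use the standard logging module with their package names as logger names (which both do):

```python
import logging

def enable_s3_debug_logging():
    """Turn on DEBUG output for s3fs and botocore, in a format similar to
    the debug lines quoted in the report above."""
    logging.basicConfig(
        format="%(levelname)s %(name)s %(funcName)s:%(lineno)d - %(message)s"
    )
    for name in ("s3fs", "botocore"):
        logging.getLogger(name).setLevel(logging.DEBUG)

enable_s3_debug_logging()
```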

martindurant commented 3 years ago

@MatinF , any further details here?

MatinF commented 3 years ago

Hi Martin, I'm afraid we did not find further details/insight on this. We ended up with a solution that effectively removes the endpoint details when AWS is used.