boto / botocore

The low-level, core functionality of boto3 and the AWS CLI.

Setting Content-Length still results in Transfer-Encoding set to "chunked" #911

Closed: timuralp closed this issue 7 years ago

timuralp commented 8 years ago

I'm encountering an issue where uploading a file with the ContentLength parameter set in the put_object() call still results in Transfer-Encoding being set to chunked. This, in turn, causes the upload to fail with the following error:

2016-05-05 23:37:25,374 botocore.vendored.requests.packages.urllib3.connectionpool [DEBUG] "PUT /AUTH_test/test/bash_logout HTTP/1.1" 501 None
DEBUG:botocore.vendored.requests.packages.urllib3.connectionpool:"PUT /AUTH_test/test/bash_logout HTTP/1.1" 501 None
2016-05-05 23:37:25,388 botocore.parsers [DEBUG] Response headers: {'x-amz-id-2': 'ZTpbRioCBOJR3RwdOUDSHgwMBABo8O8oFSpmvyG7F3turQ6Q9MPRg5JMjxI49Rk57MXudMm6lSQ=', 'server': 'AmazonS3', 'transfer-encoding': 'chunked', 'connection': 'close', 'x-amz-request-id': '008D7F33A19459E1', 'date': 'Thu, 05 May 2016 23:37:17 GMT', 'content-type': 'application/xml'}
DEBUG:botocore.parsers:Response headers: {'x-amz-id-2': 'ZTpbRioCBOJR3RwdOUDSHgwMBABo8O8oFSpmvyG7F3turQ6Q9MPRg5JMjxI49Rk57MXudMm6lSQ=', 'server': 'AmazonS3', 'transfer-encoding': 'chunked', 'connection': 'close', 'x-amz-request-id': '008D7F33A19459E1', 'date': 'Thu, 05 May 2016 23:37:17 GMT', 'content-type': 'application/xml'}
2016-05-05 23:37:25,397 botocore.parsers [DEBUG] Response body:
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>NotImplemented</Code><Message>A header you provided implies functionality that is not implemented</Message><Header>Transfer-Encoding</Header><RequestId>008D7F33A19459E1</RequestId><HostId>ZTpbRioCBOJR3RwdOUDSHgwMBABo8O8oFSpmvyG7F3turQ6Q9MPRg5JMjxI49Rk57MXudMm6lSQ=</HostId></Error>

I believe the issue is in the botocore/vendored/requests/models.py file, specifically:

        if is_stream:
            body = data

            if files:
                raise NotImplementedError('Streamed bodies and files are mutually exclusive.')

            if length is not None:
                self.headers['Content-Length'] = builtin_str(length)
            else:
                self.headers['Transfer-Encoding'] = 'chunked'

My invocation is as follows:

s3_client.put_object(Bucket=self.aws_bucket, Key=key,
                     Body=input_stream,
                     ContentLength=length)

and I expect it to set Content-Length and not use chunked encoding. I'll submit a patch that considers the request headers when determining which encoding to use (which also unblocks my usage of boto3).
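Roughly, something along these lines in prepare_body (a sketch only, not the actual patch):

# Fall back to chunked transfer encoding only when the caller did not
# already supply a Content-Length header (e.g. via ContentLength= in
# put_object).
if is_stream:
    body = data

    if files:
        raise NotImplementedError('Streamed bodies and files are mutually exclusive.')

    if length is not None:
        self.headers['Content-Length'] = builtin_str(length)
    elif 'Content-Length' not in self.headers:
        self.headers['Transfer-Encoding'] = 'chunked'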

timuralp commented 8 years ago

Looks like this issue arises when v2 signatures are used, but v4 signatures allow PUT requests to use chunked uploads.

I wanted to expand on this comment. Since the v2 signer does not support chunked PUT requests, boto should never set the Transfer-Encoding header when v2 signatures are used. That means that, at the time of preparing the request body, there has to be some consideration of which signer is being used, and possibly an exception raised when the signer is v2 but the Content-Length header is not set.
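For reference, a minimal sketch of selecting the v4 signer explicitly via the client config (only applicable when the target endpoint supports SigV4; region and credentials are assumed to be configured elsewhere):

import boto3
from botocore.config import Config

# With SigV4, S3 accepts streaming uploads (aws-chunked payload signing),
# which the v2 signer cannot produce.
s3_client = boto3.client('s3', config=Config(signature_version='s3v4'))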

timuralp commented 8 years ago

Any comment on this? Do you folks believe it's not a bug, or that it's not worth fixing because it affects the v2 signer?

JordonPhillips commented 7 years ago

I was able to reproduce with the following:

import botocore.session
from botocore.config import Config
import io

class NonSeekableStream(object):
    def __init__(self, contents):
        self._body = io.BytesIO(contents)

    def __iter__(self):
        return self._body.__iter__()

    def read(self, limit=-1):
        return self._body.read(limit)

sess = botocore.session.Session()
sess.set_debug_logger()
config = Config(s3={'signature_version': 's3'})
client = sess.create_client('s3', config=config)
body = NonSeekableStream(b'foo bar baz')

client.put_object(
    Bucket='mybucket',
    Key='foo',
    Body=body,
    ContentLength=11,
    ContentMD5='ab07acbb1e496801937adfa772424bf7',
)

This failure hinges on the object being non-seekable, which we don't support (docs). Even if we removed the transfer encoding, other layers (such as calculating the body checksum) would fail due to the lack of seek & tell.

As an alternative, you can use upload_fileobj, which does support non-seekable files. It does this by buffering data until you hit your configured multipart threshold, at which point it uploads in multiple parts whose size you can also configure.
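For illustration, a minimal sketch of that approach (bucket, key, and sizes are placeholders):

import io

import boto3
from boto3.s3.transfer import TransferConfig

# Any file-like object with a read() method works here, seekable or not.
stream = io.BytesIO(b'example payload')

config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # buffer this much before switching to multipart
    multipart_chunksize=16 * 1024 * 1024,  # size of each part once multipart kicks in
)

s3 = boto3.client('s3')
s3.upload_fileobj(stream, 'mybucket', 'mykey', Config=config)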

timuralp commented 7 years ago

I'm surprised to see a comment on this issue!

Correct -- I did not need to include this patch once I wrapped the incoming stream in an object that has a seek() method. I worked around the md5 checksum requirement by disabling it (in my use case, the md5 hash is already pre-computed for the upload, and computing it twice does not make sense). After disabling the md5 check, the only time seek is called is on retry, with seek(0), which I can handle by reopening the incoming stream (or boto could skip the retry when seek() is not implemented). Having explored this, I don't think that many layers fail, but I understand the reluctance to fix this issue.
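A rough sketch of that kind of wrapper (illustrative names, not the exact code used here): it exposes read()/seek()/tell() and handles the retry case by reopening the source stream when asked to rewind to the start.

class ReopenableStream(object):
    def __init__(self, open_stream):
        # open_stream is a zero-argument callable that returns a fresh
        # file-like object for the source data.
        self._open_stream = open_stream
        self._stream = open_stream()
        self._pos = 0

    def read(self, limit=-1):
        data = self._stream.read(limit)
        self._pos += len(data)
        return data

    def tell(self):
        return self._pos

    def seek(self, offset, whence=0):
        # Only rewinding to the beginning is supported, which is the only
        # seek issued in this scenario (seek(0) before a retry).
        if offset == 0 and whence == 0:
            self._stream = self._open_stream()
            self._pos = 0
            return 0
        raise OSError('only seek(0) is supported')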

The upload_fileobj suggestion does not seem like something I'd want to use, however. The stream is non-seekable because it comes from another storage system. Using multipart uploads for it is problematic, because the ETag computation then differs from the md5 hash already provided. Hence, for smaller objects I would want to disable the multipart behavior (by setting the threshold to the maximum object size), but buffering the entire object in memory would not work very well.

JordonPhillips commented 7 years ago

@timuralp Since you have the MD5 beforehand, you could send it up as metadata on the object. That would give you access to it later on while still allowing you to take advantage of multipart uploading.
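A hedged sketch of that suggestion (the metadata key name 'source-md5' is arbitrary, and the hash value, bucket, and key are placeholders):

import io

import boto3
from boto3.s3.transfer import TransferConfig

precomputed_md5 = 'ab07acbb1e496801937adfa772424bf7'  # placeholder hash
stream = io.BytesIO(b'foo bar baz')

s3 = boto3.client('s3')
s3.upload_fileobj(
    stream, 'mybucket', 'foo',
    # Stored as user metadata (x-amz-meta-source-md5), independent of the
    # multipart ETag.
    ExtraArgs={'Metadata': {'source-md5': precomputed_md5}},
    Config=TransferConfig(multipart_chunksize=16 * 1024 * 1024),
)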

timuralp commented 7 years ago

I did consider that; however, object metadata does not show up in bucket listings (which only return the ETag), so retrieving the hash requires a HEAD request for each object.
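Concretely (placeholder names, continuing the 'source-md5' metadata key assumed above):

import boto3

s3 = boto3.client('s3')

# list_objects_v2 does not return user metadata, so reading the stored hash
# costs one HEAD request per object on top of the listing.
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='mybucket'):
    for obj in page.get('Contents', []):
        head = s3.head_object(Bucket='mybucket', Key=obj['Key'])
        stored_md5 = head['Metadata'].get('source-md5')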

JordonPhillips commented 7 years ago

So to close the loop here:

I don't think that we can safely support non-seekable streams outside of the transfer manager. There are three points of failure:

  1. Rewinding the stream so we can retry.
  2. Calculating content-length, which is often required by service teams.
  3. Any customization that requires reading the object, such as generating the content-md5.

2 and 3 can often be worked around by having users provide pre-computed values, but 1 is not so simple. We just don't have enough knowledge about the stream to know how to reset it (if it can even be reset) without having buffered it all in the first place. If we just let it through anyway and accept that retries will fail then we would potentially be destroying data that can't be generated again. That isn't a risk I'm comfortable accepting.

Is there a way to exercise this issue outside of using non-seekable streams? If not, then I would rather have it fail during development so customers don't rely on it working in production.

timuralp commented 7 years ago

If we just let it through anyway and accept that retries will fail then we would potentially be destroying data that can't be generated again.

I disagree that this is a customer-facing issue. If a library consumer provides a non-seekable stream, surely they don't expect boto to buffer the content in memory (or write it out to any kind of durable storage). I don't think this is a "risk" as far as the library users are concerned.

It does seem like no one else is trying to use boto to stream data from another system into S3; otherwise, I'd imagine this would have come up from other users. I'm going to close this issue, as I can work around it with a file-like wrapper; however, not supporting streaming an object (without using multipart uploads) is surprising.

gaul commented 7 years ago

For what it is worth, Apache jclouds allows streaming InputStream payloads and callers must provide Content-Length. jclouds cannot retry failed streaming requests and propagates the error to the caller for the application to handle. S3Proxy relies on this functionality.