Closed: timuralp closed this issue 7 years ago.
Looks like this issue arises when v2 signatures are used, but v4 signatures allow PUT requests to use chunked uploads.
I wanted to expand on this comment. Since the v2 signer does not support chunked PUT requests, boto should never set that header when v2 signatures are used. That means that at the time of preparing the request body, there has to be some consideration of which signer is in use, and possibly an exception raised if the signer is v2 but the Content-Length header is not set.
Any comment on this? Do you folks believe it's not a bug, or that it's not worth fixing because it only affects the v2 signer?
I was able to reproduce with the following:
```python
import io

import botocore.session
from botocore.config import Config


class NonSeekableStream(object):
    def __init__(self, contents):
        self._body = io.BytesIO(contents)

    def __iter__(self):
        return self._body.__iter__()

    def read(self, limit=-1):
        return self._body.read(limit)


sess = botocore.session.Session()
sess.set_debug_logger()
config = Config(s3={'signature_version': 's3'})
client = sess.create_client('s3', config=config)

body = NonSeekableStream(b'foo bar baz')
client.put_object(
    Bucket='mybucket',
    Key='foo',
    Body=body,
    ContentLength=11,
    ContentMD5='ab07acbb1e496801937adfa772424bf7',
)
```
This failure hinges on the object being non-seekable, which we don't support (docs). Even if we removed the transfer encoding, other layers would fail (such as calculating the body checksum) due to the lack of seek and tell.
As an alternative, you can use upload_fileobj, which does support non-seekable files. It does this by buffering data until you hit your configured multipart threshold, at which point it uploads the object in multiple parts whose size you can also configure.
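Under the hood, that buffering decision might look roughly like the following sketch. The names `read_until_threshold` and `choose_strategy` are illustrative only, not part of the boto3/s3transfer API:

```python
import io

# Hypothetical sketch of the buffering strategy for non-seekable streams:
# read up to the multipart threshold first, and only fall back to
# multipart uploads if the stream holds more data than that.

def read_until_threshold(stream, threshold):
    """Buffer up to threshold + 1 bytes so we can tell whether the
    stream fits in a single PUT without seeking."""
    buffered = stream.read(threshold + 1)
    fits_in_one_put = len(buffered) <= threshold
    return buffered, fits_in_one_put

def choose_strategy(stream, threshold):
    buffered, fits = read_until_threshold(stream, threshold)
    if fits:
        return 'put_object', buffered
    # Otherwise the buffered prefix becomes the first part(s) and the
    # remainder of the stream is read part-by-part.
    return 'multipart', buffered

small = io.BytesIO(b'foo bar baz')  # 11 bytes, under the threshold
strategy, data = choose_strategy(small, threshold=1024)
print(strategy)  # -> put_object
```

The key point is that only up to one threshold's worth of data is ever held in memory before committing to a strategy.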
I'm surprised to see a comment on this issue!
Correct -- I did not have to include this patch once I wrapped the incoming stream in an object that has a seek() method. I worked around the MD5 checksum requirement by disabling it (in my use case, the MD5 hash is already pre-computed for the upload, and computing it twice does not make sense). After disabling the MD5 check, the only time seek() is called is on retry, with seek(0), which I can handle by reopening the incoming stream (or boto could skip the retry if seek() is not implemented). Having explored this, I don't think there are that many layers that fail, but I understand the reluctance to fix this issue.
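A minimal sketch of that wrapper, assuming the caller can supply a factory that re-opens the source stream from the beginning (`ReopenableStream` is a hypothetical name, not a boto interface):

```python
import io

# Illustrative sketch of the workaround described above: wrap a
# non-seekable stream in an object that supports seek(0) by re-opening
# the source via a caller-supplied factory.

class ReopenableStream(object):
    def __init__(self, open_stream):
        # open_stream() must return a fresh stream positioned at byte 0.
        self._open_stream = open_stream
        self._body = open_stream()
        self._pos = 0

    def read(self, limit=-1):
        data = self._body.read(limit)
        self._pos += len(data)
        return data

    def seek(self, offset, whence=0):
        # Retries only ever ask for seek(0); anything else is unsupported.
        if offset != 0 or whence != 0:
            raise io.UnsupportedOperation('only seek(0) is supported')
        self._body = self._open_stream()
        self._pos = 0
        return 0

    def tell(self):
        return self._pos

stream = ReopenableStream(lambda: io.BytesIO(b'foo bar baz'))
first = stream.read()
stream.seek(0)  # simulate a retry
assert stream.read() == first
```

For a real remote source, the factory would re-issue the GET to the other storage system instead of re-creating a BytesIO.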
The upload_fileobj suggestion does not seem like something I'd want to use, however. The stream is non-seekable because it comes from another storage system. Using multipart uploads for it is problematic, because the ETag computation then differs from the MD5 hash already provided. Hence, for smaller objects I would want to disable the MPU behavior (by setting the threshold to the maximum object size), but buffering the entire object in memory would not work very well.
@timuralp Since you have the MD5 beforehand, you could send it up as metadata on the object. That would give you access to it later on while allowing you to take advantage of multipart uploading.
I did consider that; however, object metadata does not show up in bucket listings, which means a HEAD request is required to retrieve the hash.
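For reference, the metadata approach suggested above might look like the following. The `content-md5` metadata key is an arbitrary choice, and the actual upload call requires boto3 and credentials, so it is shown only as a comment:

```python
import hashlib

# Compute the MD5 up front and attach it as user metadata, so it
# survives even when the multipart ETag no longer matches the MD5.

body = b'foo bar baz'
md5_hex = hashlib.md5(body).hexdigest()
extra_args = {'Metadata': {'content-md5': md5_hex}}

# The upload itself would then look something like:
#   s3.upload_fileobj(stream, 'mybucket', 'foo', ExtraArgs=extra_args)

print(extra_args['Metadata']['content-md5'])
```

As noted in the reply, the trade-off is that this metadata is only visible via a HEAD or GET on the object, not in bucket listings.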
So to close the loop here:
I don't think that we can safely support non-seekable streams outside of the transfer manager. There are three points of failure:

1. Rewinding the body so that a failed request can be retried.
2. Computing the body's MD5 checksum.
3. Determining the Content-Length.
2 and 3 can often be worked around by having users provide pre-computed values, but 1 is not so simple. We just don't have enough knowledge about the data to know how to reset the data (if it even can be reset) without having buffered it all in the first place. If we just let it through anyway and accept that retries will fail then we would potentially be destroying data that can't be generated again. That isn't a risk I'm comfortable accepting.
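The retry problem can be demonstrated without S3 at all: once a non-seekable stream has been drained by the first attempt, a retry has nothing left to send, and the data cannot be regenerated. A minimal illustration:

```python
import io

# Demonstrates why retries fail on a non-seekable stream: the first
# attempt consumes the body, leaving nothing for the second attempt.

class NonSeekableStream(object):
    def __init__(self, contents):
        self._body = io.BytesIO(contents)

    def read(self, limit=-1):
        return self._body.read(limit)

stream = NonSeekableStream(b'foo bar baz')
first_attempt = stream.read()    # request is sent, then fails mid-flight
second_attempt = stream.read()   # retry: the stream is already drained

print(first_attempt)   # -> b'foo bar baz'
print(second_attempt)  # -> b''
```

A retry issued with the empty second read would silently upload a zero-byte object, which is the data-destruction risk described above.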
Is there a way to exercise this issue outside of using non-seekable streams? If not, then I would rather have it fail during development so customers don't rely on it working in production.
If we just let it through anyway and accept that retries will fail then we would potentially be destroying data that can't be generated again.
I disagree that this is a customer-facing issue. If a library consumer provides a non-seekable stream, surely they don't expect boto to buffer the content in memory (or write it out to any kind of durable storage). I don't think this is a "risk" as far as the library users are concerned.
It does seem like no one else is trying to use boto to stream data from another system into S3; otherwise I'd imagine this would come up from other users. I'm going to close this issue, as I can work around it with a file-like wrapper; however, not supporting streaming an object (without using multipart uploads) is surprising.
I'm encountering an issue where uploading a file and setting the ContentLength header in the put_object() call still results in Transfer-Encoding being set to chunked. This, in turn, causes the upload to fail with the following error:

I believe the issue is in the botocore/vendored/requests/models.py file, specifically:

My invocation is as follows:

and I expect it to set the Content-Length header and not use chunked encoding. I'll submit a patch to consider the contents of the headers when determining which encoding to use (which also unblocks my usage of boto3).
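The proposed patch boils down to consulting the headers before falling back to chunked encoding. A hedged sketch of that decision follows; `prepare_body_headers` is illustrative only, not the actual code in botocore/vendored/requests/models.py:

```python
# Sketch of the proposed fix: when the caller has already supplied a
# Content-Length header, prefer it over chunked Transfer-Encoding even
# if the body's length cannot be introspected.

def prepare_body_headers(body_length, headers):
    headers = dict(headers)
    if body_length is None and 'Content-Length' not in headers:
        # Length genuinely unknown: fall back to chunked encoding.
        headers['Transfer-Encoding'] = 'chunked'
    elif body_length is not None:
        headers.setdefault('Content-Length', str(body_length))
    return headers

# Caller-supplied Content-Length wins over chunked encoding:
print(prepare_body_headers(None, {'Content-Length': '11'}))
# With no length information at all, chunked encoding is used:
print(prepare_body_headers(None, {}))
```

This mirrors the report above: chunked encoding should only be chosen when neither the body nor the headers provide a length.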