boto / botocore

The low-level, core functionality of boto3 and the AWS CLI.

Using Generated Presigned URLs with CRC32C checksums results in 400 from S3 #3216

Open richardnpaul opened 3 weeks ago

richardnpaul commented 3 weeks ago

Describe the bug

When uploading a large object to S3 using the multipart upload process, with presigned URLs and CRC32C checksums, S3 responds with a 400 error and the message shown below.

Expected Behavior

I would expect the provided checksum headers to be accepted as part of the signed request, so that the expected checksum type would be CRC32C rather than null and the upload to S3 would succeed.

Current Behavior

The following type of error message is returned instead of success:

Failed to upload part, status: 400, response: <?xml version="1.0" encoding="UTF-8"?>
<Error><Code>InvalidRequest</Code><Message>Checksum Type mismatch occurred, expected checksum Type: null, actual checksum Type: crc32c</Message><RequestId>SOMEREQID</RequestId><HostId>SOME/HOSTID</HostId></Error>

Reproduction Steps

Replace the AWS credential placeholders with valid values for your testing and point the testfile assignment at a local file of 10 MB or more (I was using a path in ~/Downloads/).

#!/usr/bin/env python3
import base64
import pathlib
from zlib import crc32

import boto3
import requests

# AWS credentials
access_key_id = 'access_key_here'
secret_access_key = 'secret_key_here'
aws_session_token = 'session_token_here'
region = 'region_here'
bucket_name = 'bucket_name_here'
object_key = 'prefix_here/object_key_here'

# Create a session using your AWS credentials
session = boto3.Session(
    aws_access_key_id=access_key_id,
    aws_secret_access_key=secret_access_key,
    aws_session_token=aws_session_token,
)

# Create an S3 client with the specified region
s3_client = session.client('s3', region_name=region)

# Initialize a multipart upload
response = s3_client.create_multipart_upload(
    Bucket=bucket_name,
    Key=object_key
)
upload_id = response['UploadId']

part_number = 1
chunk_size = 10 * 1024 * 1024  # 10 MB

testfile = pathlib.Path('file 10MB or greater in size here').expanduser()

with open(testfile, 'rb') as f:
    content = f.read(chunk_size)

# Calculate ChecksumCRC32C (I'm not 100% certain about this; we normally use the crc32c package)
checksum_crc32c = base64.b64encode(crc32(content).to_bytes(4, byteorder='big')).decode('utf-8')
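
# Note (illustrative sketch, not part of the original repro): zlib.crc32 computes
# CRC32, not CRC32C. Our real code uses the third-party `crc32c` package, roughly:
#   import crc32c
#   checksum_crc32c = base64.b64encode(
#       crc32c.crc32c(content).to_bytes(4, byteorder='big')
#   ).decode('utf-8')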

# Generate the presigned URL
presigned_url = s3_client.generate_presigned_url(
    'upload_part',
    Params={
        'Bucket': bucket_name,
        'Key': object_key,
        'PartNumber': part_number,
        'UploadId': upload_id,
        'ChecksumCRC32C': checksum_crc32c,
        'ChecksumAlgorithm': 'CRC32C',  # Added after posting, following feedback from Tim
    },
    ExpiresIn=3600
)

headers = {
    'Content-Length': str(len(content)),
    'x-amz-checksum-crc32c': checksum_crc32c,
    'Content-Type': 'application/octet-stream',
}

response = requests.put(presigned_url, data=content, headers=headers)

if response.status_code == 200:
    print("Part uploaded successfully!")
else:
    print(f"Failed to upload part, status: {response.status_code}, response: {response.text}")

Possible Solution

I suspect the checksum header is not being passed through to be included in the signing process, but to be honest I got a bit lost in the library's code and couldn't make head nor tail of it in the end.

Additional Information/Context

Docs page for generating the URLs: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/generate_presigned_url.html
Docs page with acceptable params to be passed to generate_presigned_url when using upload_part as the ClientMethod: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/upload_part.html

SDK version used

1.34.138

Environment details (OS name and version, etc.)

Ubuntu 22.04.4, Python 3.10.12

tim-finnigan commented 2 weeks ago

Thanks for reaching out. In your upload_part request, have you tried setting ChecksumAlgorithm to CRC32C and specifying a string for ChecksumCRC32C? You could also try another approach, like using put_object, although it was noted that installing the CRT was required. Otherwise, if you want to share your debug logs (with any sensitive info redacted) by adding boto3.set_stream_logger('') to your script, we could investigate this further.
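
For reference, a minimal sketch of turning on those debug logs (be sure to redact credentials and signatures before sharing the output):

```python
import boto3

# Enables DEBUG-level logging for all boto3/botocore loggers; the signing
# details (CanonicalRequest / StringToSign) show up in this output.
boto3.set_stream_logger('')

s3_client = boto3.client('s3')
# ... make the failing generate_presigned_url / PUT calls as usual ...
```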

richardnpaul commented 2 weeks ago

Hi @tim-finnigan, thanks for getting back in touch so quickly.

We did try setting ChecksumAlgorithm to CRC32C, which I believe then required setting the x-amz-sdk-checksum-algorithm header, but we got an error with that method too (I'll need to check the docs again; we were following the notes for the REST API rather than the SDK, and I'll need to confirm with the person who was testing this with me tomorrow). The code for this is abstracted behind a set of APIs and a calling CLI (not Python based) installed by our end users.

Our workflow is this: the CLI calls an initiate endpoint to start an upload. On success, the CLI calls a generate-pre-signed-URLs endpoint, which takes the parts and their checksums and returns the part numbers with pre-signed URLs for those parts (this is the call that uses generate_presigned_url with the upload_part client method). The CLI then uses the pre-signed URLs to PUT the file parts directly to S3 with the CRC32C checksum in the header, and once that's complete it calls a complete endpoint, submitting the parts, ETags and CRC32C checksums, along the lines of the sketch below.
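
As a rough sketch (the names here are illustrative, not our actual API), the complete step ends up looking something like this, assuming `parts` holds what the CLI reports back after the presigned PUTs:

```python
# Sketch only: `parts` is assumed to contain, for each part, the part number,
# the ETag returned by the presigned PUT, and the CRC32C checksum we computed.
s3_client.complete_multipart_upload(
    Bucket=bucket_name,
    Key=object_key,
    UploadId=upload_id,
    MultipartUpload={
        'Parts': [
            {
                'PartNumber': p['PartNumber'],
                'ETag': p['ETag'],
                'ChecksumCRC32C': p['ChecksumCRC32C'],
            }
            for p in parts
        ]
    },
)
```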

With the description above out of the way: put_object is not suitable for our workflow because the end users go through the CLI package, which is also why we need pre-signed URLs. Sorry for any confusion that might have led you to suggest it; the code above was just minimal boilerplate to reproduce the issue we were seeing.

I will note that we do already have awscrt as part of our dependency chain.

We have run this through successfully by removing the checksums entirely, so worst case we could fall back to the historic ContentMD5 approach. However, we were hoping to use the same approach that we use for smaller single-part uploads, which relies on presigned POST and which we do have working with CRC32C checksums (a sketch follows below). I'm well aware that we seem to be on the outer fringes of what boto is designed for here, so all help is greatly appreciated.
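
For comparison, the presigned POST path that does work for us with CRC32C looks roughly like this (a sketch with placeholder values, not our production code):

```python
# Rough sketch of the working single-part path: the checksum goes in both the
# form fields and the policy conditions so that it is covered by the signature.
post = s3_client.generate_presigned_post(
    Bucket=bucket_name,
    Key=object_key,
    Fields={'x-amz-checksum-crc32c': checksum_crc32c},
    Conditions=[{'x-amz-checksum-crc32c': checksum_crc32c}],
    ExpiresIn=3600,
)
# The CLI then POSTs the file along with post['fields'] to post['url'].
```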

richardnpaul commented 2 weeks ago

I've done some testing today; here's a table of what I get back from the PUT to S3. I tested every combination of ChecksumAlgorithm and ChecksumCRC32C on the upload_part (presign) side against x-amz-checksum-crc32c and x-amz-sdk-checksum-algorithm on the PUT headers side (passing Content-Type and/or Content-Length as well did not change the results):

| ▼ PUT headers \ presign params ► | Nothing | ChecksumCRC32C only | ChecksumAlgorithm only | Both |
| --- | --- | --- | --- | --- |
| Nothing | 200 | 403: SignatureDoesNotMatch *1 | 403: SignatureDoesNotMatch *1 | 403: SignatureDoesNotMatch *1 |
| x-amz-checksum-crc32c | 403: AccessDenied *2 | 400: InvalidRequest *3 | 403: AccessDenied *2 | 403: SignatureDoesNotMatch *1 |
| x-amz-sdk-checksum-algorithm | 403: AccessDenied *2 | 403: AccessDenied *2 | 400: InvalidRequest *4 | 403: SignatureDoesNotMatch *1 |
| Both | 403: AccessDenied *2 | 403: AccessDenied *2 | 403: AccessDenied *2 | 400: InvalidRequest *3 |

- *1: The request signature we calculated does not match the signature you provided. Check your key and signing method.
- *2: There were headers present in the request which were not signed
- *3: Checksum Type mismatch occurred, expected checksum Type: null, actual checksum Type: crc32c
- *4: x-amz-sdk-checksum-algorithm specified, but no corresponding x-amz-checksum-* or x-amz-trailer headers were found.

tim-finnigan commented 2 weeks ago

Hi @richardnpaul, thanks for following up here. Going back to your original snippet, you are using CRC32 and not CRC32C (from zlib import crc32). It looks like there are no plans to support CRC32C in zlib: https://github.com/madler/zlib/issues/981. Have you tried any alternatives that support CRC32C?

For using CRC32 I tested this and it works for me:

```python3
import boto3
import requests
from zlib import crc32
import base64
import pathlib

bucket_name = 'test-bucket'
object_key = 'test'

s3_client = boto3.client('s3')

response = s3_client.create_multipart_upload(
    Bucket=bucket_name,
    Key=object_key
)
upload_id = response['UploadId']

part_number = 1
chunk_size = 10 * 1024 * 1024  # 10 MB

testfile = pathlib.Path('./11-mb-file.txt').expanduser()

parts = []

with open(testfile, 'rb') as f:
    while True:
        content = f.read(chunk_size)
        if not content:
            break

        checksum_crc32 = base64.b64encode(crc32(content).to_bytes(4, byteorder='big')).decode('utf-8')

        presigned_url = s3_client.generate_presigned_url(
            'upload_part',
            Params={
                'Bucket': bucket_name,
                'Key': object_key,
                'PartNumber': part_number,
                'UploadId': upload_id,
                'ChecksumCRC32': checksum_crc32,
                'ChecksumAlgorithm': 'CRC32',
            },
            ExpiresIn=3600
        )

        response = requests.put(presigned_url, data=content)

        if response.status_code == 200:
            print(f"Part {part_number} uploaded successfully!")
            parts.append({
                'PartNumber': part_number,
                'ETag': response.headers['ETag']
            })
        else:
            print(f"Failed to upload part {part_number}, status: {response.status_code}, response: {response.text}")
            break

        part_number += 1

if len(parts) == part_number - 1:
    s3_client.complete_multipart_upload(
        Bucket=bucket_name,
        Key=object_key,
        UploadId=upload_id,
        MultipartUpload={
            'Parts': parts
        }
    )
    print("Multipart upload completed successfully!")
else:
    s3_client.abort_multipart_upload(
        Bucket=bucket_name,
        Key=object_key,
        UploadId=upload_id
    )
    print("Multipart upload failed and has been aborted.")
```

richardnpaul commented 2 weeks ago

Hi Tim,

Okay, so yes, as noted in my initial comment, we normally use the crc32c package, but here we're just trying to confirm that the checksums work at all, so it doesn't matter which algorithm we use as long as the value is valid.

I've taken your code and made a couple of changes: I added aws_access_key_id etc. to the s3_client instantiation and changed the bucket name, object key and testfile variables, but otherwise changed nothing else... and I got an error: Failed to upload part 1, status: 403, response: <?xml version="1.0" encoding="UTF-8"?>, which turned out to be a SignatureDoesNotMatch response: The request signature we calculated does not match the signature you provided. Check your key and signing method.

I had the bucket deployed in eu-west-2, so I created a bucket in another region, eu-west-1, to see if the issue persisted. After initially thinking that it did, and working through some issues, I changed all the region references in my .aws/config file from eu-west-2 to eu-west-1, and we have success... but not in the region that I'm trying to use :disappointed: _(I realised shortly afterwards that I could have just passed region_name="eu-west-1" to the S3 client so that I didn't have to change my config file :facepalm:)_
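
For completeness, that alternative is just a one-liner (sketch):

```python
# Pin the client to the bucket's region instead of editing ~/.aws/config
s3_client = boto3.client('s3', region_name='eu-west-1')
```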

So, at this point I'm not sure if this is a botocore/boto3 issue or an AWS infrastructure issue :thinking: (...or something else)

richardnpaul commented 2 weeks ago

Just some additional information: adding an explicit v4 signature_version via botocore.config results in the same error in both eu-west-1 and eu-west-2:

from botocore.config import Config

my_config = Config(signature_version = 'v4')

s3_client = boto3.client('s3', config=my_config)