epam / cloud-pipeline

Cloud agnostic genomics analysis, scientific computation and storage platform
Apache License 2.0

pipe CLI: enable checksum calculation for S3 #3512

Open sidoruka opened 2 months ago

sidoruka commented 2 months ago

pipe CLI shall be capable of:

Details: https://aws.amazon.com/blogs/aws/new-additional-checksum-algorithms-for-amazon-s3/

ekazachkova commented 2 months ago

According to the AWS doc: "A pre-calculated checksum value provided with the request disables automatic computation by the SDK and uses the provided value instead."

Let's look at this approach using the CRC32 algorithm as an example.


Pre-calculated checksum


import base64
import zlib

filepath = '<filepath>'
with open(filepath, 'rb') as f:
    crc_raw = zlib.crc32(f.read()) 

crc_bytes = crc_raw.to_bytes(4, 'big')
crc_base64 = base64.b64encode(crc_bytes).decode('utf-8')


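# Variant using struct, for interpreters where int.to_bytes is unavailable (Python 2)
import base64
import zlib
import struct

filepath = '<filepath>'
with open(filepath, 'rb') as stream:
    # Mask to an unsigned 32-bit value: zlib.crc32 may return a signed int on Python 2
    crc_raw = zlib.crc32(stream.read()) & 0xffffffff

# '>I' packs an unsigned big-endian 32-bit integer ('>i' fails for values above 2**31 - 1)
crc_bytes = struct.pack('>I', crc_raw)
crc_base64 = base64.b64encode(crc_bytes).decode('utf-8')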
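
The snippets above read the whole file into memory before computing the checksum. For large files, the same value can be computed in chunks by passing the running CRC back into zlib.crc32. This is a minimal sketch; the function name crc32_base64 and the chunk_size default are illustrative and not part of pipe CLI.

```python
import base64
import struct
import zlib


def crc32_base64(filepath, chunk_size=1024 * 1024):
    """Return the base64-encoded big-endian CRC32 of a file, read in chunks."""
    crc = 0
    with open(filepath, 'rb') as stream:
        # iter() with a sentinel stops once read() returns an empty bytes object
        for chunk in iter(lambda: stream.read(chunk_size), b''):
            # The second argument is the running checksum from previous chunks
            crc = zlib.crc32(chunk, crc)
    # Mask to an unsigned 32-bit value so the result is interpreter-independent
    crc &= 0xffffffff
    return base64.b64encode(struct.pack('>I', crc)).decode('utf-8')
```

The returned string has the same format as crc_base64 in the snippets above.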

Upload file using AWS CLI

aws s3api put-object --bucket <bucket> --key <key> --checksum-crc32 "<crc_base64>" --body "<filepath>"
    "ChecksumCRC32": "<crc_base64>",

For example, to support this in pipe CLI we need to add the header 'x-amz-checksum-crc32': '<crc_base64>' to the put_object request.
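
As a rough sketch of what that could look like with boto3: recent boto3 versions accept a ChecksumCRC32 argument on put_object, which is sent as the x-amz-checksum-crc32 header. The helper name upload_with_crc32 is hypothetical, and `client` is assumed to be a boto3 S3 client whose version is new enough to know this parameter; this is not the actual pipe CLI implementation.

```python
import base64
import struct
import zlib


def upload_with_crc32(client, bucket, key, filepath):
    """Upload a file with a pre-calculated CRC32 checksum (illustrative helper).

    `client` is assumed to be a boto3 S3 client, e.g. boto3.client('s3').
    """
    with open(filepath, 'rb') as stream:
        body = stream.read()
    crc = zlib.crc32(body) & 0xffffffff
    crc_base64 = base64.b64encode(struct.pack('>I', crc)).decode('utf-8')
    # ChecksumCRC32 maps to the x-amz-checksum-crc32 request header;
    # S3 fails the request if the value does not match the uploaded data.
    return client.put_object(Bucket=bucket, Key=key, Body=body,
                             ChecksumCRC32=crc_base64)
```

Older boto3 versions reject unknown parameters, so they would need a different way to attach the header.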

Check file checksum

aws s3api get-object-attributes --bucket <bucket> --key <key> --object-attributes Checksum 
    "Checksum": {
        "ChecksumCRC32": "<crc_base64>"

or using a head-object request

aws s3api head-object --bucket <bucket> --key <key> --checksum-mode ENABLED
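
The same check could be done from Python by comparing a locally computed CRC32 with the value S3 returns. This is a sketch under assumptions: verify_crc32 is a hypothetical helper, `client` is assumed to be a boto3 S3 client that supports ChecksumMode on head_object, and the object is assumed to have been uploaded with a single put_object request and a CRC32 checksum.

```python
import base64
import struct
import zlib


def verify_crc32(client, bucket, key, filepath, chunk_size=1024 * 1024):
    """Return True if the local file's CRC32 matches the one stored by S3."""
    crc = 0
    with open(filepath, 'rb') as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b''):
            crc = zlib.crc32(chunk, crc)
    local = base64.b64encode(struct.pack('>I', crc & 0xffffffff)).decode('utf-8')
    # ChecksumMode='ENABLED' asks S3 to include stored checksums in the response
    response = client.head_object(Bucket=bucket, Key=key, ChecksumMode='ENABLED')
    # The key is absent when the object was stored without a CRC32 checksum
    remote = response.get('ChecksumCRC32')
    return remote is not None and remote == local
```

A missing checksum is treated as a failed verification here; a caller may prefer to handle that case separately.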

Multipart upload:

Implementation steps for old boto3

