epam / cloud-pipeline

Cloud agnostic genomics analysis, scientific computation and storage platform
https://cloud-pipeline.com
Apache License 2.0

pipe CLI: enable checksum calculation for S3 #3512

Open sidoruka opened 2 months ago

sidoruka commented 2 months ago

pipe CLI shall be capable of calculating additional checksums (e.g. CRC32) for S3 objects.

Details: https://aws.amazon.com/blogs/aws/new-additional-checksum-algorithms-for-amazon-s3/

ekazachkova commented 2 months ago

According to the AWS doc: "A pre-calculated checksum value provided with the request disables automatic computation by the SDK and uses the provided value instead."

Let's look at this approach using the CRC32 algorithm as an example.

Prerequisite

Pre-calculated checksum

Python3

import base64
import zlib

filepath = '<filepath>'
# CRC32 of the whole file content (an unsigned int in Python 3)
with open(filepath, 'rb') as f:
    crc_raw = zlib.crc32(f.read())

# S3 expects the checksum as the big-endian 4-byte value, base64-encoded
crc_bytes = crc_raw.to_bytes(4, 'big')
crc_base64 = base64.b64encode(crc_bytes).decode('utf-8')

Python2

import base64
import zlib
import struct

filepath = '<filepath>'
# in Python 2 zlib.crc32 may return a signed value
with open(filepath, 'rb') as stream:
    crc_raw = zlib.crc32(stream.read())

# pack as a big-endian 4-byte value and base64-encode, as S3 expects
crc_bytes = struct.pack('>i', crc_raw)
crc_base64 = base64.b64encode(crc_bytes).decode('utf-8')

Upload file using AWS CLI

aws s3api put-object --bucket <bucket> --key <key> --checksum-crc32 "<crc_base64>" --body "<filepath>"
Response:
{
    ...,
    "ChecksumCRC32": "<crc_base64>",
    ...
}

For example, to support this in pipe CLI we would need to add the header 'x-amz-checksum-crc32': '<crc_base64>' to the put_object request.
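
A minimal sketch of what this could look like with boto3 (assuming a boto3/botocore version that already exposes the ChecksumCRC32 parameter of put_object, which is sent as the x-amz-checksum-crc32 header; bucket, key and file path are placeholders):

import base64
import zlib

import boto3

filepath = '<filepath>'
bucket = '<bucket>'
key = '<key>'

# Pre-calculate the checksum exactly as in the prerequisite snippet above
with open(filepath, 'rb') as f:
    crc_base64 = base64.b64encode(
        zlib.crc32(f.read()).to_bytes(4, 'big')).decode('utf-8')

s3 = boto3.client('s3')
with open(filepath, 'rb') as body:
    response = s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        # a pre-calculated value disables automatic computation by the SDK
        # and is sent to S3 as the x-amz-checksum-crc32 header
        ChecksumCRC32=crc_base64,
    )

# S3 echoes the checksum back if it accepted the object
print(response.get('ChecksumCRC32'))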

Check file checksum

aws s3api get-object-attributes --bucket <bucket> --key <key> --object-attributes Checksum 
Response:
{
    ...,
    "Checksum": {
        "ChecksumCRC32": "<crc_base64>"
    }
}

or using a head object request

aws s3api head-object --bucket <bucket> --key <key> --checksum-mode ENABLED
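
The same check could be done from Python; a minimal sketch, assuming a boto3 version recent enough to provide get_object_attributes and the ChecksumMode parameter (bucket and key are placeholders):

import boto3

bucket = '<bucket>'
key = '<key>'

s3 = boto3.client('s3')

# Option 1: GetObjectAttributes with the Checksum attribute
attrs = s3.get_object_attributes(
    Bucket=bucket,
    Key=key,
    ObjectAttributes=['Checksum'],
)
print(attrs['Checksum']['ChecksumCRC32'])

# Option 2: HeadObject with checksum mode enabled
head = s3.head_object(Bucket=bucket, Key=key, ChecksumMode='ENABLED')
print(head.get('ChecksumCRC32'))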

Multipart upload:
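
According to the AWS announcement linked above, for multipart uploads S3 calculates a checksum for every part and then a "checksum of checksums" for the completed object, so the part-level values have to be supplied (or computed by the SDK) during the upload. A minimal sketch with a checksum-aware boto3; the 8 MB part size is a hypothetical value:

import base64
import zlib

import boto3

bucket = '<bucket>'
key = '<key>'
filepath = '<filepath>'
part_size = 8 * 1024 * 1024  # hypothetical part size

s3 = boto3.client('s3')

# Declare the checksum algorithm for the whole multipart upload
mpu = s3.create_multipart_upload(Bucket=bucket, Key=key, ChecksumAlgorithm='CRC32')
upload_id = mpu['UploadId']

parts = []
part_number = 1
with open(filepath, 'rb') as f:
    while True:
        chunk = f.read(part_size)
        if not chunk:
            break
        # Pre-calculate the part-level checksum the same way as for put_object
        part_crc = base64.b64encode(
            zlib.crc32(chunk).to_bytes(4, 'big')).decode('utf-8')
        resp = s3.upload_part(
            Bucket=bucket,
            Key=key,
            UploadId=upload_id,
            PartNumber=part_number,
            Body=chunk,
            ChecksumCRC32=part_crc,
        )
        parts.append({
            'PartNumber': part_number,
            'ETag': resp['ETag'],
            'ChecksumCRC32': part_crc,
        })
        part_number += 1

result = s3.complete_multipart_upload(
    Bucket=bucket,
    Key=key,
    UploadId=upload_id,
    MultipartUpload={'Parts': parts},
)
# S3 reports the composite ("checksum of checksums") value for the object
print(result.get('ChecksumCRC32'))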

Implementation steps for old boto3

Upload:
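
One possible approach (an assumption, not confirmed in this issue): if the old boto3/botocore used by pipe CLI does not accept the ChecksumCRC32 parameter, the raw x-amz-checksum-crc32 header could be injected through botocore's event system before the request is signed, since x-amz-* headers have to be covered by the SigV4 signature. The _add_crc32_header helper below is hypothetical:

import base64
import struct
import zlib

import boto3

filepath = '<filepath>'
bucket = '<bucket>'
key = '<key>'

# Works on both Python 2 and 3: mask to unsigned, pack big-endian, base64-encode
with open(filepath, 'rb') as f:
    crc_base64 = base64.b64encode(
        struct.pack('>I', zlib.crc32(f.read()) & 0xffffffff)).decode('utf-8')

s3 = boto3.client('s3')

def _add_crc32_header(request, **kwargs):
    # S3 validates the provided value server-side and rejects the upload
    # if it does not match the received data
    request.headers['x-amz-checksum-crc32'] = crc_base64

# Register on before-sign so the header is included in the SigV4 signature
s3.meta.events.register('before-sign.s3.PutObject', _add_crc32_header)

with open(filepath, 'rb') as body:
    s3.put_object(Bucket=bucket, Key=key, Body=body)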

Download: