ACED-IDP / gen3_util

Collection of command line tools to interact with a Gen3 instance
MIT License

Feature Request: add support for additional file hashes (e.g. etag) #68

Closed lbeckman314 closed 5 months ago

lbeckman314 commented 8 months ago

Background

The indexd service accepts several hash types when importing files, including etags:

import re

ACCEPTABLE_HASHES = {
    "md5": re.compile(r"^[0-9a-f]{32}$").match,
    "sha1": re.compile(r"^[0-9a-f]{40}$").match,
    "sha256": re.compile(r"^[0-9a-f]{64}$").match,
    "sha512": re.compile(r"^[0-9a-f]{128}$").match,
    "crc": re.compile(r"^[0-9a-f]{8}$").match,
    "etag": re.compile(r"^[0-9a-f]{32}(-\d+)?$").match,
}
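
To illustrate how a client such as g3t could check a hash before submitting it, here is a minimal sketch that reuses the table above. The `validate_hashes` helper and its error messages are hypothetical and not part of indexd or gen3_util; only the `ACCEPTABLE_HASHES` table comes from the snippet quoted in this issue.

```python
import re

# Table as quoted above; every pattern is anchored at both ends.
ACCEPTABLE_HASHES = {
    "md5": re.compile(r"^[0-9a-f]{32}$").match,
    "sha1": re.compile(r"^[0-9a-f]{40}$").match,
    "sha256": re.compile(r"^[0-9a-f]{64}$").match,
    "sha512": re.compile(r"^[0-9a-f]{128}$").match,
    "crc": re.compile(r"^[0-9a-f]{8}$").match,
    "etag": re.compile(r"^[0-9a-f]{32}(-\d+)?$").match,
}


def validate_hashes(hashes):
    """Hypothetical helper: return error messages for a {hash_type: value} mapping.

    An empty list means every entry matched its pattern.
    """
    errors = []
    for hash_type, value in hashes.items():
        matcher = ACCEPTABLE_HASHES.get(hash_type)
        if matcher is None:
            errors.append(f"unsupported hash type: {hash_type}")
        elif not matcher(value):
            errors.append(f"invalid {hash_type} value: {value}")
    return errors
```

Under these patterns an etag is 32 lowercase hex characters, optionally followed by `-` and a decimal number, so a value would need to have that shape to pass this check.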

Current Behavior

Currently, the g3t command requires a file's md5 hash before the file can be uploaded to the indexd service. When this hash is not available (e.g. when importing files from an existing S3 endpoint), downloading the file and calculating its md5 hash can take a long time.

New Behavior

Adding support for additional hashes like etag would allow for greater efficiency when uploading files where the md5 hash is not immediately available or not yet calculated.

For remote files already stored in an S3 bucket, the etag can be fetched with the MinIO client as follows:

➜ mc stat -r example-s3/example-bucket --json
{
 "status": "success",
 "name": "example-bucket/example-file",
 "lastModified": "2024-01-01T00:59:20-08:00",
 "size": 123,
 "etag": "4pophfvzd8eo8pir7i2sgzn4nifz88jho-1234",   <--- example etag hash
 "type": "file",
 "metadata": {
  "Content-Type": "application/gzip"
 }
}
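
As a rough sketch of how this output could be consumed, the snippet below collects etags by object name from the JSON text. It assumes the field names shown above (`name`, `etag`, `type`) and that the real output is plain JSON without the arrow annotation; the function names are hypothetical and not part of g3t or mc. It accepts either one object per line or pretty-printed objects back to back.

```python
import json


def iter_json_objects(text):
    """Yield each JSON object found in a string of concatenated JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def etags_by_name(text):
    """Map object name to etag for every entry of type "file" (assumed field names)."""
    return {
        obj["name"]: obj["etag"]
        for obj in iter_json_objects(text)
        if obj.get("type") == "file" and "etag" in obj
    }
```

The text passed to `etags_by_name` could come from capturing the command's standard output, for example with `subprocess.run(..., capture_output=True, text=True)`.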

Steps for Implementing

Environment

bwalsh commented 5 months ago

addressed here https://github.com/ACED-IDP/gen3_util/blob/730584972ccd847b2dbdda89810a5abf36eb27ec/gen3_tracker/git/__init__.py#L68-L73