DataBiosphere / terra-notebook-utils

Utilities for the Terra notebook environment.
MIT License

Copying large files on GCP gives an error due to missing md5 hash #412

Open bgorissen opened 5 months ago

bgorissen commented 5 months ago

When copying data from DRS to GS with drs.copy(), files larger than 100 MB are copied in chunks via _copy_multipart_passthrough. That function performs an md5 check:

if dst_blob.md5 != src_blob.md5:

For GCP, the md5 property is retrieved via:

    @property
    def md5(self) -> str:
        gs_md5 = self._get_native_blob().md5_hash
        return base64.b64decode(gs_md5).hex()

However, composite objects have no md5 hash, so md5_hash is None for them:

Composite objects do not have an MD5 hash metadata field.

Passing None to base64.b64decode raises a TypeError. GCP still provides a crc32c hash for composite objects.
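
One possible way around this, sketched below, is to treat a missing md5 as "not available" and fall back to comparing crc32c values. This is only an illustration, not the library's code: b64_hash_to_hex and checksums_match are hypothetical helper names, and the sketch assumes both hashes arrive as base64 strings (or None), the way google-cloud-storage exposes md5_hash and crc32c.

```python
import base64
from typing import Optional


def b64_hash_to_hex(encoded: Optional[str]) -> Optional[str]:
    """Decode a base64 hash to hex, returning None when the hash is absent.

    Hypothetical helper: composite objects report no md5, so None is
    passed through instead of being handed to base64.b64decode.
    """
    if encoded is None:
        return None
    return base64.b64decode(encoded).hex()


def checksums_match(src_md5: Optional[str], dst_md5: Optional[str],
                    src_crc32c: Optional[str], dst_crc32c: Optional[str]) -> bool:
    """Compare md5 when both sides have it, otherwise fall back to crc32c.

    Hypothetical helper: raises ValueError when no checksum is shared,
    rather than silently reporting a match.
    """
    if src_md5 is not None and dst_md5 is not None:
        return src_md5 == dst_md5
    if src_crc32c is not None and dst_crc32c is not None:
        return src_crc32c == dst_crc32c
    raise ValueError("no common checksum available for comparison")
```

With helpers like these, the md5 property could return b64_hash_to_hex(gs_md5), and the check in _copy_multipart_passthrough could use the crc32c values whenever the destination is a composite object.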