CDLUC3 / mrt-doc

Documentation and Information regarding the Merritt repository

Audit - optimize using segment io digests for fixity #642

Closed: dloy closed this issue 3 years ago

dloy commented 3 years ago

Problem

Digest generation for fixity currently requires downloading the entire cloud file to local storage and then computing a digest over that file as a single I/O stream. This approach was originally adopted to avoid the problem of partial/failed downloads on large content.

Downloading large files locally has two problems:

- it requires temporary local disk space at least as large as the file being audited
- the full download must complete before the digest can be computed, adding significant elapsed time per file

Solution

This solution was suggested by the following article: Streaming large objects from S3 with ranged GET requests

Digest creation is handled by iteratively updating the digest with blocks of data. The typical approach (sketched below):

- initialize a digest
- update the digest with each block of data as it is read
- finalize the digest after the last block to produce the checksum
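As a minimal illustration of this pattern (not the Merritt code itself), Java's built-in MessageDigest supports exactly this block-wise usage; the 8 KB buffer size is an arbitrary choice here:

```java
import java.io.InputStream;
import java.security.MessageDigest;

public class BlockDigest {

    // Compute a digest by feeding the stream to MessageDigest in blocks,
    // so the whole file never has to fit in memory at once.
    public static byte[] digestStream(InputStream in, String algorithm) throws Exception {
        MessageDigest md = MessageDigest.getInstance(algorithm); // e.g. "SHA-256"
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            md.update(buffer, 0, read); // incrementally fold each block into the digest
        }
        return md.digest(); // finalize and return the checksum bytes
    }
}
```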

Segment approach (see the sketch after the next paragraph):

- look up the object's total length
- issue ranged S3 GET requests for fixed-size byte segments of the object
- update the digest with each segment as it arrives, never writing to local disk
- finalize the digest after the last segment

With partial S3 reads against cloud storage, the local file can be eliminated entirely. The trade-off: no temp disk space is needed, but additional memory is required to buffer each segment.
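A hedged sketch of the segment approach using the AWS SDK for Java v2 (the actual implementation lives in Merritt's CloudChecksum code; the bucket/key parameters and the 16 MB segment size are illustrative assumptions):

```java
import java.security.MessageDigest;
import software.amazon.awssdk.core.ResponseInputStream;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;
import software.amazon.awssdk.services.s3.model.HeadObjectRequest;

public class SegmentDigest {

    static final long SEGMENT_SIZE = 16L * 1024 * 1024; // illustrative 16 MB segments

    public static byte[] digestObject(S3Client s3, String bucket, String key) throws Exception {
        // Total object size, needed to compute the byte ranges
        long length = s3.headObject(
                HeadObjectRequest.builder().bucket(bucket).key(key).build()).contentLength();

        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] buffer = new byte[64 * 1024];

        for (long start = 0; start < length; start += SEGMENT_SIZE) {
            long end = Math.min(start + SEGMENT_SIZE, length) - 1; // HTTP ranges are inclusive
            GetObjectRequest req = GetObjectRequest.builder()
                    .bucket(bucket)
                    .key(key)
                    .range("bytes=" + start + "-" + end) // ranged GET for one segment
                    .build();
            try (ResponseInputStream<GetObjectResponse> in = s3.getObject(req)) {
                int read;
                while ((read = in.read(buffer)) != -1) {
                    md.update(buffer, 0, read); // fold the segment into the digest
                }
            }
        }
        return md.digest();
    }
}
```

Only one segment's worth of data is buffered at a time, so peak memory is bounded by the segment size rather than the object size.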

Tests

Current testing indicates:

Timing on a 4.66 GB file:

- AWS: 143.7 sec
- Wasabi: 135 sec
- SDSC: 155.2 sec

Install CloudChecksum into Audit

dloy commented 3 years ago

The original technique ran extremely fast on a large EC2 box. When running on a small box like the one used by audit, it came very close to hosing the system, and throughput was roughly 10% of what I was seeing on my large server (sandbox2).

The issue was memory.

Two changes allowed it to run at close to the speed seen on the larger box:
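Whatever the specific changes were, memory pressure in this design scales with the segment size times the number of segments held in flight at once. A hypothetical sketch of bounding that footprint through configuration (the property name and default are invented for illustration, not Merritt's actual fix):

```java
public class SegmentTuning {

    // Hypothetical knobs: smaller segments and fewer concurrent reads both
    // shrink the peak heap needed by a segmented digest run.
    static final long DEFAULT_SEGMENT_SIZE = 16L * 1024 * 1024; // placeholder, not Merritt's value

    public static long segmentSize() {
        String prop = System.getProperty("checksum.segment.size"); // invented property name
        return (prop != null) ? Long.parseLong(prop) : DEFAULT_SEGMENT_SIZE;
    }

    // Approximate peak buffer memory for a run: one buffer per in-flight segment.
    public static long peakBufferBytes(int inFlightSegments) {
        return segmentSize() * inFlightSegments;
    }
}
```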

Successfully running on: