digital-preservation / droid

DROID (Digital Record and Object Identification)
BSD 3-Clause "New" or "Revised" License

Spike - random access S3 #881

Open paulwellsagilej opened 1 year ago

paulwellsagilej commented 1 year ago

Make a version of DROID that identifies objects stored in S3.

OliverHannan commented 1 year ago

Thanks for the quick demo @paulwellsagilej, enjoyed the thinking.

Look forward to seeing a quick write-up (don't let perfect get in the way of done though!)

Would be good to run through it again, recorded, so we can share with colleagues at TNA in their own time.

paulwellsagilej commented 1 year ago

A new branch of DROID has been created called spike-s3. The code in this branch is not intended to be merged.

paulwellsagilej commented 1 year ago

S3RandomAccess.pdf

steve-daly commented 3 weeks ago

Some work was completed on this functionality, but the resulting code doesn't work, for reasons not yet identified. It would be good to continue the exploration, making use of the previous spike-s3 branch as appropriate.

As per the previous work, we would want to use the DROID CLI to request characterisation of a file (as happens currently), but with the file path pointing to an S3 URI. This could assume that the user calling DROID is able to access that S3 URI (e.g. using an AWS credentials file). For the proof of concept, just the CLI will be fine, but eventually we'd want the Java API to support this too.
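As a rough illustration of the first step, the CLI would need to tell an S3 URI apart from a local path and split it into bucket and key. This is only a sketch: `S3Location` and `parse` are hypothetical names, not anything in DROID or the spike-s3 branch, and it assumes the plain `s3://bucket/key` form.

```java
import java.util.Optional;

/** Hypothetical helper (not part of DROID): splits "s3://bucket/key" into its parts. */
final class S3Location {
    private static final String SCHEME = "s3://";

    final String bucket;
    final String key;

    private S3Location(String bucket, String key) {
        this.bucket = bucket;
        this.key = key;
    }

    /**
     * Returns the bucket and key if the argument looks like s3://bucket/key,
     * or empty if it is null, a local path, or missing either part.
     */
    static Optional<S3Location> parse(String pathArgument) {
        if (pathArgument == null
                || !pathArgument.regionMatches(true, 0, SCHEME, 0, SCHEME.length())) {
            return Optional.empty();
        }
        String rest = pathArgument.substring(SCHEME.length());
        int slash = rest.indexOf('/');
        // Need a non-empty bucket before the slash and a non-empty key after it.
        if (slash <= 0 || slash == rest.length() - 1) {
            return Optional.empty();
        }
        return Optional.of(new S3Location(rest.substring(0, slash), rest.substring(slash + 1)));
    }
}
```

Anything that does not parse as an S3 location would fall through to the existing local file handling. Splitting the string by hand rather than using `java.net.URI` avoids rejecting keys that contain spaces or other characters that are not legal in a URI.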

DROID should only read the bytes from S3 that are needed to characterise the file, and ideally the same byte wouldn't be downloaded from S3 more than once (within reason). We would therefore need a local cache, but its size could be capped to avoid accidentally caching entire huge files in cases where every byte needs to be read due to variable signatures.
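One possible shape for such a cache, sketched under assumptions: bytes are fetched in fixed-size blocks and kept in a size-capped, least-recently-used map. `RangeFetcher` and `BlockCachedReader` are hypothetical names; in a real version the fetcher would presumably wrap an S3 GetObject call with an HTTP `Range: bytes=start-end` header, and the object size would come from a metadata request.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical abstraction over one ranged read, e.g. an S3 GET with a Range header. */
interface RangeFetcher {
    /** Returns the bytes in [offset, offset + length); the range never runs past the object end. */
    byte[] fetch(long offset, int length);
}

/** Reads single bytes through a bounded LRU cache of fixed-size blocks. */
final class BlockCachedReader {
    private final RangeFetcher fetcher;
    private final long objectSize;
    private final int blockSize;
    private final Map<Long, byte[]> blocks;

    BlockCachedReader(RangeFetcher fetcher, long objectSize, int blockSize, final int maxBlocks) {
        this.fetcher = fetcher;
        this.objectSize = objectSize;
        this.blockSize = blockSize;
        // accessOrder = true keeps entries ordered least-recently-used first,
        // so removeEldestEntry evicts the block that was touched longest ago.
        this.blocks = new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
                return size() > maxBlocks;
            }
        };
    }

    /** Returns the unsigned byte at the given position, or -1 at or past the end of the object. */
    int read(long position) {
        if (position < 0) {
            throw new IllegalArgumentException("negative position: " + position);
        }
        if (position >= objectSize) {
            return -1;
        }
        long blockIndex = position / blockSize;
        long blockStart = blockIndex * blockSize;
        byte[] block = blocks.get(blockIndex);
        if (block == null) {
            // Cache miss: fetch the whole block in one request (the last block may be short).
            int length = (int) Math.min(blockSize, objectSize - blockStart);
            block = fetcher.fetch(blockStart, length);
            blocks.put(blockIndex, block);
        }
        return block[(int) (position - blockStart)] & 0xFF;
    }
}
```

With this shape, memory use is bounded by roughly `blockSize * maxBlocks` bytes. The trade-off is that a block evicted and then needed again is downloaded a second time, which is the "within reason" case above.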

We're interested to see how performant this would be and how it could be optimised. For example, DROID could read blocks of bytes at a time from S3 rather than individual bytes (to cut down on API requests), or pre-cache the start of the file by pre-calculating the range that will inevitably be read. We can come to this later, once a basic version is functionally working. It might be nice to count the number of S3 API requests made in each DROID run and report this.
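For the counting and pre-caching ideas, a minimal sketch might look like the following. Everything here is hypothetical: `ByteRangeSource` stands in for whatever performs the ranged S3 read, `CountingSource` wraps it to tally requests, and `PrefetchPlan.prefetchLength` assumes we can obtain, for each beginning-of-file signature, the furthest byte offset it might inspect. How DROID would actually expose that is not something this sketch knows.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical abstraction over one ranged read against an S3 object. */
interface ByteRangeSource {
    byte[] fetch(long offset, int length);
}

/** Decorator that counts ranged requests and bytes transferred through it. */
final class CountingSource implements ByteRangeSource {
    private final ByteRangeSource delegate;
    private final AtomicLong requests = new AtomicLong();
    private final AtomicLong bytes = new AtomicLong();

    CountingSource(ByteRangeSource delegate) {
        this.delegate = delegate;
    }

    @Override
    public byte[] fetch(long offset, int length) {
        requests.incrementAndGet();
        byte[] result = delegate.fetch(offset, length);
        bytes.addAndGet(result.length);
        return result;
    }

    long requestCount() {
        return requests.get();
    }

    long bytesFetched() {
        return bytes.get();
    }

    /** One-line summary that could be printed at the end of a run. */
    String report() {
        return "S3 range requests: " + requests.get() + ", bytes fetched: " + bytes.get();
    }
}

/** Hypothetical helper for choosing how much of the file start to fetch up front. */
final class PrefetchPlan {
    private PrefetchPlan() {
    }

    /**
     * Given the furthest offset each beginning-of-file signature may inspect (an assumed input),
     * returns the number of leading bytes worth fetching in a single request, capped at the object size.
     */
    static long prefetchLength(long[] maxBofExtents, long objectSize) {
        long furthest = 0;
        for (long extent : maxBofExtents) {
            furthest = Math.max(furthest, extent);
        }
        return Math.min(furthest, objectSize);
    }
}
```

Wrapping the source rather than instrumenting the cache keeps the count honest: it records only the requests that actually went out to S3, so runs with different block sizes or prefetch lengths can be compared directly.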