digital-preservation / droid

DROID (Digital Record and Object Identification)
BSD 3-Clause "New" or "Revised" License

Ability to characterise files in S3 storage #860

Open steve-daly opened 1 year ago

steve-daly commented 1 year ago

A suggested new feature for consideration.

Currently, to characterise files held in cloud storage, the files must first be downloaded locally before they can be analysed. For large files (e.g. video) this is a considerable overhead if only the small header of the file needs to be checked.

I'm proposing that DROID be able to access files natively in cloud (S3 or S3-compatible) storage from the GUI, the command line and the API.

Suitable credentials (and maybe a region?) could be supplied to DROID. Then, when navigating to folders or files in the GUI, an initial S3 path could be given in the form s3://bucket/ or s3://bucket/prefix1/prefix2/ etc., from which the file browser would begin navigation.
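To make the path format concrete, here is a minimal sketch of how such a location could be split into a bucket and a key prefix using only the Java standard library. The class name `S3Location` and its methods are hypothetical and are not part of DROID. It also assumes the key contains only characters that are legal in a URI; keys with spaces or other special characters would need extra handling.

```java
import java.net.URI;

/** Hypothetical helper (not a DROID class): splits an s3:// location into bucket and key prefix. */
final class S3Location {
    final String bucket;
    final String prefix; // empty string means "start at the bucket root"

    private S3Location(String bucket, String prefix) {
        this.bucket = bucket;
        this.prefix = prefix;
    }

    static S3Location parse(String location) {
        // URI.create throws IllegalArgumentException for syntactically invalid input.
        URI uri = URI.create(location);
        if (!"s3".equalsIgnoreCase(uri.getScheme())) {
            throw new IllegalArgumentException("Not an s3:// location: " + location);
        }
        // getAuthority() is used rather than getHost() so that bucket names
        // which are not valid host names (e.g. containing '_') still parse.
        String bucket = uri.getAuthority();
        if (bucket == null || bucket.isEmpty()) {
            throw new IllegalArgumentException("Missing bucket name: " + location);
        }
        String path = uri.getPath() == null ? "" : uri.getPath();
        // S3 keys have no leading slash, so drop the one that follows the bucket.
        String prefix = path.startsWith("/") ? path.substring(1) : path;
        return new S3Location(bucket, prefix);
    }

    /** Rebuilds the s3:// form, e.g. for use in report paths. */
    String toUri() {
        return "s3://" + bucket + "/" + prefix;
    }
}
```

With this sketch, `S3Location.parse("s3://bucket/prefix1/prefix2/")` yields the bucket `bucket` and the prefix `prefix1/prefix2/`, while a `file:` location is rejected.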

DROID would then access the required bytes of the files using S3 APIs rather than traditional file access methods.
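As a rough illustration of "accessing only the required bytes": S3's GetObject accepts a standard HTTP `Range` header, so a header check could fetch just the first few bytes of an object. The sketch below is not DROID's actual I/O layer. `RangedSource`, `RangedReads` and `InMemorySource` are hypothetical names, and the in-memory class merely stands in for an S3-backed implementation so that the example runs without any AWS dependency.

```java
import java.util.Arrays;

/** Hypothetical abstraction: fetch a byte range without downloading the whole object. */
interface RangedSource {
    byte[] read(long offset, int length);
}

final class RangedReads {
    /** HTTP Range header value for `length` bytes starting at `offset`; the end position is inclusive. */
    static String rangeHeader(long offset, int length) {
        if (offset < 0 || length <= 0) {
            throw new IllegalArgumentException("offset must be >= 0 and length > 0");
        }
        return "bytes=" + offset + "-" + (offset + length - 1);
    }

    /** True if the object starts with the given magic bytes, requesting only that many bytes. */
    static boolean startsWith(RangedSource source, byte[] magic) {
        byte[] head = source.read(0, magic.length);
        return Arrays.equals(head, magic);
    }
}

/** In-memory stand-in for an S3-backed source, used only to exercise the sketch. */
final class InMemorySource implements RangedSource {
    private final byte[] data;
    int bytesRequested = 0; // counts how many bytes were actually served

    InMemorySource(byte[] data) {
        this.data = data;
    }

    @Override
    public byte[] read(long offset, int length) {
        // Clamp the range to the available data, as a ranged GET would.
        int from = (int) Math.min(offset, data.length);
        int to = (int) Math.min(offset + length, data.length);
        bytesRequested += to - from;
        return Arrays.copyOfRange(data, from, to);
    }
}
```

A real implementation would pass the value from `rangeHeader` to the S3 client of choice. Signatures that sit at the end of a file or at variable offsets would need further ranged requests, so the saving depends on the signature being matched.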

In reports, the paths to these files could look like s3://bucket/prefix1/prefix2/file.ext etc. rather than file:/C:/folder1/folder2/file.ext etc.

This would make DROID considerably more efficient when characterising or re-characterising files in cloud storage, and would enhance cloud-based workflows (e.g. running DROID from a Lambda using the Java API).

Just enabling this capability for the API and/or Command Line could be an initial step before integrating S3 into the GUI file browser.

paulwellsagilej commented 1 year ago

It's certainly a good idea. My initial thoughts: DROID would need to report if S3 object permissions blocked access (the user may not realise access is blocked and conclude that DROID is not working). DROID would also need to be aware that the S3 object may be versioned, in which case re-running a profile in the future may yield different results. It could work well if a stream is opened and an early match is found in the header.

paulwellsagilej commented 1 year ago

Another thought: the DROID input file filtering could be made aware that it is working with S3. Instead of requesting a wildcard listing of the bucket's objects and applying the filtering to the results, the S3 listing request could be narrowed to match the profile as it is created. That would run faster because AWS would perform the input filtering for us.
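One caveat worth sketching: S3 object listing can be narrowed server-side by key prefix (and delimiter), but it does not accept wildcards or suffix filters. A possible approach is therefore to push the literal leading part of a filter into the listing request and apply the remainder client-side. The sketch below assumes, purely for illustration, that the filter is a simple glob with `*` and `?`; DROID's real filter criteria are richer than this, and `ListingFilter` is a hypothetical name. The list of keys stands in for the result of a listing call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper: splits a glob into an S3 listing prefix plus a client-side check. */
final class ListingFilter {
    /** The literal part of the glob before its first wildcard; usable as the listing prefix. */
    static String literalPrefix(String glob) {
        for (int i = 0; i < glob.length(); i++) {
            char c = glob.charAt(i);
            if (c == '*' || c == '?') {
                return glob.substring(0, i);
            }
        }
        return glob; // no wildcard: the whole pattern is literal
    }

    /** Converts the simple glob to a regex; note that '*' here also matches '/'. */
    static Pattern toRegex(String glob) {
        StringBuilder sb = new StringBuilder();
        for (char c : glob.toCharArray()) {
            if (c == '*') {
                sb.append(".*");
            } else if (c == '?') {
                sb.append('.');
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString());
    }

    /** Simulates the two stages: prefix narrowing (done by S3), then the residual match. */
    static List<String> select(List<String> allKeys, String glob) {
        String prefix = literalPrefix(glob);
        Pattern residual = toRegex(glob);
        List<String> selected = new ArrayList<>();
        for (String key : allKeys) {
            boolean returnedByListing = key.startsWith(prefix); // server-side in practice
            if (returnedByListing && residual.matcher(key).matches()) {
                selected.add(key);
            }
        }
        return selected;
    }
}
```

For a filter such as `videos/2023/*.mp4`, only keys under `videos/2023/` would be listed at all, and the `.mp4` check would still run in DROID. Filters with no literal leading part (e.g. `*.mp4`) would gain nothing from this and would still require a full listing.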

steve-daly commented 1 year ago

Yes, we would have to make sure any error reporting is meaningful, as there is a range of things that could prevent access to certain files. We could look at the current behaviour for shared drives with similar issues, e.g. where some files within a path can't be accessed due to permissions. If we are happy with that error reporting, then we could aim to match it for S3 access, rather than just passing verbose AWS messages through.
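As a sketch of what "not just passing verbose AWS messages through" might look like, the snippet below maps a few well-known S3 error codes (`AccessDenied`, `NoSuchKey`, `NoSuchBucket`) to short, user-facing messages. The class name and the message wording are assumptions for illustration only; they are not DROID's existing error text, which would need to be checked against its current behaviour for inaccessible local files.

```java
/** Hypothetical mapping from S3 error codes to concise messages for a profile or report. */
final class S3AccessErrors {
    static String describe(String s3ErrorCode, String objectUri) {
        if (s3ErrorCode == null) {
            return "Could not read " + objectUri + " (unknown S3 error)";
        }
        switch (s3ErrorCode) {
            case "AccessDenied":
                // Permissions problem: make clear the fault is access, not identification.
                return "Access denied: " + objectUri;
            case "NoSuchKey":
                return "File not found: " + objectUri;
            case "NoSuchBucket":
                return "Bucket not found: " + objectUri;
            default:
                // Keep the raw code so unexpected failures can still be diagnosed.
                return "Could not read " + objectUri + " (S3 error: " + s3ErrorCode + ")";
        }
    }
}
```

Keeping the raw error code in the fallback case means unfamiliar failures stay diagnosable without exposing the full AWS message for the common ones.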

I did think about versioning, but we have the same issue with local files too: they can be changed or updated between runs of DROID. We already include Last Modified Date in the reports, so we could get that for S3 files too, and hopefully the situation would be equivalent. It might be useful to consider providing the S3 version ID as an optional field in the exports, although I wanted to keep things as close to the current (output) behaviour as possible.