exponential-decay / demystify

Engine for analysis of Siegfried export files and DROID CSV. The tool has three purposes: to break the export into its components and store them within a SQLite database; to create additional columns to augment the output where useful; and to query the SQLite database, outputting results in a readable form useful for analysis by researchers and archivists within digital preservation departments in memory institutions. The tool will find duplicates, unidentified files, blacklisted objects, character encoding issues, and more.
http://www.openplanetsfoundation.org/blogs/2014-06-03-analysis-engine-droid-csv-export
zlib License

Report on how many dotfiles/resource forks #86

Open kieranjol opened 2 years ago

kieranjol commented 2 years ago

So I've a question that relates a bit to #83. I often see several thousand AppleDouble resource forks in reports. The donor doesn't expect to donate these to us: they're dotfiles, so they're hidden on macOS, and outside of a disk image or zip they won't be selected for preservation. I'd love for demystify to be able to say:

More thought might be needed in terms of how this might be most useful, and what kinds of other files might need to be lumped into the aggregate value. I know that sometimes resource forks and dotfiles will be selected for preservation in some contexts, but it would be good to know how many are there, and how many files are not resource forks or hidden files. Curious to know what you think anyhow - just tried demystify-lite for the first time and I'm so impressed!
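
As a rough illustration of the kind of count being asked for, here is a minimal sketch. It is not demystify code: `count_hidden` is a hypothetical helper, and it assumes the file paths have already been pulled out of a Siegfried or DROID export as plain strings. It treats any file name starting with `._` as an AppleDouble resource fork and any other name starting with `.` as a dotfile.

```python
def count_hidden(paths):
    """Tally AppleDouble resource forks, other dotfiles, and everything else.

    Hypothetical helper: `paths` is assumed to be an iterable of path strings
    taken from an export; nothing here reads the export itself.
    """
    counts = {"appledouble": 0, "dotfile": 0, "other": 0}
    for path in paths:
        # Accept both "/" and "\" separators, then keep only the file name.
        name = path.replace("\\", "/").rsplit("/", 1)[-1]
        if name.startswith("._"):
            counts["appledouble"] += 1
        elif name.startswith("."):
            counts["dotfile"] += 1
        else:
            counts["other"] += 1
    return counts
```

Which other files belong in the aggregate (for example `.DS_Store` or `Thumbs.db`) is left open here, as it is in the question above.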

ross-spencer commented 2 years ago

I can conceive of adding this in a section on its own; it's a good idea. Will also have a think. If you have any good Siegfried/DROID exports with some of this information in, it will be a great help, and we'll anonymize that and put it into the unit tests.

Also, in the meantime, perhaps have a look at how the denylist impacts the output for you and see if it begins to get close to what you're thinking.

Regarding implementation, there's some work to do to make each section more customizable, something like the strategy pattern to enable different sections to be configured for use, and output, as desired by the caller. That's a little bit further down the line once the Py3 release is stabilized and packaging works. Until then, adding another section might feel quite onerous for anyone taking it on - but I'm also more than happy to take a look with the right data to help support it.
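
To make the strategy-pattern idea a little more concrete, here is a minimal sketch of how configurable sections could look. None of these names exist in demystify: `SECTIONS`, `build_report`, the two section functions, and the `name` key on each row are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]
# Each section is a strategy: it takes the rows and returns its report text.
Section = Callable[[List[Row]], str]


def total_files(rows: List[Row]) -> str:
    return f"Total files: {len(rows)}"


def hidden_files(rows: List[Row]) -> str:
    # Assumes each row carries a "name" key holding the file name.
    hidden = sum(1 for row in rows if row["name"].startswith("."))
    return f"Hidden files: {hidden}"


# Hypothetical registry mapping section names to strategies.
SECTIONS: Dict[str, Section] = {
    "total": total_files,
    "hidden": hidden_files,
}


def build_report(rows: List[Row], enabled: List[str]) -> str:
    """Run only the sections the caller asked for, in the order given."""
    return "\n".join(SECTIONS[name](rows) for name in enabled)
```

With this shape, a new section such as a dotfile/resource fork count would be one more function plus one registry entry, and the caller decides which sections appear in the output.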