mcampos-quinn opened this issue 6 years ago
It sounds like you're currently storing them as sidecars, which makes sense to me?
Of course you wanna store them in a DB as well, but sidecars are very handy.. Unless I'm misreading..
Yeah! Ideally I would store them as sidecars along with the AIP on tape. This is a first pass, and since I'm using hashdeep
I can't yet think of a good way to ignore the manifest during the audit process.
In the current setup/folder structure I have there's nowhere sane for the sidecar to live and not screw up the audit. So... I have to think about it a bit. In a previous iteration I was using bagit... which I am not sure I want to do. I also looked at your manifest script which has an option to store the manifest in the target directory. I may try to adopt that one?
I dunno if my manifest script would help.. I just ran some of your functions directly there and all seemed to work
>>> a = '/home/kieranjol/fakedlete'
>>> makeMetadata.make_hashdeep_manifest(a)
'/home/kieranjol/hashdeep_manifest_fakedlete_2018-03-30T00:35:38.txt'
>>> makeMetadata.hashdeep_audit(a, '/home/kieranjol/hashdeep_manifest_fakedlete_2018-03-30T00:35:38.txt')
0
hashdeep: Audit passed
True
Looks like the manifest is not part of the audit there. I'm curious to know what's happening on your end - the audit is failing because it's picking up the manifest?
I was using md5deep at the start and then moved to custom hashlib based scripts just to get some more control and to limit dependencies. I like your manifest though via hashdeep, never seen them like that before. I don't like bagit either..
Also @kieranjol I just read your blog post... So timely! Thanks for sharing that real life application of this stuff.
Cheers!
Hey thanks for checking it out! The audit was failing as I first wrote it because the manifest was getting written to the directory getting manifested... So the audit was like 'wtf is this manifest doing here??' Both the manifest and audit functions now write to os.pardir
which is very flimsy. Another option is to make a bagit style super-parent directory where manifests live... I am going to have to eat a lot of Easter chocolate and ruminate until next week.
Yes, my hands are kind of tied with the built-in hashdeep behavior, but interestingly there's an open issue from 2012 in the hashdeep source that proposes an 'ignore this file' function. I just need to learn C and fix the problem myself!
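For the record, this is the rough shape of what I'm doing now (a sketch, not the exact pymm code - the flag choices are just my reading of the hashdeep man page):

import os
import subprocess
from datetime import datetime

def make_hashdeep_manifest(target_dir):
    """Write a recursive hashdeep manifest one level up from target_dir,
    so the audit never sees the manifest itself."""
    timestamp = datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
    parent = os.path.abspath(os.path.join(target_dir, os.pardir))
    basename = os.path.basename(os.path.normpath(target_dir))
    manifest = os.path.join(
        parent,
        "hashdeep_manifest_{}_{}.txt".format(basename, timestamp)
    )
    with open(manifest, "w") as outfile:
        # -r = recurse, -l = relative paths so the manifest stays portable
        subprocess.call(["hashdeep", "-r", "-l", "."], stdout=outfile, cwd=target_dir)
    return manifest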
Ah... I'd never heard of os.pardir
before, cool!
I think the bagit-style way of doing things might make sense. We have a slightly bloated system like this:
oe8888/
├── 2597f2ef-a18b-43e3-a809-f9c548ee02ce
│   ├── logs
│   │   ├── 2597f2ef-a18b-43e3-a809-f9c548ee02ce_sip_log.log
│   │   ├── 650x0m2.pdf_manifest.md5
│   │   ├── metadata_manifest.md5
│   │   └── stupid_memes.mkv_manifest.md5
│   ├── metadata
│   │   ├── 650x0m2.pdf_exiftool.json
│   │   ├── 650x0m2.pdf_siegfried.json
│   │   ├── stupid_memes.mkv_mediainfo.xml
│   │   └── stupid_memes.mkv_mediatrace.xml
│   └── objects
│       ├── 650x0m2.pdf
│       └── stupid_memes.mkv
└── 2597f2ef-a18b-43e3-a809-f9c548ee02ce_manifest.md5
and the manifest looks like this:
9e575153a5f8b0f471bb4803f7f4086d 2597f2ef-a18b-43e3-a809-f9c548ee02ce/logs/2597f2ef-a18b-43e3-a809-f9c548ee02ce_sip_log.log
de0de80821c5f02374ea37e59af30830 2597f2ef-a18b-43e3-a809-f9c548ee02ce/logs/650x0m2.pdf_manifest.md5
e9fa539dff3957d9a1f3c4b069ca3857 2597f2ef-a18b-43e3-a809-f9c548ee02ce/logs/metadata_manifest.md5
38c4952269a76e0619fd1cd3b9ac794c 2597f2ef-a18b-43e3-a809-f9c548ee02ce/logs/stupid_memes.mkv_manifest.md5
e85aebdfd33143a7ab0a657b1b4aa90e 2597f2ef-a18b-43e3-a809-f9c548ee02ce/metadata/650x0m2.pdf_exiftool.json
319b4b09a956c01b6d48ffc99a42ec74 2597f2ef-a18b-43e3-a809-f9c548ee02ce/metadata/650x0m2.pdf_siegfried.json
10144a0c96c5eaebc88a6caae6f19c91 2597f2ef-a18b-43e3-a809-f9c548ee02ce/metadata/stupid_memes.mkv_mediainfo.xml
3d8248e943954fbf20b79b8ae6f87036 2597f2ef-a18b-43e3-a809-f9c548ee02ce/metadata/stupid_memes.mkv_mediatrace.xml
c3574b79421d6d85e3094906483e7aa8 2597f2ef-a18b-43e3-a809-f9c548ee02ce/objects/650x0m2.pdf
e53b549618ec14d7199bd2fbbf6e4bfa 2597f2ef-a18b-43e3-a809-f9c548ee02ce/objects/stupid_memes.mkv
There are reasons why we have that parent ID on its own, with the UUID folder and manifest beneath.. Basically that OE id relates to the SPECTRUM collections management procedures. Prior to accessioning, we register objects in an Object Entry register - if they are accessioned, they get an aaaXXXXX number and that OE number gets renamed. Kinda awkward, but the UUID is a sort of permanent ID that doesn't change from when the file gets OE'd. It's not the best solution, but it was a compromise that allowed us to make progress. It works very well with our systems anyhow.
When we brought in that system, copyit.py
was easily changed so that it would figure out that a manifest already existed even if you used the parent OEXXXX folder as the input.
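Not the actual copyit.py logic, but the general idea is just to look in both the folder you were handed and its parent for a sidecar manifest, something like:

import os

def find_existing_manifest(source_dir):
    """Return an existing sidecar md5 manifest for source_dir, if there is one."""
    candidates = [
        source_dir,
        os.path.abspath(os.path.join(source_dir, os.pardir)),
    ]
    for directory in candidates:
        for entry in sorted(os.listdir(directory)):
            if entry.endswith("_manifest.md5"):
                return os.path.join(directory, entry)
    return None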
Neat! Lol, stupid memes
. I think that type of structure is what I'll end up with. Since our CMS is just a FileMaker database that has grown organically over like 20 years, we have our own quirks. We are also taking in assets that are formally accessioned and those without any kind of accession ID, so the UUID created on ingest levels the playing field a bit. I currently have the manifest named for the 'canonical' name of an ingested object (I think just the basename of a file or dir) for human readability, but maybe the UUID makes more sense in the manifest name.
Thanks for your continued input! It's so appreciated!
BTW - I see that you have colons in your audit filenames, I think these might start to cause problems when writing to LTO - we definitely had issues anyhow - https://www.ibm.com/support/knowledgecenter/en/STQNYL_2.4.0/ltfs_restricted_characters.html
Thanks for the heads up! I thought they might be an issue somewhere down the line....
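A minimal fix would just be to drop the colons from the timestamp when the filenames get built (hypothetical helper, not in pymm yet):

from datetime import datetime

def lto_safe_timestamp():
    # '2018-04-02T17-02-29' instead of '2018-04-02T17:02:29',
    # avoiding LTFS-restricted characters like ':'
    return datetime.now().strftime("%Y-%m-%dT%H-%M-%S")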
I restructured the SIP directory to include a parent folder that encloses the package and is named the same as the ingest UUID. This feels a little dirty to me but maybe it's no big deal. I will close this issue once I think that name clash through a bit more.
The hashdeep
manifest and audit files can now live in the parent directory as sidecars to the package. I also have the final local rsync
log live in that parent directory.
This brings to mind another question: I previously had the SIPs rsync from the output directory to a staging area for writing to LTO. This was deemed necessary since the processing computer and the machine attached to the LTO decks were not the same. Since they are now on the same host, I should add logic to detect that and skip the pointless movement of files within the same filesystem. (*open a new issue)
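Something like this (hypothetical check, variable names made up) would be enough to decide whether the staging rsync is pointless:

import os

def same_filesystem(path_a, path_b):
    """True if both paths live on the same device, i.e. rsync-ing between
    them is just shuffling bytes around one filesystem."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev

# e.g. if same_filesystem(output_dir, lto_staging_dir): skip the rsync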
Note to self: check viability of hashdeep_audit
on LTO since it requires the ability to os.chdir()
to the target dir. My suspicion is that's a no-go. (*open a new issue)
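One idea to test: let subprocess set the working directory instead of os.chdir()-ing the whole process. A sketch (not the current hashdeep_audit), and whether it behaves on a mounted LTFS volume is exactly the open question:

import os
import subprocess

def hashdeep_audit(target_dir, manifest):
    # -a = audit mode, -k = known-hashes file, -r = recurse, -l = relative paths
    exit_code = subprocess.call(
        ["hashdeep", "-r", "-l", "-a", "-k", os.path.abspath(manifest), "."],
        cwd=target_dir,
    )
    return exit_code == 0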
One issue (is it an issue?) this SIP structure does create is that it breaks the logic that checks if a given object has been ingested and is already sitting in the output dir (based on the temp id that is formed from a hash of the filepath of the input object). I should change that logic to check something more robust (reading the db for an entry for an object with the same name? is that too strict?) or get rid of it and trust ourselves to not ingest something a million times. (*open new issue)
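Very rough idea of what the sturdier check could look like - reading the db for prior ingests of the same object name; the table/column names and the sqlite layer here are made up purely for illustration:

import sqlite3

def already_ingested(db_path, object_name):
    """Return ingest UUIDs previously recorded for this object name."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT ingest_uuid FROM ingests WHERE object_name = ?",
        (object_name,),
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]  # empty list == not seen before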
Here's the current SIP structure:
1a9430d4-e36e-47ae-85dd-9e7800e50ea5/
├── 1a9430d4-e36e-47ae-85dd-9e7800e50ea5
│   ├── metadata
│   │   ├── logs
│   │   │   ├── e55f42d892_20180402_170139_ingestfile-log.txt
│   │   │   ├── ffmpeg-3.4.2-1~16.04.york0.2_20180402_170148_makeDerivs.txt
│   │   │   ├── ffmpeg-3.4.2-1~16.04.york0.2_20180402_170213_makeDerivs.txt
│   │   │   ├── rsync_log_noise_05034_pm0003074_r01of05.mp4_20180402_170139.txt
│   │   │   └── rsync_log_noise_05034_pm0003074_r02of05.mp4_20180402_170204.txt
│   │   ├── noise_05034_pbcore.xml
│   │   └── objects
│   │       ├── noise_05034_pm0003074_r01of05.mp4_frame-md5.txt
│   │       ├── noise_05034_pm0003074_r01of05.mp4_mediainfo.xml
│   │       ├── noise_05034_pm0003074_r02of05.mp4_frame-md5.txt
│   │       ├── noise_05034_pm0003074_r02of05.mp4_mediainfo.xml
│   │       └── resourcespace
│   │           ├── noise_05034_pm0003074_r01of05_lrp.mp4_mediainfo.xml
│   │           └── noise_05034_pm0003074_r02of05_lrp.mp4_mediainfo.xml
│   └── objects
│       ├── noise_05034_pm0003074_r01of05.mp4
│       ├── noise_05034_pm0003074_r02of05.mp4
│       └── resourcespace
│           ├── noise_05034_pm0003074_r01of05_lrp.mp4
│           └── noise_05034_pm0003074_r02of05_lrp.mp4
├── hashdeep_audit_1a9430d4-e36e-47ae-85dd-9e7800e50ea5_2018-04-02T17-02-29.txt
├── hashdeep_manifest_1a9430d4-e36e-47ae-85dd-9e7800e50ea5_2018-04-02T17-02-28.txt
└── rsync_log_1a9430d4-e36e-47ae-85dd-9e7800e50ea5_20180402_170229.txt
I guess the audit/manifest/log could sit within that UUID folder alongside your objects and metadata dirs, it's not the prettiest but it could work and it would remove the need for the parent duplicate ID?
Something like:
1a9430d4-e36e-47ae-85dd-9e7800e50ea5/
├── hashdeep_audit_1a9430d4-e36e-47ae-85dd-9e7800e50ea5_2018-04-02T17-02-29.txt
├── hashdeep_manifest_1a9430d4-e36e-47ae-85dd-9e7800e50ea5_2018-04-02T17-02-28.txt
├── metadata
├── objects
│   └── noise_05034_pm0003074_r01of05.mp4
└── rsync_log_1a9430d4-e36e-47ae-85dd-9e7800e50ea5_20180402_170229.txt
You might need to alter copy scripts and such to reflect it but it could work...
Yeah I think if I just make a manifest for and validate the objects
dir I can skip the parent folder altogether as you suggest. I'm not sure it's worth creating extra layers of confusing directories for the sake of including my metadata files in the audit...
Also, I don't think there's a pretty way to do this at all, so meh. The tree
output above didn't paste correctly so it looks even uglier than it really is, but anything I can do to simplify is probably for the best.
And thanks again! Your comments are really appreciated!
Yeah, I guess the objects
dir is the 'focus of preservation', so I think you'd be fine just having a manifest for that. Like we end up running different scripts at different times, so the logfile keeps getting updated, or we add extra files, so I end up having to write scripts that update the manifests a lot: https://github.com/kieranjol/IFIscripts/blob/master/ififuncs.py#L972
Actually a lot of my recent scripts have been based around logging changes and updating manifests. I think I'd feel a bit better having checksums for the whole package, but it definitely does complicate things if you need to update stuff.
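Roughly the shape of it (not the ififuncs implementation, just a sketch of appending checksums for newly added files to an existing md5 manifest):

import hashlib
import os

def append_to_manifest(manifest_path, new_files, base_dir):
    """Add md5 lines for new_files to an existing manifest, paths relative to base_dir."""
    with open(manifest_path, "a") as manifest:
        for filepath in new_files:
            md5 = hashlib.md5()
            with open(filepath, "rb") as f:
                for chunk in iter(lambda: f.read(1024 * 1024), b""):
                    md5.update(chunk)
            relative = os.path.relpath(filepath, base_dir)
            manifest.write("{}  {}\n".format(md5.hexdigest(), relative))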
I had poked around in your update scripts and I think at some point I'll need to consider that type of process. I'm not totally sure yet how to approach it and until I come up with a specific need I think I will lurk a bit on this question.
In the meantime I think I will try out the structure mentioned above that validates just the objects
and, uh, keep thinking about it. I'm hoping I can do a bit of research on this as well and make like an informed decision or something ;)
Now hashdeep manifests and audit files are written to the parent directory of the dir being hashed/audited. That's silly, so come up with a better plan.
Plan A: write the files to /tmp and then store them as blobs (and as text?) in the db.
Plan B: store the files locally in a permanent storage place (really don't like this idea, but will prob start here until the db is ready)
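For Plan A, something along these lines could work (sqlite here purely for illustration - the real db and table layout are TBD):

import sqlite3

def store_manifest_blob(db_path, ingest_uuid, manifest_path):
    """Stash the manifest text file in the db as a blob, keyed by ingest UUID."""
    with open(manifest_path, "rb") as f:
        blob = f.read()
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS manifests (ingest_uuid TEXT, manifest BLOB)"
    )
    conn.execute(
        "INSERT INTO manifests VALUES (?, ?)",
        (ingest_uuid, sqlite3.Binary(blob)),
    )
    conn.commit()
    conn.close()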