bondjimbond / islandora_westvault

Automated preservation for Islandora objects
GNU General Public License v3.0
1 stars 1 forks source link

Validate bags before updating and syncing #41

Closed bondjimbond closed 5 years ago

bondjimbond commented 5 years ago

Addresses #22

Adds validation functions to check bag contents against the object's datastreams.

If any datastream is not represented by a file in the Bag, the error is logged and the Bag is deleted. The object's PRESERVATION datastream is not updated, so it will still be ready to be preserved next time.

To test:

bondjimbond commented 5 years ago

@mjordan This ensures that at least a basic Bag is present. We can deal with the specifics of different plugins another time. What do you think of this approach?

bondjimbond commented 5 years ago

@mjordan I don't expect you'll have time to test this right away, but have you any thoughts about the approach?

mjordan commented 5 years ago

I agree with this approach, but have a couple of comments:

bondjimbond commented 5 years ago

We should be careful with using 'validate' or 'valid' in reference to Bags, since they a specific meaning, which is defined at https://tools.ietf.org/html/rfc8493#page-14. Validating a Bag does not including determining if a particular file is present in data/, only that the Bag must be "complete" as defined in the spec and that "Every checksum in every payload manifest and tag manifest has been successfully verified against the contents of the corresponding file."

So that is covered by the if ($bag->isValid()) check, correct?

In terms of the language used in the messages, I think that's covered... we have "is valid" or "does not validate" based on the check above, and the message from the checks I added just says "is missing files". Does that all check out?

I think "We can deal with the specifics of different plugins another time." is reasonable but dangerous in that if your users start enabling additional Islandora Bagit plugins and then creating Bags for deposit into WestVault, you risk having Bags there that might not meet your criteria.

Can you elaborate on the last bit (bags that might not meet my criteria)?

For an MVP, assuming this PR passes testing besides my own, I think this brings us there... but we will certainly want to add some improvements over time to support better checking with specific plugins.

mjordan commented 5 years ago

So that is covered by the if ($bag->isValid()) check, correct?

Yes, but that's all it does.

Can you elaborate on the last bit (bags that might not meet my criteria)?

Your criteria are that the Bag contains a specific set of files, e.g., OBJ.something, MODS.xml, etc. which Bag validation doesn't address.

bondjimbond commented 5 years ago

Your criteria are that the Bag contains a specific set of files, e.g., OBJ.something, MODS.xml, etc.

Well, that's what the checks introduced here are for... We check the bag contents for files, and compare against the list of datastreams. In theory, this code ensures that every Bag contains all expected files. It's possible due to plugin problems that they could be in the wrong subdirectories, perhaps, but other than that, the minimum criteria of "each datastream has a file" should be met here.

If the other plugins insert extra files that I'm not accounting for, that's a later step.

Unless I'm missing something? Or are we both saying the same thing but from different angles?

mjordan commented 5 years ago

We're saying the same thing. I'll try test this over the weekend.

bondjimbond commented 5 years ago

Cool, thanks! Unless I'm mistaken, once this is done we're pretty much set for a beta release, yes?

bondjimbond commented 5 years ago

Any chance you had an opportunity to test this, @mjordan?

mjordan commented 5 years ago

@bondjimbond testing didn't go as planned. After modifying the drush sript, for the "Un-preserve your object, then preserve again" step, I unchecked the "Preserve this object" box, saved, then checked it and resaved. Then reran drush -u 1 westvault-bagit --debug. There was no error, drush showed green:

[...]
Bag created and saved at sites/default/files/Bag-islandora_7.tgz [1.08 sec, 35.43 MB]                                                                     [status]
Files added: RELS-EXT.rdf, MODS.xml, DC.bin, OBJ.jpg, TECHMD.xml, TN.jpg, MEDIUM_SIZE.jpg, PRESERVATION.bin [1.08 sec, 35.43 MB]                          [status]
[...]

The bag was still in my owncloud directory. I also repeated these steps after clearing my cache.

Did I test it correctly?

bondjimbond commented 5 years ago

Hmm, did you delete the bag from the Owncloud directory before making the modification? (I should have included that in my testing steps, sorry.)

Also, try adding dd($bag_contents) at line 194 and check what's in position ['zero']; if that's not one of the datastream files, then the check won't pick it up. So instead pick whichever array key corresponds to a more important file.

mjordan commented 5 years ago

Sorry, still not getting it. I added the following to the drush command:

// FOR TESTING ONLY     
 unset($bag_contents['3']);

which should be removing the OBJ, but it's not. When I unzip the resulting bag, the OBJ is there. Consequently, there is no error.

bondjimbond commented 5 years ago

The idea here is that it unsets the file from the array that reports the files in the bag and before the two arrays are compared. It doesn't actually erase the OBJ, it just tricks the module into thinking the OBJ is missing.

So what should be happening:

If you followed all the steps above, but the newly-created Bag is not actually being deleted -- can you check the object's PRESERVATION datastream and report its contents?

mjordan commented 5 years ago

Followed the steps (plus deleting the bag before making the modification to the drush command), still no dice. There is no logged error in the drush output or in watchdog. The bag file is regenerated. Here is the object's PRESERVATION ds contents:

`

2019-07-09 15:03:5015626846301562684642

`

bondjimbond commented 5 years ago

Hmm... Can you report the result of dd($bag_contents) (put right after the line unset($bag_contents[0]), and dd($datastreams) right afterward?

mjordan commented 5 years ago

Added those lines, unpreserved, then represerved the object, and reran the drush command:

Array
(
    [1] => MEDIUM_SIZE
    [2] => MODS
    [3] => OBJ
    [4] => PRESERVATION
    [5] => RELS-EXT
    [6] => TECHMD
    [7] => TN
    [zero] => DC
)

Array
(
    [1] => RELS-EXT
    [2] => MODS
    [3] => DC
    [4] => OBJ
    [5] => TECHMD
    [6] => TN
    [7] => MEDIUM_SIZE
    [8] => PRESERVATION
)
mjordan commented 5 years ago

Heading to work on, can pick this up this evening.

bondjimbond commented 5 years ago

So then after that you unset $bag_contents[3], right? But before the comparison script?

mjordan commented 5 years ago

Here are my changes:

191   // Prevents an incorrect 'false' result in the array_search below.
192     $bag_contents['zero'] = $bag_contents[0];
193     unset($bag_contents[0]);
194     dd($bag_contents);
195     dd($datastreams);
196   // Compare $datastreams to $bag_contents.

and

205 // FOR TESTING ONLY     
206 unset($bag_contents['3']);
207     if (!isset($is_valid)) {
208       $is_valid = TRUE;
209     }
bondjimbond commented 5 years ago

@mjordan Crap... Did I give you the wrong line for the unset?

It should go right before this block of code:

  // Compare $datastreams to $bag_contents.
    foreach ($datastreams AS $datastream) {
      $datastream_locator = array_search($datastream, $bag_contents, TRUE);

      if ($datastream_locator == FALSE) {
        drush_log(dt("Bag at !path missing the !datastream datastream.", array('!path' => $bag_path, '!datastream' => $datastream), 'warning'));
        $is_valid = FALSE;
      }
    }
mjordan commented 5 years ago

Is it the "FOR TESTING ONLY" unset() that I put there?

bondjimbond commented 5 years ago

Yep. Stick that right above the // Compare $datastreams to $bag_contents. comment and it should work.

mjordan commented 5 years ago

OK, that prevented the bag from being created, and Drush told me "Bag /tmp/owncloud/Bag-islandora_7.tgz was is missing files and will be removed. Check directory permissions and try again." Is that the error I am looking for? If so, I recommend writing an entry to the watchdog since an admin might not know that the bag was not created.

mjordan commented 5 years ago

Or maybe some other even more in-your-face way of telling the admin there was a problem, like an email. Islandora Checksum Checker, for example, sends an email on checksum mismatch.

bondjimbond commented 5 years ago

@mjordan Yes, that's the error you want. So how would you suggest it get logged... Watchdog?

mjordan commented 5 years ago

I think Watchdog at a minimum.

bondjimbond commented 5 years ago

Let's do Watchdog for now. Once this is merged, another issue for emailing the admin.

With the latest commit, would you say this passes your tests?

mjordan commented 5 years ago

Yup!

bondjimbond commented 5 years ago

Thanks!

mjordan commented 5 years ago

Here's how Checksum Checker does it:

https://github.com/Islandora/islandora_checksum_checker/blob/7.x/islandora_checksum_checker.module#L415