MarcusBarnes / islandora_compound_batch

Provides the basic ability to batch import compound objects into Islandora.
GNU General Public License v3.0
3 stars 12 forks source link

Sample Batch Set of Compound Object Not Being Ingested properly #2

Closed MarcusBarnes closed 7 years ago

MarcusBarnes commented 7 years ago

@CadenArmstrong has provided me with a sample batch set of compound compound objects that are not being ingested as expected. The sample adheres to the documentation in the README. The drush commands appear to work as expected, but when you visit the --parent object, no items appear. Viewing the object PIDs, you see all the items in batch, rather than seperate compound objects.

Possible source of problem to investigate:

Double check Islandora versions where islandora_compound_batch is being used without issue against versions of modules in islandora_vagrant. Use test compound objects in islandora_vagrant to see if the issue is now appearing for batch sets that Marcus knows use to function as expected.

MarcusBarnes commented 7 years ago

The initial parent_pid from the islandora_compound_batch database table for the given batch_id is used in RELS-EXT of all subsequent compound objects in the batch.

The code in https://github.com/MarcusBarnes/islandora_compound_batch/blob/master/includes/object.inc that deals with adding the relationships is no longer functioning as expected. This is where I'll be looking next in order to help debug and resolve this issue.

MarcusBarnes commented 7 years ago

Progress update. I have isolated what is causing the issue and am working on a fix. I'll create a pull-request once I've completed the fix. Thank you.

MarcusBarnes commented 7 years ago

I've identified two assumptions that resulted in this issue. The first was that value of the content attribute of the child element in the generated structure files would be integers. This has been addressed and I've pushed the changes to https://github.com/MarcusBarnes/islandora_compound_batch/tree/issue-2 branch. The second assumption was that the value of the content attribute of the child elements generated by the structure file would be unique.

This module was previously successfully used during a migration from CONTENTdm where the folder structure included the "pointer" as the folder name for child objects. This, plus having the batch set ID provided a way for inferIslandoraCompoundBatchRow row to either return a unique row or null (See https://github.com/MarcusBarnes/islandora_compound_batch/blob/master/includes/object.inc#L89). This allows Islandora compound batch to set the parent-child relationships.

As a temporary fix, those interested in using this module should consider prefix the folder names of child objects using the folder name of the parent folder. Using the folder structure from the README, this would look something like this:

batch_directory/
├── compound_object_1
│   ├── compound_object_1_child_1
│   ├── compound_object_1_child_2
│   └── compound_object_1_child_3
├── compound_object_2
│   ├── compound_object_2_child_1
│   └── compound_object_2_child_2
└── compound_object_3
    ├── compound_object_3_child_1
    ├── compound_object_3_child_2
    ├── compound_object_3_child_3
    └── compound_object_3_child_4

Using unique folder names should also work, but prefixing provides additional context.

@mjordan Do you think requiring unique folder names would be too burdensome? I'd like to avoid this requirement if possible. I plan on investigating further if there's a way of adding more information during the preprocessing phase of objects which can then be used during the ingest phase to get a unique row (or null) from inferIslandoraCompoundBatchRow.

@CadenArmstrong So that you can progress with your work, please clone the issue-2 branch and adjust the folder names in your package appropriately.

mjordan commented 7 years ago

@MarcusBarnes yes, I think we should avoid requiring unique folder names.

Assuming that the parent (i.e., compound object) folder name is unique (and this is probably a safe assumption since all of these folders sit in the same folder), would a simple fix be to have the structure file generation script concatenate the compound and the child folder name to provide a unique value for the content attribute for the child element?

MarcusBarnes commented 7 years ago

@mjordan Agreed. Good suggestion for a fix - I'll put aside soon to see if I can fully resolve this issue (not requiring unique folder names).

mjordan commented 7 years ago

Just tested the issue-2 branch, but using a package directory structure that does provide unique directory names, e.g.,

/input_directory/
├── 92
│   ├── 90
│   │   ├── MODS.xml
│   │   └── OBJ.mp3
│   ├── 91
│   │   ├── MODS.xml
│   │   └── OBJ.mp3
│   ├── MODS.xml
│   └── structure.xml
└── 95
    ├── 93
    │   ├── MODS.xml
    │   └── OBJ.mp3
    ├── 94
    │   ├── MODS.xml
    │   └── OBJ.mp3
    ├── MODS.xml
    └── structure.xml

And all parent-child relationships were generated as expected.

A heads up for people testing this issue, the issue-2 branch changes the database schema, but (as @MarcusBarnes and I agreed we'd do at this early stage) there is no corresponding update hook in the .install file. Unistalling and then reinstalling the module will rebuild the db tables. If you have the devel contrib module installed, running drush devel-reinstall islandora_compound_batch will do this. Note that rebuilding the db tables will also delete all your batch sets.

MarcusBarnes commented 7 years ago

@mjordan @CadenArmstrong Using the suggestion from @mjordan, I've been able to remove the requirement of using unique directory_names for child objects. Top-level parent objects should still have unique names. You'll need to use the create_structure_files.php utility in extras/scripts to create the structure.xml files. If you are able to test, please let me know and give things a try by pulling the latest commits for the issue-2 branch. Thank you!

(@mjordan - If the above works satisfactorily, we may want to update the MIK CONTENTdm compound toolchain to take this into account: leave creating the structure.xml to the compound batch module.)

MarcusBarnes commented 7 years ago

N.B. Before testing, please disable, uninstall, then reinstall the islandora_compound_batch module.

mjordan commented 7 years ago

@MarcusBarnes I have no problem using create_structure_files.php to generation of the structure.xml files, but if the introduction of $state in the structure.xml files isn't inconsistent with the output of the MIK CONTENTdm compound toolchain, maybe we can leave it as is? Or maybe open an issue with MIK to note the change in the structure.xml and leave it for now? I'd appreciate hearing your thoughts.

MarcusBarnes commented 7 years ago

@mjordan The $state variable is a way of recording the depth via the parent directory name, allowing a way to have a unique value for the child content attribute. I'll need to double check how we are creating the structure.xml file in the CONTENTdm compound toolchain and see if it's simple to make it compatible with what create_structure_files.php utility does now.

MarcusBarnes commented 7 years ago

@mjordan Regarding the MIK CONTENTdm compound toolchain, it optionally outputs the CONTENTdm CPD file and saves it as "structure.xml". What I would suggest is that we update the MIK CONTENTdm compound writers to save the file using cpd.xml or some other name so that the create_strcuture_files.php utility included with Islandora Compound Batch can create the structure.xml file. This should resolve any compatibility issues and allow us to resolve this issue. Thoughts?

mjordan commented 7 years ago

@MarcusBarnes Perhaps we shouldn't build into MIK the ability to generate any structure files if the more general workflow is to use the included script. I'd be happy to remove this functionality completely from MIK. One less thing to keep synchronized. I can do the work of removing it.

MarcusBarnes commented 7 years ago

@mjordan That sounds good. As a double check, have you heard from anyone that they'd like to preserve the Cpd file when exporting from CONTENTdm (say, as a datastream in Islandora)? That would be the only reason for keeping the output of the Cpd file content in the MIK CONTENTdm compound tool chain (but removing the code that transforms the structure.xml file).

mjordan commented 7 years ago

As a double check, have you heard from anyone that they'd like to preserve the Cpd file when exporting from CONTENTdm (say, as a datastream in Islandora)?

I haven't. There are other ways of capturing the contents of the CONTENTdm .cpd file, such as the AddContentdmData metadata manipulator. If anyone wanted to add the file as a datastream they could write a simple post-write hook to dump it into the compound object's output directory.

MarcusBarnes commented 7 years ago

Addressed in 4c711d6.

mjordan commented 7 years ago

@MarcusBarnes I'll open a ticket at MIK to remove the generation of structure files.