FLVC / flvc

FLVC-specific Islandora Hooks

ZIP importer and MODS "replace" functions accept invalid MODS without complaint #45

Closed lydiam closed 5 months ago

lydiam commented 8 years ago

Mike discovered earlier today that a few serials records created during the DigiTool migration contain invalid MODS files, so I did a bit of testing on the islandora-test server and found that both the ZIP importer and the "replace" function for the MODS datastream load invalid MODS files without complaint.

An example on islandora-test, loaded via ZIP loader: https://islandora-test.digital.flvc.org/islandora/object/islandora-test%3A21435

If you download the MODS file and pass it through a validator, you'll find that it's well-formed XML but invalid MODS. (I believe that XML that's not well-formed will produce an error.)
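To make the well-formed vs. valid distinction concrete, here is a minimal Python sketch. It is not how Islandora or the ZIP importer checks anything, and it is not real schema validation: `check_mods` and the `TOP_LEVEL` set are hypothetical stand-ins that only test whether the file parses, whether the root is `mods` in the MODS v3 namespace, and whether each top-level child has a known MODS element name. A proper check would validate against the MODS XSD.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"

# Stand-in for the schema: names of top-level MODS elements only.
# The real MODS XSD also constrains attributes and nested content.
TOP_LEVEL = {
    "titleInfo", "name", "typeOfResource", "genre", "originInfo",
    "language", "physicalDescription", "abstract", "tableOfContents",
    "targetAudience", "note", "subject", "classification", "relatedItem",
    "identifier", "location", "accessCondition", "part", "extension",
    "recordInfo",
}


def check_mods(xml_text):
    """Return 'not well-formed', 'invalid', or 'ok' (simplified check)."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # The parser itself rejects XML that is not well-formed.
        return "not well-formed"
    if root.tag != "{%s}mods" % MODS_NS:
        return "invalid"
    for child in root:
        # Namespaced tags look like '{namespace}localname'.
        if not child.tag.startswith("{%s}" % MODS_NS):
            return "invalid"
        local_name = child.tag.split("}", 1)[1]
        if local_name not in TOP_LEVEL:
            return "invalid"
    return "ok"
```

A file with a misspelled element such as `<titel>` parses fine, so a loader that only parses the XML would accept it, while this check reports it as invalid.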

Similarly, when invalid MODS was uploaded via the "replace" function it was accepted. See: https://islandora-test.digital.flvc.org/islandora/object/islandora-test%3A19994. (We've known that the replace function doesn't check for PURLs or duplicate IIDs, but we've had higher-priority issues to deal with. Invalid MODS, however, is an urgent issue.)

We'll need to do more testing and will need to report this to Islandora as well. I'm not sure how we can identify all cases of existing invalid MODS already uploaded.

lydiam commented 8 years ago

Note that the ExceltoMODS transformer service does perform MODS validation, so any MODS files created by that service are valid.

lydiam commented 8 years ago

From Islandora documentation:

"XML Schema XML schemas are used to validate XML documents. The XML document is compared to a particular schema in order to test its validity in a specific context. In Islandora the metadata schemas are frequently used by XML Forms to create and validate ingest forms."

https://wiki.duraspace.org/display/ISLANDORA714/APPENDIX+E+-+Glossary

So it looks like metadata schema validation can somehow be wrapped into Forms: does anyone know how?

mdemers commented 8 years ago

I'm not positive this means that it checks the file against the schema upon submission, but I believe I know where the schema is referenced in the form builder. Right now the Schema field on our forms is blank.

[Screenshot: edit form on islandora-test, 2015-10-23]

mdemers commented 8 years ago

This is more specific about how to reference the schema: https://wiki.duraspace.org/pages/viewpage.action?pageId=64326582#HowtoEdit/CreateIngestForms-SettingFormProperties

I'll see what happens when it's included.

mdemers commented 8 years ago

Ugh. I don't think adding that does anything on its own. There's more defined here: https://wiki.duraspace.org/pages/viewpage.action?pageId=64326582#HowtoEdit/CreateIngestForms-Addingformfields

On each element in the form there are Schema fields for both the Create and Update functions. I never paid much attention to them before. The tooltip given: Schema: "An XPath to the definition of this element's parent. The XPath is executed in the schema defined in this form's properties. This is used to determine the insert order for this element"

If this means what I think it means, I'll have to enter an XPath that digs down into the XSD document I've defined in the form properties. I can't find any examples of this implemented in any Islandora forms and I'm not quite sure how it works in practice.
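Purely as an illustration of what "an XPath executed in the schema" could mean, here is a Python sketch that runs an XPath against a toy XSD and reads off the child order of the matched definition. Everything here is an assumption: the toy schema is not the real MODS XSD, `child_order` is a made-up helper, and I have not confirmed that XML Forms expects XPaths of this shape.

```python
import xml.etree.ElementTree as ET

XS = {"xs": "http://www.w3.org/2001/XMLSchema"}

# Toy schema for illustration only; the real MODS XSD is structured differently.
TOY_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="titleInfoDefinition">
    <xs:sequence>
      <xs:element name="nonSort"/>
      <xs:element name="title"/>
      <xs:element name="subTitle"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>"""


def child_order(xsd_text, parent_xpath):
    """Return the child element names, in schema order, for the definition
    that parent_xpath selects inside the XSD."""
    schema = ET.fromstring(xsd_text)
    parent = schema.find(parent_xpath, XS)
    if parent is None:
        raise LookupError("XPath matched nothing in the schema: " + parent_xpath)
    return [el.get("name") for el in parent.findall("xs:sequence/xs:element", XS)]


# The XPath points at the parent's definition; the order of its children
# is what would decide where a new element gets inserted.
order = child_order(TOY_XSD, "xs:complexType[@name='titleInfoDefinition']")
```

With the toy schema above, `order` lists `nonSort`, `title`, `subTitle`, so a `subTitle` added through a form would be inserted after an existing `title`.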

On top of that there's also a section called "Validate" in the advanced options where you can specify "functions". No further explanation given.

This could be painful.

wrandtkeflvc commented 7 years ago

I believe the current status is that the ZIP loader will accept invalid MODS, but the replace function will not.

My understanding is that the ZIP loader is designed to fail gracefully and try to load a whole set of items even if some items are bad. Implementing a fix would probably have to involve a first step of checking the MODS for all items in the load and rejecting the entire load with an error message if any MODS were invalid. That could frustrate users when a load stalls near the beginning, but it's probably better than the current situation of introducing hidden problems that bubble up later.
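The "check everything first, then load" idea could be sketched roughly like this in Python. This is not the ZIP loader's code: `preflight`, `load_batch`, and the `ingest` callback are hypothetical names, and the default `is_well_formed` check is only a placeholder where real validation against the MODS schema would go.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_bytes):
    """Placeholder validator: only checks that the XML parses.
    A real pre-flight step would validate against the MODS schema."""
    try:
        ET.fromstring(xml_bytes)
        return True
    except ET.ParseError:
        return False


def preflight(items, validate=is_well_formed):
    """items maps item name -> MODS bytes. Return the names that fail."""
    return sorted(name for name, mods in items.items() if not validate(mods))


def load_batch(items, ingest, validate=is_well_formed):
    """Reject the whole batch if any item fails; otherwise ingest all items."""
    bad = preflight(items, validate)
    if bad:
        # Nothing has been ingested yet, so no partial load is left behind.
        raise ValueError("Load rejected; invalid MODS in: " + ", ".join(bad))
    for name, mods in items.items():
        ingest(name, mods)
    return len(items)
```

Because the error message names every failing item at once, the user can fix them all before retrying, which softens the frustration of a load that stops at the start.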