Simple Archive toolchain

bondjimbond commented 5 years ago

I've got a DSpace repository to migrate, and aside from OAI-PMH, the most promising of their export options seems to be the Simple Archive format.

This provides a directory full of subdirectories, each subdirectory representing an object, containing the main object file (PDF or JPG or whatever), metadata, and various other datastreams (or "bitstreams" in DSpace terms).

At minimum, every directory will have a dublin_core.xml file and the main object file.

A sample export is attached.

Basically, the toolchain will need to loop through the subdirectories, and in each subdirectory, do the following:

Identify and extract the main object file based on extension.
Extract the dublin_core.xml file and transform it to MODS according to the user's mapping file.
Spit everything into one batch-ready directory with files renamed as appropriate - and make sure that any duplicates are rewritten so that they aren't overwriting (e.g. if two different objects are called "file.pdf" then make the next one "file_1.pdf" etc.)
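
A rough sketch of two of the steps above, picking the main file by extension and renaming duplicates. It is written in Python for brevity and is not MIK code; the extension list and the function names `find_main_file` and `unique_name` are assumptions made up for illustration.

```python
from pathlib import Path
from typing import Optional, Set

# Hypothetical set of extensions treated as "main object" files.
MAIN_EXTENSIONS = {".pdf", ".jpg", ".jpeg", ".tif", ".tiff"}


def find_main_file(item_dir: Path) -> Optional[Path]:
    """Return the first file in item_dir whose extension is in MAIN_EXTENSIONS."""
    for path in sorted(item_dir.iterdir()):
        if path.is_file() and path.suffix.lower() in MAIN_EXTENSIONS:
            return path
    return None


def unique_name(filename: str, taken: Set[str]) -> str:
    """Return filename unchanged if unused, else add _1, _2, ... before the extension."""
    stem, suffix = Path(filename).stem, Path(filename).suffix
    candidate = filename
    counter = 1
    while candidate in taken:
        candidate = f"{stem}_{counter}{suffix}"
        counter += 1
    # Remember the name so later objects cannot overwrite this one.
    taken.add(candidate)
    return candidate
```

Calling `unique_name("file.pdf", taken)` repeatedly with the same `taken` set yields `file.pdf`, then `file_1.pdf`, then `file_2.pdf`.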

partial_export_2019_Jun_04_3_5.zip

MarcusBarnes commented 5 years ago

Thank you @bondjimbond for the use-case. If there are any other groups interested in using this kind of toolchain to migrate from DSpace, please chime in the comments so that I can gauge interest. Thank you in advance.

mjordan commented 5 years ago

Looking at the Simple Archive Format now. You will likely need to write all of the components in a new toolchain as described at https://github.com/MarcusBarnes/mik/wiki/Information-for-developers#writingcomponents.

The fetcher would get all of the directories that are immediate children of the archive_directory, and grab the DC file from within each item directory
You might be able to use an existing metadata parser because you've got DC in each directory
The filegetter would need to be able to grab each of the files named in contents (if I am understanding what is in contents)
The writer would need to take all of these things and output Islandora ingest packages.

This may sound like a lot, but it's really just creating a new class file for each of those components, implementing the required methods. If you can reuse the DC metadata parser, you've saved some work there.

bondjimbond commented 5 years ago

@mjordan @MarcusBarnes It's starting to look like this is going to be the best approach... their implementation of OAI is too weird for MIK to handle as-is.

I don't see a DC-to-MODS metadata parser. Would creating a Twig template make sense here?

This is a sample DC XML file:

<?xml version="1.0" encoding="utf-8" standalone="no"?>
<dublin_core schema="dc">
  <dcvalue element="contributor" qualifier="author" language="">Hayman,&#x20;Richard</dcvalue>
  <dcvalue element="date" qualifier="accessioned" language="">2014-02-13T20:44:02Z</dcvalue>
  <dcvalue element="date" qualifier="available" language="">2014-02-13T20:44:02Z</dcvalue>
  <dcvalue element="date" qualifier="issued" language="">2009</dcvalue>
  <dcvalue element="identifier" qualifier="citation" language="en_US">Hayman,&#x20;R.&#x20;(2009).&#x20;Human&#x20;rights&#x20;software:&#x20;Information&#x20;support&#x20;solutions&#x20;for&#x20;social&#x20;justice.&#x20;Information&#x20;for&#x20;Social&#x20;Change,&#x20;29,&#x20;44-67.</dcvalue>
  <dcvalue element="identifier" qualifier="issn" language="">1756-901X</dcvalue>
  <dcvalue element="identifier" qualifier="uri" language="">http:&#x2F;&#x2F;hdl.handle.net&#x2F;11205&#x2F;98</dcvalue>
  <dcvalue element="description" qualifier="abstract" language="en_US">Human&#x20;rights&#x20;centres&#x20;and&#x20;non-governmental&#x20;organizations&#x20;(NGOs)&#x20;have&#x20;crucial&#x20;information&#x20;support&#x20;needs,&#x20;many&#x20;of&#x20;which&#x20;can&#x20;be&#x20;met&#x20;by&#x20;the&#x20;existing&#x20;and&#x20;ongoing&#x20;development&#x20;of&#x20;information&#x20;technology&#x20;software&#x20;applications.&#x20;For&#x20;communication&#x20;and&#x20;Internet&#x20;use,&#x20;the&#x20;psiphon&#x20;program&#x20;allows&#x20;for&#x20;secure&#x20;and&#x20;anonymous&#x20;information&#x20;exchange&#x20;and&#x20;distribution,&#x20;including&#x20;firewall&#x20;circumvention.&#x20;For&#x20;data&#x20;collection,&#x20;organization,&#x20;encryption,&#x20;and&#x20;storage,&#x20;Martus&#x20;software&#x20;can&#x20;be&#x20;deployed&#x20;to&#x20;help&#x20;protect&#x20;sensitive&#x20;information&#x20;and&#x20;identities.&#x20;Based&#x20;on&#x20;documented&#x20;projects&#x20;and&#x20;websites,&#x20;the&#x20;following&#x20;research&#x20;examines&#x20;these&#x20;emancipatory&#x20;tools&#x20;to&#x20;determine:&#x20;the&#x20;technologies&#x20;in&#x20;use,&#x20;emergent,&#x20;and&#x20;under&#x20;development;&#x20;their&#x20;possible&#x20;usage&#x20;in&#x20;the&#x20;critical&#x20;arenas&#x20;under&#x20;discussion;&#x20;and,&#x20;the&#x20;greater&#x20;effects&#x20;of&#x20;these&#x20;technologies&#x20;as&#x20;they&#x20;relate&#x20;to&#x20;social&#x20;justice&#x20;and&#x20;information&#x20;access&#x20;in&#x20;the&#x20;global&#x20;information&#x20;society.&#x20;The&#x20;purpose&#x20;is&#x20;to&#x20;raise&#x20;awareness&#x20;within&#x20;human&#x20;rights&#x20;communities&#x20;and&#x20;information&#x20;centres&#x20;about&#x20;the&#x20;existence&#x20;and&#x20;availability&#x20;of&#x20;these&#x20;tools,&#x20;so&#x20;that&#x20;these&#x20;groups&#x20;may&#x20;find&#x20;appropriate&#x20;and&#x20;accessible&#x20;solutions&#x20;that&#x20;match&#x20;their&#x20;information&#x20;support&#x20;needs.&#x20;Further,&#x20;it&#x20;is&#x20;hoped&#x20;that&#x20;the&#x20;information&#x20;presented&#x20;here&#x20;will&#x20;generate&#x20;open,&#x20;intercultural,&#x20;and&#x20;international&#x20;discussions&#x20;of&#x20;human&#x20;rights&#x20;policy&#x20;development,&#x20;strategic&#x20;planning,&#x20;and&#x20;implementation.</dcvalue>
  <dcvalue element="description" qualifier="provenance" language="en">Submitted&#x20;by&#x20;Richard&#x20;Hayman&#x20;(rhayman@mtroyal.ca)&#x20;on&#x20;2014-02-13T20:44:02Z&#x0D;&#x0A;No.&#x20;of&#x20;bitstreams:&#x20;2&#x0D;&#x0A;Human&#x20;Rights&#x20;Software.pdf:&#x20;162558&#x20;bytes,&#x20;checksum:&#x20;faff81a56afb49c3924691f538d9fd5e&#x20;(MD5)&#x0D;&#x0A;license_rdf:&#x20;1232&#x20;bytes,&#x20;checksum:&#x20;dffe24314e72b44d3936efcc12015d3f&#x20;(MD5)</dcvalue>
  <dcvalue element="description" qualifier="provenance" language="en">Made&#x20;available&#x20;in&#x20;DSpace&#x20;on&#x20;2014-02-13T20:44:02Z&#x20;(GMT).&#x20;No.&#x20;of&#x20;bitstreams:&#x20;2&#x0D;&#x0A;Human&#x20;Rights&#x20;Software.pdf:&#x20;162558&#x20;bytes,&#x20;checksum:&#x20;faff81a56afb49c3924691f538d9fd5e&#x20;(MD5)&#x0D;&#x0A;license_rdf:&#x20;1232&#x20;bytes,&#x20;checksum:&#x20;dffe24314e72b44d3936efcc12015d3f&#x20;(MD5)&#x0D;&#x0A;&#x20;&#x20;Previous&#x20;issue&#x20;date:&#x20;2009</dcvalue>
  <dcvalue element="language" qualifier="iso" language="en_US">en</dcvalue>
  <dcvalue element="publisher" qualifier="none" language="en_US">Information&#x20;for&#x20;Social&#x20;Change</dcvalue>
  <dcvalue element="rights" qualifier="none" language="*">Attribution-NonCommercial-NoDerivs&#x20;2.5&#x20;Canada</dcvalue>
  <dcvalue element="rights" qualifier="uri" language="*">http:&#x2F;&#x2F;creativecommons.org&#x2F;licenses&#x2F;by-nc-nd&#x2F;2.5&#x2F;ca&#x2F;</dcvalue>
  <dcvalue element="subject" qualifier="none" language="en_US">Human&#x20;rights</dcvalue>
  <dcvalue element="subject" qualifier="none" language="en_US">Social&#x20;justice</dcvalue>
  <dcvalue element="subject" qualifier="none" language="en_US">Librarianship</dcvalue>
  <dcvalue element="title" qualifier="none" language="en_US">Human&#x20;Rights&#x20;Software:&#x20;Information&#x20;Support&#x20;Solutions&#x20;For&#x20;Social&#x20;Justice</dcvalue>
  <dcvalue element="type" qualifier="none" language="en_US">Article</dcvalue>
  <dcvalue element="publisher" qualifier="uri" language="en_US">http:&#x2F;&#x2F;libr.org&#x2F;isc&#x2F;</dcvalue>
  <dcvalue element="metadata" qualifier="review" language="">Edited,&#x20;TR</dcvalue>
</dublin_core>

bondjimbond commented 5 years ago

@mjordan So trying to figure out what is needed for a new toolchain like this... I'm not 100% certain what each component does, so this might be inaccurate.

We'll need a new Fetcher for sure... Perhaps the Filesystem fetcher can be tweaked? It will need to get a list of subdirectories, use the subdirectory names as identifiers, and fetch the dublin_core.xml file (and rename it to the identifier I guess?)

Metadata Parser: Could I just use the Templated one? But will need to figure out how to turn something like dcvalue element="contributor" qualifier="author" into a Twig variable.

Filegetter: I'm not sure whether the CONTENTS file is the way to go or not. Here's a sample of the contents of a CONTENTS file:

5 Gamification Paradigm.pdf bundle:ORIGINAL
license_rdf bundle:CC-LICENSE
license.txt bundle:LICENSE
5 Gamification Paradigm.pdf.txt bundle:TEXT description:Extracted text

The license files can be ignored. The important files are the PDF (or other filetype) and the TXT.
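
One way to pick out the important files is to parse each CONTENTS line into a filename plus its `key:value` fields and then filter on the bundle. The sketch below is Python rather than MIK code, and it rests on assumptions: that the only field keys are `bundle`, `description`, `permissions` and `primary`, and that fields are separated by whitespace (tabs or spaces). The function names are hypothetical.

```python
import re

# Split only on whitespace that is followed by a known field key, so that
# spaces inside filenames ("5 Gamification Paradigm.pdf") are preserved.
# The list of keys is an assumption based on the sample above.
_FIELD_SPLIT = re.compile(r"\s+(?=(?:bundle|description|permissions|primary):)")


def parse_contents_line(line: str) -> dict:
    """Turn one CONTENTS line into {'filename': ..., 'bundle': ..., ...}."""
    parts = _FIELD_SPLIT.split(line.strip())
    entry = {"filename": parts[0]}
    for field in parts[1:]:
        key, _, value = field.partition(":")
        entry[key] = value
    return entry


def wanted_files(contents_text: str, bundles=("ORIGINAL", "TEXT")) -> list:
    """Return parsed entries whose bundle is in `bundles`, skipping license files."""
    entries = [
        parse_contents_line(line)
        for line in contents_text.splitlines()
        if line.strip()
    ]
    return [entry for entry in entries if entry.get("bundle") in bundles]
```

Run against the four sample lines, this keeps the PDF and the extracted-text file and drops both license entries.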

We also have objects with multiple files in them, which look like they'll be compound objects... That's more complicated.

Writer: Do we need a new writer, then? I'm not sure how these things really work.

mjordan commented 5 years ago

@bondjimbond admittedly, the required methods, etc. of each type of class are not thoroughly documented, but the "Required properties and functions in subclassed components" section of https://github.com/MarcusBarnes/mik/wiki/Information-for-developers (which you probably have already seen) does provide some detail.

A twig template for the metadata parser is a good idea, but since you're getting XML from the source, maybe an XSLT (DC->MODS) would be a better fit?

bondjimbond commented 5 years ago

@mjordan Yes, so long as it's good enough to meet our requirements. LoC transforms are often inadequate.

This is the latest one: https://www.loc.gov/standards/mods/v3/DC_MODS3-5_XSLT1-0.xsl

Have we got a way in MIK to apply an XSLT to a metadata file?

mjordan commented 5 years ago

I can help with that.

bondjimbond commented 5 years ago

Thanks. :) I think I'm going to need a lot of help developing this..

mjordan commented 5 years ago

@bondjimbond will this XSL, or some modified version of it, do the trick? http://www.loc.gov/standards/mods/v3/DC_MODS3-5_XSLT1-0.xsl

bondjimbond commented 5 years ago

Yeah, that's the plan. We should leave it open to customization (allow user to point to a locally-stored XSLT rather than hard-code the remote one), but that's a good default.

mjordan commented 5 years ago

OK, let me whip up the start of a metadata parser that takes in a DC file and writes out a corresponding MODS file.

bondjimbond commented 5 years ago

Awesome, thanks!

mjordan commented 5 years ago

@bondjimbond bad news - that LoC stylesheet won't work with the sample DC file you provided above.

bondjimbond commented 5 years ago

@mjordan I'm not surprised; I expect it will need some modification.

First step is to get the parser working on a standard DC document: https://arcabc.ca/islandora/object/dc%3A34849/datastream/MODS/view

I should be able to get my hands on the transform file that they use to create their OAI output; that will be the "custom" one that we should be able to drop in.

mjordan commented 5 years ago

I've got an Xslt.php metadata parser. It is based on the Templated.php parser, but instead of using Twig to generate the output, it uses a stylesheet. It doesn't hard-code DC or MODS, so it could be used with any stylesheet. It's hard to test it in use without the other toolchain components though. @bondjimbond have you done any work on the fetcher or filegetter yet? Want me to share my parser so you can have a look? Not sure what the best way to proceed is.

bondjimbond commented 5 years ago

Please share your parser. I have looked at the fetcher and filegetter files, but I haven't tried any code yet... still trying to understand how they work.

mjordan commented 5 years ago

OK, it's attached. It requires a new .ini setting: ['METADATA_PARSER']['stylesheet']. I haven't tested it within a toolchain yet, but all the parts are there. One thing I'm not sure about is whether we'll need to register namespaces to use it on some XML files. If so, I'd say that the .ini file is the place to register them, so it's extensible beyond DC->MODS. But maybe we could hard code them for now to get going.

To use this, put it in the mik\metadataparsers\templated directory. The [METADATA_PARSER] section of your .ini file should look like this:

[METADATA_PARSER]
class = templated\Xslt
stylesheet = mystylesheet.xsl

Xslt.php.txt

mjordan commented 5 years ago

Whoops, use this one.

Xslt.php.txt

mjordan commented 5 years ago

@bondjimbond the fetcher will need to extract the Simple Archive Format zip and iterate through each directory. The record_key for each item should be the directory name. The fetcher's getNumRecs method should return the number of item subdirectories. Its getRecords method will return an array of all the dublin_core.xml documents, keyed using the record_key.
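
The fetcher behaviour described above can be sketched roughly as follows. This is Python for illustration only, not the MIK fetcher interface: `get_records` and `get_num_recs` merely mirror the roles of getRecords and getNumRecs, the archive is assumed to be already unzipped, and skipping directories that lack a dublin_core.xml is an assumption.

```python
from pathlib import Path


def get_records(archive_directory: str) -> dict:
    """Map each item directory name (the record key) to its dublin_core.xml text."""
    records = {}
    for item_dir in sorted(Path(archive_directory).iterdir()):
        dc_file = item_dir / "dublin_core.xml"
        # Only immediate child directories that contain a DC file count as items.
        if item_dir.is_dir() and dc_file.is_file():
            records[item_dir.name] = dc_file.read_text(encoding="utf-8")
    return records


def get_num_recs(archive_directory: str) -> int:
    """Return the number of item subdirectories found by get_records."""
    return len(get_records(archive_directory))
```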

Your filegetter's getChildren method will return an empty array (no children), and its getFilePath method will return the path to the PDF, etc. I think this would be the path to the file identified in the 'contents' manifest line that contains bundle:ORIGINAL.

The writer writes out the files generated/retrieved by the other toolchain components. I think you should be able to create a writer based on the existing CsvSingleFileJson.php writer class. You will need to adjust the details within https://github.com/MarcusBarnes/mik/blob/master/src/writers/CsvSingleFileJson.php#L88-L104, but everything from https://github.com/MarcusBarnes/mik/blob/master/src/writers/CsvSingleFileJson.php#L88-L104 should work as is. The writeMetadataFile method should work as is too, I think. Of course, rename your class in https://github.com/MarcusBarnes/mik/blob/master/src/writers/CsvSingleFileJson.php#L13 from 'CsvSingleFileJson' to 'SimpleArchiveFormat' and make sure the PHP file has the same name.

bondjimbond commented 5 years ago

@mjordan Thanks. I think I can skip the zip part and process as an unzipped directory tree to begin with, till everything else is complete.

So for the Fetcher: it looks like the Filesystem fetcher might be a good place to start. But I'm having trouble grasping what it's doing, exactly.

The record_key for each item should be the directory name

So in https://github.com/MarcusBarnes/mik/blob/master/src/fetchers/Filesystem.php#L40-L54:

it looks like the code as-is should already be taking the directory name as the record_key; is that right?

And then in https://github.com/MarcusBarnes/mik/blob/master/src/fetchers/Filesystem.php#L93-L103:

Would this do the trick to acquire the DC files?

        $files_with_name = $this->input_directory . DIRECTORY_SEPARATOR . $record_key . DIRECTORY_SEPARATOR . "dublin_core.xml";

bondjimbond commented 5 years ago

I think you should be able to create a writer based on the existing CsvSingleFileJson.php writer class. You will need to adjust the details within https://github.com/MarcusBarnes/mik/blob/master/src/writers/CsvSingleFileJson.php#L88-L104, but everything from https://github.com/MarcusBarnes/mik/blob/master/src/writers/CsvSingleFileJson.php#L88-L104 should work as is.

@mjordan Those are the same lines... Which are the ones that need changing?

I'm working on some of these, but operating blind until I test them, I guess. I may have some questions for you soon.

bondjimbond commented 5 years ago

@mjordan Well I'm giving it a shot, but I do foresee a lot of failure.

First attempt crashed out on the Fetcher with a very unhelpful error message:

Fatal error: Uncaught mik\exceptions\MikErrorException in /Users/brandon/sfuvault/mik/mik:105
Stack trace:
#0 /Users/brandon/sfuvault/mik/src/fetchers/Filesystem_Subdirectories.php(61): {closure}(8, 'Undefined varia...', '/Users/brandon/...', 61, Array)
#1 /Users/brandon/sfuvault/mik/src/fetchers/Filesystem_Subdirectories.php(81): mik\fetchers\Filesystem_Subdirectories->getRecords()
#2 /Users/brandon/sfuvault/mik/mik(131): mik\fetchers\Filesystem_Subdirectories->getNumRecs()
#3 {main}
  thrown in /Users/brandon/sfuvault/mik/mik on line 105

Filesystem_Subdirectories.php.txt mods.xsl.txt mru_xsl.ini.txt SimpleArchive.php.filegetter.txt SimpleArchive.php.writer.txt

mjordan commented 5 years ago

@bondjimbond I've done a bit of work to these and am attaching a zip containing my versions. Your .ini should work with them. This toolchain produces output, but the .xml files are not being created properly. The stylesheet is not transforming the dublin core in each item's dublin_core.xml file to MODS.

mik501code.zip

bondjimbond commented 5 years ago

Thanks, @mjordan! I don't get output with these, but I do get errors that produce possibly helpful direction.

[2019-09-03 15:24:24] ErrorException.ERROR: ErrorException {"message":"file_get_contents(mods.xsl): failed to open stream: No such file or directory",
...
"file":"/Users/brandon/sfuvault/mik/src/metadataparsers/templated/Xslt.php","line":78} []

At line 78 I have this code: $xslt = file_get_contents($this->xslt_stylesheet_path);

In my .ini file, I have: stylesheet = "mru/mods.xsl"

I had originally had this without quotes, but it's the same result. The actual mods.xsl file is in mik/mru/mods.xsl.

Is there something wrong with my path, perhaps?

bondjimbond commented 5 years ago

Update: It's not the path. Same error even if I set the path to stylesheet = "/users/Brandon/sfuvault/mik/mru/mods.xsl"

bondjimbond commented 5 years ago

Update again: Problem seems to be the code... $xslt = file_get_contents($this->xslt_stylesheet_path); kills the whole function (tested with print statements before and after). But $xslt = file_get_contents("/users/Brandon/sfuvault/mik/mru/mods.xsl"); actually produces results (albeit problematic ones).

bondjimbond commented 5 years ago

One more update:

I tried using standard DC files and the standard LOC transform. If I hard-code line 78 of Xslt.php to the standard DC-to-MODS XSL, it works!

So there are two tasks ahead...

For the toolchain, make it accept a transform file from the .ini settings
For my own stuff, make the DC and XSLT match

mjordan commented 5 years ago

@bondjimbond can you make the following change to see if it resolves the path issue? Change file_get_contents($this->xslt_stylesheet_path); to file_get_contents(realpath($this->xslt_stylesheet_path)); and then try to use stylesheet = "mru/mods.xsl" in your config?

bondjimbond commented 5 years ago

@mjordan New error: "Filename cannot be empty" (referring to same line)

mjordan commented 5 years ago

Hm, it worked for me. It is possible I zipped up the wrong file. Anyway, my laptop is at home so I won't be able to investigate until this evening. Sounds like we're making progress though.

mjordan commented 5 years ago

Just opened the zip to take a look. Problem is in line 20 of the metadataparser. Can you change it to $this->xslt_stylesheet_path = realpath($settings['METADATA_PARSER']['stylesheet']);?
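
For context, PHP's realpath() returns false when a path cannot be resolved, which is consistent with the "Filename cannot be empty" error reported above. Resolving the stylesheet setting once, at configuration time, and failing loudly if the file is missing avoids that kind of silent empty path. The sketch below shows the same idea in Python; it is not the Xslt.php code, `load_stylesheet_path` is a hypothetical name, and stripping the surrounding quotes is needed only because Python's configparser keeps them.

```python
import configparser
from pathlib import Path


def load_stylesheet_path(ini_path: str) -> Path:
    """Read [METADATA_PARSER] stylesheet from an .ini file and resolve it."""
    # Interpolation is disabled so that '%' in a path is taken literally.
    config = configparser.ConfigParser(interpolation=None)
    if not config.read(ini_path):
        raise FileNotFoundError(f"Cannot read config file: {ini_path}")
    raw = config["METADATA_PARSER"]["stylesheet"].strip().strip('"')
    # Relative paths are resolved against the current working directory.
    path = Path(raw).resolve()
    if not path.is_file():
        raise FileNotFoundError(f"Stylesheet not found: {raw} (resolved to {path})")
    return path
```

With this approach, a setting such as `stylesheet = "mru/mods.xsl"` only works when the tool is run from the directory that contains `mru/`, which may be worth stating in the toolchain documentation.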

bondjimbond commented 5 years ago

Awesome, that worked! Thanks!

So I think this means the toolchain is pretty much ready to go -- except that my XSLTs don't work for the dublin_core.xml files that I have.

Guess it's time to figure out how to write custom ones...

Or could these files also be used with Twig, hopefully without modification?

mjordan commented 5 years ago

Nice!

To avoid using XSLT and just use plain old Twig, you'd have to parse out the values from the incoming DC XML. Not sure if doing that is preferable to tweaking the existing XSLT stylesheet.
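
Parsing out the values could look something like the sketch below, which groups each dcvalue by element and qualifier so the result could be handed to a template as variables. It is Python for illustration, not a MIK metadata parser; the function name and the `element_qualifier` key convention are assumptions (an underscore is used because a dot would read as attribute access in Twig).

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def dc_to_variables(dc_xml: str) -> dict:
    """Group dcvalue text under 'element_qualifier', or 'element' if unqualified."""
    values = defaultdict(list)
    for node in ET.fromstring(dc_xml).iter("dcvalue"):
        element = node.get("element")
        qualifier = node.get("qualifier") or "none"
        key = element if qualifier == "none" else f"{element}_{qualifier}"
        # The XML parser already decodes character references such as &#x20;.
        values[key].append((node.text or "").strip())
    return dict(values)
```

Every key maps to a list, so repeated elements such as subject keep all of their values.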

mjordan commented 5 years ago

I'd be willing to try to tweak the XSLT if you want.

bondjimbond commented 5 years ago

That would be extremely helpful, thanks. The XSL is attached.

mods from MRU.xsl.txt

It looks like the XSLT we got from MRU is based on a different DC record than the one that comes from the Simple Archive format. Simple Archive DC elements look like this:

<dcvalue element="contributor" qualifier="author" language="">Salyers, Vince</dcvalue>

instead of something simple like:

<dc:contributor>Salyers, Vince</dc:contributor>

The benefit of the Simple Archive format is that it includes the qualifiers that are lost with basic DC, like the "author" qualifier.

mjordan commented 5 years ago

Can I use any of the DC XML files in the sample simple archive you sent me earlier?

bondjimbond commented 5 years ago

Yep! They're all real data.

mjordan commented 5 years ago

It's not perfect (for example, it only picks up a single subject) but it's a good start.

mods.xsl.txt

bondjimbond commented 5 years ago

That's fantastic. Thanks, @mjordan! I'll see if I can figure out how to get all instances of a given element.

mjordan commented 5 years ago

Wrapping the rule in a for-each is the standard way but it doesn't seem to be working here.

bondjimbond commented 5 years ago

I'm just looking at one of the files (ajp-v4-id1057.xml), but it seems to be handling the multiple subjects just fine.

I am seeing a weird problem, though, with <dcvalue element="publisher" qualifier="none" language="en_US"> getting mapped both to <mods:publisher> and <mods:genre>, and <dcvalue element="publisher" qualifier="uri" language="en_US"> mapping just to <mods:genre>.

I'll do some tweaking.

bondjimbond commented 5 years ago

@mjordan Regarding your subject problem - here's where you're going wrong:

<xsl:for-each select="dcvalue[@element='subject' and @qualifier='none' or qualifier='lcsh']">

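Mixing an AND and an OR in one predicate without parentheses is risky in XPath: "and" binds more tightly than "or", so this doesn't group the way you'd expect. Putting them in two separate select statements sidesteps that.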

I'll have this finished soon and upload a new XSL file just for the record.

bondjimbond commented 5 years ago

Also, the line omits the @ in @qualifier='lcsh'
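
Assuming the intent was "subject elements whose qualifier is either none or lcsh", the grouping can be made explicit. In XPath 1.0 that would presumably be a parenthesized predicate such as `dcvalue[@element='subject' and (@qualifier='none' or @qualifier='lcsh')]`; without the `@`, `qualifier` refers to a child element rather than the attribute. The sketch below expresses the same selection in Python rather than XSLT, with a hypothetical function name, and it may or may not be the cause of the behaviour reported later in the thread.

```python
import xml.etree.ElementTree as ET


def select_subjects(dc_xml: str, qualifiers=("none", "lcsh")) -> list:
    """Return the text of every subject dcvalue whose qualifier is in `qualifiers`."""
    root = ET.fromstring(dc_xml)
    return [
        (node.text or "").strip()
        for node in root.iter("dcvalue")
        # Both tests must hold: the element is 'subject' AND the qualifier is allowed.
        if node.get("element") == "subject" and node.get("qualifier") in qualifiers
    ]
```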

bondjimbond commented 5 years ago

Update: Nope, that's not the only problem. Sigh. More digging to do...

bondjimbond commented 5 years ago

Here's my updated XSLT -- everything working except, it seems, subjects.

Subjects with qualifier="none" work just fine, while subjects with qualifier="lcsh" are no good. I can't figure out why.

mods_fixed.xsl.txt

bondjimbond commented 5 years ago

And if I just remove the attribute stuff from the subject entry, it works with no problems. Bizarre.

mjordan commented 5 years ago

Ha, that's XSLT in a nutshell. I'm glad you got it working.

Once you're happy with the new toolchain components (and of course you have the stylesheet you need), let's review what we have and see if it's consistent with the other toolchains. Then, someone should document the new components... any opinions on any of this @MarcusBarnes ?

bondjimbond commented 5 years ago

Update again: I have no idea why this worked, but behold a working XSLT with subjects and authorities!

mods_allgood.xsl.txt

Will put the toolchain files together in a ZIP next and let you review. Or should this just go in a PR?

mjordan commented 5 years ago

I'll review first, then you can open a PR.

mjordan commented 5 years ago

Nice work BTW.

MarcusBarnes / mik

Simple Archive toolchain #501