Closed bondjimbond closed 4 years ago
Thank you @bondjimbond for the use-case. If there are any other groups interested in using this kind of toolchain to migrate from DSpace, please chime in the comments so that I can gauge interest. Thank you in advance.
Looking at the Simple Archive Format now. You will likely need to write all of the components in a new toolchain as described at https://github.com/MarcusBarnes/mik/wiki/Information-for-developers#writingcomponents.
archive_directory, and grab the DC file from within each item directory
contents (if I am understanding what is in contents)
This may sound like a lot, but it's really just creating a new class file for each of those components and implementing the required methods. If you can reuse the DC metadata parser, you've saved some work there.
@mjordan @MarcusBarnes It's starting to look like this is going to be the best approach... their implementation of OAI is too weird for MIK to handle as-is.
I don't see a DC-to-MODS metadata parser. Would creating a Twig template make sense here?
This is a sample DC XML file:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<dublin_core schema="dc">
<dcvalue element="contributor" qualifier="author" language="">Hayman, Richard</dcvalue>
<dcvalue element="date" qualifier="accessioned" language="">2014-02-13T20:44:02Z</dcvalue>
<dcvalue element="date" qualifier="available" language="">2014-02-13T20:44:02Z</dcvalue>
<dcvalue element="date" qualifier="issued" language="">2009</dcvalue>
<dcvalue element="identifier" qualifier="citation" language="en_US">Hayman, R. (2009). Human rights software: Information support solutions for social justice. Information for Social Change, 29, 44-67.</dcvalue>
<dcvalue element="identifier" qualifier="issn" language="">1756-901X</dcvalue>
<dcvalue element="identifier" qualifier="uri" language="">http://hdl.handle.net/11205/98</dcvalue>
<dcvalue element="description" qualifier="abstract" language="en_US">Human rights centres and non-governmental organizations (NGOs) have crucial information support needs, many of which can be met by the existing and ongoing development of information technology software applications. For communication and Internet use, the psiphon program allows for secure and anonymous information exchange and distribution, including firewall circumvention. For data collection, organization, encryption, and storage, Martus software can be deployed to help protect sensitive information and identities. Based on documented projects and websites, the following research examines these emancipatory tools to determine: the technologies in use, emergent, and under development; their possible usage in the critical arenas under discussion; and, the greater effects of these technologies as they relate to social justice and information access in the global information society. The purpose is to raise awareness within human rights communities and information centres about the existence and availability of these tools, so that these groups may find appropriate and accessible solutions that match their information support needs. Further, it is hoped that the information presented here will generate open, intercultural, and international discussions of human rights policy development, strategic planning, and implementation.</dcvalue>
<dcvalue element="description" qualifier="provenance" language="en">Submitted by Richard Hayman (rhayman@mtroyal.ca) on 2014-02-13T20:44:02Z
No. of bitstreams: 2
Human Rights Software.pdf: 162558 bytes, checksum: faff81a56afb49c3924691f538d9fd5e (MD5)
license_rdf: 1232 bytes, checksum: dffe24314e72b44d3936efcc12015d3f (MD5)</dcvalue>
<dcvalue element="description" qualifier="provenance" language="en">Made available in DSpace on 2014-02-13T20:44:02Z (GMT). No. of bitstreams: 2
Human Rights Software.pdf: 162558 bytes, checksum: faff81a56afb49c3924691f538d9fd5e (MD5)
license_rdf: 1232 bytes, checksum: dffe24314e72b44d3936efcc12015d3f (MD5)
  Previous issue date: 2009</dcvalue>
<dcvalue element="language" qualifier="iso" language="en_US">en</dcvalue>
<dcvalue element="publisher" qualifier="none" language="en_US">Information for Social Change</dcvalue>
<dcvalue element="rights" qualifier="none" language="*">Attribution-NonCommercial-NoDerivs 2.5 Canada</dcvalue>
<dcvalue element="rights" qualifier="uri" language="*">http://creativecommons.org/licenses/by-nc-nd/2.5/ca/</dcvalue>
<dcvalue element="subject" qualifier="none" language="en_US">Human rights</dcvalue>
<dcvalue element="subject" qualifier="none" language="en_US">Social justice</dcvalue>
<dcvalue element="subject" qualifier="none" language="en_US">Librarianship</dcvalue>
<dcvalue element="title" qualifier="none" language="en_US">Human Rights Software: Information Support Solutions For Social Justice</dcvalue>
<dcvalue element="type" qualifier="none" language="en_US">Article</dcvalue>
<dcvalue element="publisher" qualifier="uri" language="en_US">http://libr.org/isc/</dcvalue>
<dcvalue element="metadata" qualifier="review" language="">Edited, TR</dcvalue>
</dublin_core>
@mjordan So I'm trying to figure out what is needed for a new toolchain like this... I'm not 100% certain what each component does, so this might be inaccurate.
We'll need a new Fetcher for sure... Perhaps the Filesystem fetcher can be tweaked? It will need to get a list of subdirectories, use the subdirectory names as identifiers, and fetch the dublin_core.xml file (and rename it to the identifier I guess?)
Metadata Parser: Could I just use the Templated one? But I will need to figure out how to turn something like dcvalue element="contributor" qualifier="author" into a Twig variable.
Filegetter: I'm not sure whether the CONTENTS file is the way to go or not. Here's a sample of the contents of a CONTENTS file:
5 Gamification Paradigm.pdf bundle:ORIGINAL
license_rdf bundle:CC-LICENSE
license.txt bundle:LICENSE
5 Gamification Paradigm.pdf.txt bundle:TEXT description:Extracted text
The license files can be ignored. The important files are the PDF (or other filetype) and the TXT.
We also have objects with multiple files in them, which look like they'll be compound objects... That's more complicated.
Writer: Do we need a new writer, then? I'm not sure how these things really work.
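On the question of turning dcvalue element="contributor" qualifier="author" into a template variable, here is a minimal sketch of the flattening step. It is written in Python purely as an illustration and is not MIK code; the function name flatten_dc and the element_qualifier key convention are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Trimmed copy of the sample dublin_core.xml shown earlier in the thread.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8" standalone="no"?>
<dublin_core schema="dc">
<dcvalue element="contributor" qualifier="author" language="">Hayman, Richard</dcvalue>
<dcvalue element="date" qualifier="issued" language="">2009</dcvalue>
<dcvalue element="subject" qualifier="none" language="en_US">Human rights</dcvalue>
<dcvalue element="subject" qualifier="none" language="en_US">Social justice</dcvalue>
</dublin_core>
"""


def flatten_dc(xml_bytes):
    """Map each dcvalue to a list of values keyed by element and qualifier.

    Hypothetical key convention: "element" when the qualifier is "none"
    (or missing), otherwise "element_qualifier".
    """
    root = ET.fromstring(xml_bytes)
    values = defaultdict(list)
    for node in root.iter("dcvalue"):
        element = node.get("element")
        qualifier = node.get("qualifier") or "none"
        key = element if qualifier == "none" else f"{element}_{qualifier}"
        values[key].append((node.text or "").strip())
    return dict(values)


if __name__ == "__main__":
    for key, vals in flatten_dc(SAMPLE).items():
        print(key, vals)
```

Keeping every key as a list means repeated elements such as subject survive the flattening, so a template could loop over them rather than seeing only the first value.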
@bondjimbond admittedly, the required methods, etc. of each type of class are not thoroughly documented, but the "Required properties and functions in subclassed components" section of https://github.com/MarcusBarnes/mik/wiki/Information-for-developers (which you probably have already seen) does provide some detail.
A twig template for the metadata parser is a good idea, but since you're getting XML from the source, maybe an XSLT (DC->MODS) would be a better fit?
@mjordan Yes, so long as it's good enough to meet our requirements. LoC transforms are often inadequate.
This is the latest one: https://www.loc.gov/standards/mods/v3/DC_MODS3-5_XSLT1-0.xsl
Have we got a way in MIK to apply an XSLT to a metadata file?
I can help with that.
Thanks. :) I think I'm going to need a lot of help developing this..
@bondjimbond will this XSL, or some modified version of it, do the trick? http://www.loc.gov/standards/mods/v3/DC_MODS3-5_XSLT1-0.xsl
Yeah, that's the plan. We should leave it open to customization (allow user to point to a locally-stored XSLT rather than hard-code the remote one), but that's a good default.
OK, let me whip up the start of a metadata parser that takes in a DC file and writes out a corresponding MODS file.
Awesome, thanks!
@bondjimbond bad news - that LoC stylesheet won't work with the sample DC file you provide above.
@mjordan I'm not surprised; I expect it will need some modification.
First step is to get the parser working on a standard DC document: https://arcabc.ca/islandora/object/dc%3A34849/datastream/MODS/view
I should be able to get my hands on the transform file that they use to create their OAI output; that will be the "custom" one that we should be able to drop in.
I've got an Xslt.php metadata parser, based on the Templated.php parser but instead of using Twig to generate the output, it uses a stylesheet. It doesn't hard code DC or MODS, it could be used with any stylesheet. It's hard to test it in use without the other toolchain components though. @bondjimbond have you done any work on the fetcher or filegetter yet? Want me to share my parser so you can have a look? Not sure what the best way to proceed is.
Please share your parser. I have looked at the fetcher and filegetter files, but I haven't tried any code yet... still trying to understand how they work.
OK, it's attached. It requires a new .ini setting: ['METADATA_PARSER']['stylesheet']. I haven't tested it within a toolchain yet, but all the parts are there. One thing I'm not sure about is whether we'll need to register namespaces to use it on some XML files. If so, I'd say that the .ini file is the place to register them, so it's extensible beyond DC->MODS. But maybe we could hard code them for now to get going.
To use this, put it in the mik\metadataparsers\templated directory. The [METADATA_PARSER] section of your .ini file should look like this:
[METADATA_PARSER]
class = templated\Xslt
stylesheet = mystylesheet.xsl
Woops, use this one.
@bondjimbond the fetcher will need to extract the Simple Archive Format zip and iterate through each directory. The record_key for each item should be the directory name. The fetcher's getNumRecs method should return the number of item subdirectories. Its getRecords method will return an array of all the dublin_core.xml documents, keyed using the record_key.
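The fetcher behaviour described above can be sketched roughly as follows. This is a Python illustration of the logic only, not the MIK PHP fetcher; get_records and get_num_recs are hypothetical stand-ins for getRecords and getNumRecs, and the sketch assumes the archive has already been unzipped. It returns the path to each dublin_core.xml rather than the parsed document.

```python
import os


def get_records(input_directory):
    """Return {record_key: path to dublin_core.xml} for each item directory.

    The record_key is the item subdirectory name. Subdirectories without a
    dublin_core.xml file are skipped (a choice made for this sketch).
    """
    records = {}
    for name in sorted(os.listdir(input_directory)):
        item_dir = os.path.join(input_directory, name)
        dc_path = os.path.join(item_dir, "dublin_core.xml")
        if os.path.isdir(item_dir) and os.path.isfile(dc_path):
            records[name] = dc_path
    return records


def get_num_recs(input_directory):
    """Return the number of item subdirectories that hold a DC file."""
    return len(get_records(input_directory))
```

Deriving the count from the same listing keeps the two methods consistent with each other.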
Your filegetter's getChildren method will return an empty array (no children), and its getFilePath method will return the path to the PDF, etc. I think this would be the path to the file identified in the 'contents' manifest line that contains bundle:ORIGINAL.
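Picking the bundle:ORIGINAL line out of the manifest could look something like the sketch below. It is Python for illustration only, not MIK code; parse_contents and original_files are hypothetical names. It assumes each line is a filename followed by whitespace and a bundle:NAME field, as in the sample CONTENTS file earlier in the thread, and it skips lines that have no bundle field.

```python
import re

# Sample manifest copied from earlier in the thread.
SAMPLE_CONTENTS = """5 Gamification Paradigm.pdf bundle:ORIGINAL
license_rdf bundle:CC-LICENSE
license.txt bundle:LICENSE
5 Gamification Paradigm.pdf.txt bundle:TEXT description:Extracted text
"""

# Filename (may contain spaces), then whitespace, then "bundle:NAME".
LINE_PATTERN = re.compile(r"^(?P<name>.+?)\s+bundle:(?P<bundle>\S+)")


def parse_contents(text):
    """Return a list of (filename, bundle) pairs from a contents manifest."""
    entries = []
    for line in text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:
            entries.append((match.group("name"), match.group("bundle")))
    return entries


def original_files(text):
    """Return the filenames listed in the ORIGINAL bundle."""
    return [name for name, bundle in parse_contents(text) if bundle == "ORIGINAL"]


if __name__ == "__main__":
    print(original_files(SAMPLE_CONTENTS))
```

Returning a list rather than a single filename leaves room for items with more than one ORIGINAL file, which is the compound-object case raised earlier.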
The writer writes out the files generated/retrieved by the other toolchain components. I think you should be able to create a writer based on the existing CsvSingleFileJson.php writer class. You will need to adjust the details within https://github.com/MarcusBarnes/mik/blob/master/src/writers/CsvSingleFileJson.php#L88-L104, but everything from https://github.com/MarcusBarnes/mik/blob/master/src/writers/CsvSingleFileJson.php#L88-L104 should work as is. The writeMetadataFile method should work as is too, I think. Of course, rename your class in https://github.com/MarcusBarnes/mik/blob/master/src/writers/CsvSingleFileJson.php#L13 from 'CsvSingleFileJson' to 'SimpleArchiveFormat' and make sure the PHP file has the same name.
@mjordan Thanks. I think I can skip the zip part and process as an unzipped directory tree to begin with, till everything else is complete.
So for the Fetcher: it looks like the Filesystem fetcher might be a good place to start. But I'm having trouble grasping what it's doing, exactly.
The record_key for each item should be the directory name
So in https://github.com/MarcusBarnes/mik/blob/master/src/fetchers/Filesystem.php#L40-L54:
it looks like the code as-is should already be taking the directory name as the record_key; is that right?
And then in https://github.com/MarcusBarnes/mik/blob/master/src/fetchers/Filesystem.php#L93-L103, would this do the trick to acquire the DC files?
$files_with_name = $this->input_directory . DIRECTORY_SEPARATOR . $record_key . DIRECTORY_SEPARATOR . "dublin_core.xml";
I think you should be able to create a writer based on the existing CsvSingleFileJson.php writer class. You will need to adjust the details within https://github.com/MarcusBarnes/mik/blob/master/src/writers/CsvSingleFileJson.php#L88-L104, but everything from https://github.com/MarcusBarnes/mik/blob/master/src/writers/CsvSingleFileJson.php#L88-L104 should work as is.
@mjordan Those are the same lines... Which are the ones that need changing?
I'm working on some of these, but operating blind until I test them, I guess. I may have some questions for you soon.
@mjordan Well I'm giving it a shot, but I do foresee a lot of failure.
First attempt crashed out on the Fetcher with a very unhelpful error message:
Fatal error: Uncaught mik\exceptions\MikErrorException in /Users/brandon/sfuvault/mik/mik:105
Stack trace:
#0 /Users/brandon/sfuvault/mik/src/fetchers/Filesystem_Subdirectories.php(61): {closure}(8, 'Undefined varia...', '/Users/brandon/...', 61, Array)
#1 /Users/brandon/sfuvault/mik/src/fetchers/Filesystem_Subdirectories.php(81): mik\fetchers\Filesystem_Subdirectories->getRecords()
#2 /Users/brandon/sfuvault/mik/mik(131): mik\fetchers\Filesystem_Subdirectories->getNumRecs()
#3 {main}
thrown in /Users/brandon/sfuvault/mik/mik on line 105
Filesystem_Subdirectories.php.txt mods.xsl.txt mru_xsl.ini.txt SimpleArchive.php.filegetter.txt SimpleArchive.php.writer.txt
@bondjimbond I've done a bit of work to these and am attaching a zip containing my versions. Your .ini should work with them. This toolchain produces output, but the .xml files are not being created properly. The stylesheet is not transforming the dublin core in each item's dublin_core.xml file to MODS.
Thanks, @mjordan! I don't get output with these, but I do get errors that produce possibly helpful direction.
[2019-09-03 15:24:24] ErrorException.ERROR: ErrorException {"message":"file_get_contents(mods.xsl): failed to open stream: No such file or directory",
...
"file":"/Users/brandon/sfuvault/mik/src/metadataparsers/templated/Xslt.php","line":78} []
At line 78 I have this code:
$xslt = file_get_contents($this->xslt_stylesheet_path);
In my .ini file, I have:
stylesheet = "mru/mods.xsl"
I had originally had this without quotes, but it's the same result. The actual mods.xsl file is in mik/mru/mods.xsl.
Is there something wrong with my path, perhaps?
Update: It's not the path. Same error even if I set the path to stylesheet = "/users/Brandon/sfuvault/mik/mru/mods.xsl"
Update again: The problem seems to be the code... $xslt = file_get_contents($this->xslt_stylesheet_path); kills the whole function (tested with print statements before and after). But $xslt = file_get_contents("/users/Brandon/sfuvault/mik/mru/mods.xsl"); actually produces results (albeit problematic ones).
One more update:
I tried using standard DC files and the standard LOC transform. If I hard-code line 78 of Xslt.php to the standard DC-to-MODS XSL, it works!
So there are two tasks ahead...
@bondjimbond can you make the following change to see if it resolves the path issue? Change file_get_contents($this->xslt_stylesheet_path); to file_get_contents(realpath($this->xslt_stylesheet_path)); and then try to use stylesheet = "mru/mods.xsl" in your config?
@mjordan New error: "Filename cannot be empty" (referring to same line)
Hm, it worked for me. It is possible I zipped up the wrong file. Anyway, my laptop is at home so I won't be able to investigate until this evening. Sounds like we're making progress though.
Just opened the zip to take a look. The problem is in line 20 of the metadata parser. Can you change it to $this->xslt_stylesheet_path = realpath($settings['METADATA_PARSER']['stylesheet']); ?
Awesome, that worked! Thanks!
So I think this means the toolchain is pretty much ready to go -- except that my XSLTs don't work for the dublin_core.xml files that I have.
Guess it's time to figure out how to write custom ones...
Or could these files also be used with Twig, hopefully without modification?
Nice!
To avoid using XSLT and just use plain old Twig, you'd have to parse out the values from the incoming DC XML. Not sure if doing that is preferable to tweaking the existing XSLT stylesheet.
I'd be willing to try to tweak the XSLT if you want.
That would be extremely helpful, thanks. The XSL is attached.
It looks like the XSLT we got from MRU is based on a different DC record than the one that comes from the Simple Archive format. Simple Archive DC elements look like this:
<dcvalue element="contributor" qualifier="author" language="">Salyers, Vince</dcvalue>
instead of something simple like:
<dc:contributor>Salyers, Vince</dc:contributor>
The benefit of the Simple Archive format is that it includes the qualifiers that are lost with basic DC, like the "author" qualifier.
Can I use any of the DC XML files in the sample simple archive you sent me earlier?
Yep! They're all real data.
It's not perfect (for example, it only picks up a single subject) but it's a good start.
That's fantastic. Thanks, @mjordan! I'll see if I can figure out how to get all instances of a given element.
Wrapping the rule in a for-each is the standard way but it doesn't seem to be working here.
I'm just looking at one of the files (ajp-v4-id1057.xml), but it seems to be handling the multiple subjects just fine.
I am seeing a weird problem, though, with <dcvalue element="publisher" qualifier="none" language="en_US"> getting mapped both to <mods:publisher> and <mods:genre>, and <dcvalue element="publisher" qualifier="uri" language="en_US"> mapping just to <mods:genre>.
I'll do some tweaking.
@mjordan Regarding your subject problem - here's where you're going wrong:
<xsl:for-each select="dcvalue[@element='subject' and @qualifier='none' or qualifier='lcsh']">
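Mixing an AND and an OR in one predicate without parentheses is risky in XPath: "and" binds tighter than "or", so this reads as (subject and none) or lcsh. The two qualifier tests need to be grouped in parentheses, or split into two separate select statements.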
I'll have this finished soon and upload a new XSL file just for the record.
Also, the line omits the @ in @qualifier='lcsh'
Update: Nope, that's not the only problem. Sigh. More digging to do...
Here's my updated XSLT -- everything working except, it seems, subjects.
Subjects with qualifier="none" work just fine, while subjects with qualifier="lcsh" are no good. I can't figure out why.
And if I just remove the attribute stuff from the subject entry, it works with no problems. Bizarre.
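For reference, the selection the stylesheet is aiming for (subjects whose qualifier is either none or lcsh) can be sketched outside XSLT. The Python below is an illustration only; select_subjects is a hypothetical name and the lcsh record in the sample is made up for the example. In XPath 1.0 the equivalent predicate would be dcvalue[@element='subject' and (@qualifier='none' or @qualifier='lcsh')], with parentheses around the two qualifier tests and an @ on both.

```python
import xml.etree.ElementTree as ET

# Illustrative input: the "lcsh" value is invented for this sketch.
SAMPLE = b"""<dublin_core schema="dc">
<dcvalue element="subject" qualifier="none" language="en_US">Human rights</dcvalue>
<dcvalue element="subject" qualifier="lcsh" language="en_US">Example heading</dcvalue>
<dcvalue element="publisher" qualifier="none" language="en_US">Information for Social Change</dcvalue>
</dublin_core>
"""

SUBJECT_QUALIFIERS = {"none", "lcsh"}


def select_subjects(xml_bytes):
    """Return the text of every subject whose qualifier is none or lcsh."""
    root = ET.fromstring(xml_bytes)
    return [
        (node.text or "").strip()
        for node in root.iter("dcvalue")
        # element must be "subject" AND qualifier must be one of the two.
        if node.get("element") == "subject"
        and node.get("qualifier") in SUBJECT_QUALIFIERS
    ]


if __name__ == "__main__":
    print(select_subjects(SAMPLE))
```

Running a check like this against a real dublin_core.xml is one way to confirm which subjects the stylesheet ought to emit before debugging the XSLT itself.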
Ha, that's XSLT in a nutshell. I'm glad you got it working.
Once you're happy with the new toolchain components (and of course you have the stylesheet you need), let's review what we have and see if it's consistent with the other toolchains. Then, someone should document the new components... any opinions on any of this @MarcusBarnes ?
Update again: I have no idea why this worked, but behold a working XSLT with subjects and authorities!
Will put the toolchain files together in a ZIP next and let you review. Or should this just go in a PR?
I'll review first, then you can open a PR.
Nice work BTW.
I've got a DSpace repository to migrate, and aside from OAI-PMH, the most promising of their export options seems to be the Simple Archive format.
This provides a directory full of subdirectories, each subdirectory representing an object, containing the main object file (PDF or JPG or whatever), metadata, and various other datastreams (or "bitstreams" in DSpace terms).
At minimum, every directory will have a dublin_core.xml file and the main object file.
A sample export is attached.
Basically, the toolchain will need to loop through the subdirectories, and in each subdirectory, do the following:
partial_export_2019_Jun_04_3_5.zip