Islandora / islandora_premis

This module produces XML and HTML representations of PREMIS metadata for objects in your repository.

PREMIS contentLocationValue is always empty #21

Closed mjordan closed 10 years ago

mjordan commented 10 years ago

foxml:contentLocation REF value is not being selected so premis:contentLocationValue is always empty.

ruebot commented 10 years ago

That's why that field is always blank!

mjordan commented 10 years ago

I was wrong - it is not blank for the new External and Redirect datastreams.... It used to work, as evidenced by https://gist.github.com/mjordan/8250658 ....this is an early example, pre-'agent'.... so we introduced a bug somewhere along the line. None of the other examples in my gists have values so the bug is pretty old.

Which reminds me... we probably should insert a 'date generated' comment in the PREMIS XML. I'm surprised there isn't a PREMIS header like there is a METS header that contains metadata about the XML file itself.

mjordan commented 10 years ago

Giving up for tonight and committing a couple of other small changes to the XSL. The problem is definitely with selecting the value of foxml:contentLocation/@REF, not with the variable assignment.

ruebot commented 10 years ago

Ok. I'm with you on the frustration. Just spent an hour trying to figure it out, and it doesn't make any sense at all.

I can get it to print anything in this tree, except contentLocation!

<foxml:datastream ID="JPG" STATE="A" CONTROL_GROUP="M" VERSIONABLE="true">
    <foxml:datastreamVersion ID="JPG.0" LABEL="Medium sized JPEG" CREATED="2013-11-08T12:49:38.851Z" MIMETYPE="image/jpeg" SIZE="237956">
        <foxml:contentDigest TYPE="SHA-1" DIGEST="9045e6ff00de22cd33b271dfeed65df51a733a80"/>
        <foxml:contentLocation TYPE="INTERNAL_ID" REF="yul:89067+JPG+JPG.0"/>
    </foxml:datastreamVersion>
</foxml:datastream>

ruebot commented 10 years ago

Here is the full foxml:

<?xml version="1.0" encoding="UTF-8"?>
<foxml:digitalObject xmlns:foxml="info:fedora/fedora-system:def/foxml#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" VERSION="1.1" PID="yul:89067" xsi:schemaLocation="info:fedora/fedora-system:def/foxml# http://www.fedora.info/definitions/1/0/foxml1-1.xsd">
<foxml:objectProperties>
    <foxml:property NAME="info:fedora/fedora-system:def/model#state" VALUE="Active"/>
    <foxml:property NAME="info:fedora/fedora-system:def/model#label" VALUE="&quot;City of Dover&quot; : bought by Penetang group"/>
    <foxml:property NAME="info:fedora/fedora-system:def/model#ownerId" VALUE="nruest"/>
    <foxml:property NAME="info:fedora/fedora-system:def/model#createdDate" VALUE="2013-11-08T12:49:34.259Z"/>
    <foxml:property NAME="info:fedora/fedora-system:def/view#lastModifiedDate" VALUE="2013-12-31T08:02:53.659Z"/>
</foxml:objectProperties>
<foxml:datastream ID="AUDIT" STATE="A" CONTROL_GROUP="X" VERSIONABLE="false">
    <foxml:datastreamVersion ID="AUDIT.0" LABEL="Audit Trail for this object" CREATED="2013-11-08T12:49:34.259Z" MIMETYPE="text/xml" FORMAT_URI="info:fedora/fedora-system:format/xml.fedora.audit">
    <foxml:xmlContent>
        <audit:auditTrail xmlns:audit="info:fedora/fedora-system:def/audit#">
        <audit:record ID="AUDREC1">
            <audit:process type="Fedora API-M"/>
            <audit:action>addDatastream</audit:action>
            <audit:componentID>TECHMD_FITS</audit:componentID>
            <audit:responsibility>nruest</audit:responsibility>
            <audit:date>2013-11-08T12:49:38.223Z</audit:date>
            <audit:justification>Copied datastream from yul:89067.</audit:justification>
        </audit:record>
        <audit:record ID="AUDREC2">
            <audit:process type="Fedora API-M"/>
            <audit:action>addDatastream</audit:action>
            <audit:componentID>TN</audit:componentID>
            <audit:responsibility>nruest</audit:responsibility>
            <audit:date>2013-11-08T12:49:38.531Z</audit:date>
            <audit:justification>Copied datastream from yul:89067.</audit:justification>
        </audit:record>
        <audit:record ID="AUDREC3">
            <audit:process type="Fedora API-M"/>
            <audit:action>addDatastream</audit:action>
            <audit:componentID>JPG</audit:componentID>
            <audit:responsibility>nruest</audit:responsibility>
            <audit:date>2013-11-08T12:49:38.851Z</audit:date>
            <audit:justification>Copied datastream from yul:89067.</audit:justification>
        </audit:record>
        <audit:record ID="AUDREC4">
            <audit:process type="Fedora API-M"/>
            <audit:action>addDatastream</audit:action>
            <audit:componentID>JP2</audit:componentID>
            <audit:responsibility>nruest</audit:responsibility>
            <audit:date>2013-11-08T12:49:39.306Z</audit:date>
            <audit:justification>Copied datastream from yul:89067.</audit:justification>
        </audit:record>
        <audit:record ID="AUDREC5">
            <audit:process type="Fedora API-M"/>
            <audit:action>modifyObject</audit:action>
            <audit:componentID/>
            <audit:responsibility>anonymous</audit:responsibility>
            <audit:date>2013-12-31T08:02:52.959Z</audit:date>
            <audit:justification>PREMIS:eventType=fixity check; PREMIS:file=yul:89067+MODS+MODS.0; PREMIS:eventOutcome=SHA-1 checksum validated.</audit:justification>
        </audit:record>
        <audit:record ID="AUDREC6">
            <audit:process type="Fedora API-M"/>
            <audit:action>modifyObject</audit:action>
            <audit:componentID/>
            <audit:responsibility>anonymous</audit:responsibility>
            <audit:date>2013-12-31T08:02:53.058Z</audit:date>
            <audit:justification>PREMIS:eventType=fixity check; PREMIS:file=yul:89067+DC+DC.0; PREMIS:eventOutcome=SHA-1 checksum validated.</audit:justification>
        </audit:record>
        <audit:record ID="AUDREC7">
            <audit:process type="Fedora API-M"/>
            <audit:action>modifyObject</audit:action>
            <audit:componentID/>
            <audit:responsibility>anonymous</audit:responsibility>
            <audit:date>2013-12-31T08:02:53.659Z</audit:date>
            <audit:justification>PREMIS:eventType=fixity check; PREMIS:file=yul:89067+OBJ+OBJ.0; PREMIS:eventOutcome=SHA-1 checksum validated.</audit:justification>
        </audit:record>
        </audit:auditTrail>
    </foxml:xmlContent>
    </foxml:datastreamVersion>
</foxml:datastream>
<foxml:datastream ID="RELS-EXT" STATE="A" CONTROL_GROUP="X" VERSIONABLE="true">
    <foxml:datastreamVersion ID="RELS-EXT.0" LABEL="Fedora Object to Object Relationship Metadata." CREATED="2013-11-08T12:49:34.259Z" MIMETYPE="application/rdf+xml" FORMAT_URI="info:fedora/fedora-system:FedoraRELSExt-1.0" SIZE="544">
    <foxml:contentDigest TYPE="SHA-1" DIGEST="8acd007d964a3bf29e44d0978c1369051a6abbd1"/>
    <foxml:xmlContent>
        <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:fedora="info:fedora/fedora-system:def/relations-external#" xmlns:fedora-model="info:fedora/fedora-system:def/model#" xmlns:islandora="http://islandora.ca/ontology/relsext#">
        <rdf:Description rdf:about="info:fedora/yul:89067">
        <fedora:isMemberOfCollection rdf:resource="info:fedora/yul:F0433"/>
        <fedora-model:hasModel rdf:resource="info:fedora/islandora:sp_large_image_cmodel"/>
    </rdf:Description>
</rdf:RDF>
</foxml:xmlContent>
</foxml:datastreamVersion>
</foxml:datastream>
<foxml:datastream ID="MODS" STATE="A" CONTROL_GROUP="X" VERSIONABLE="true">
    <foxml:datastreamVersion ID="MODS.0" LABEL="MODS Record" CREATED="2013-11-08T12:49:34.259Z" MIMETYPE="text/xml" SIZE="2387">
    <foxml:contentDigest TYPE="SHA-1" DIGEST="a94e53eb3f379cdd43594ae55047583400a08bbb"/>
    <foxml:xmlContent>
        <mods xmlns="http://www.loc.gov/mods/v3" xmlns:mods="http://www.loc.gov/mods/v3" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
        <identifier type="local">ASC02928</identifier>
        <identifier type="hdl">http://hdl.handle.net/10315/1849</identifier>
        <location>
            <physicalLocation>1974-002 / 192 (858)</physicalLocation>
        </location>
        <titleInfo>
            <title>"City of Dover" : bought by Penetang group</title>
        </titleInfo>
        <abstract>Image of small ship at dock in ice; large ship is at dock in background; probably Master Feeds in distance</abstract>
        <targetAudience>ASC Red Dot</targetAudience>
        <name>
            <namePart>Toronto Telegram</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">Publisher</roleTerm>
            </role>
        </name>
        <originInfo>
            <dateCreated>2000/03/31</dateCreated>
            <dateIssued>1949/04/05</dateIssued>
            <publisher>Toronto Telegram</publisher>
            <place>
                <placeTerm type="text">Canada</placeTerm>
            </place>
            <place>
                <placeTerm type="text">Toronto</placeTerm>
            </place>
        </originInfo>
        <typeOfResource>still image</typeOfResource>
        <genre authority="lctgm">Documentary Photography</genre>
        <language>
            <languageTerm authority="iso639-2b" type="code">eng</languageTerm>
        </language>
        <physicalDescription>
            <form>nonprojected graphic</form>
            <extent>1 photograph : b&amp;amp;w negative ; 10 x 13 cm</extent>
        </physicalDescription>
        <note>Box 1 CD 1B</note>
        <relatedItem>
            <titleInfo>
                <title>Toronto Telegram fonds, F0433</title>
            </titleInfo>
            <location>
                <location>
                    <url note="Finding Aid">http://archivesfa.library.yorku.ca/fonds/ON00370-f0000433.htm</url>
                </location>
            </location>
        </relatedItem>
        <subject>
            <topic>Toronto Telegram</topic>
        </subject>
        <accessCondition type="useAndReproduction">For further copyright information contact : ascproj@yorku.ca</accessCondition>
    </mods>
</foxml:xmlContent>
</foxml:datastreamVersion>
</foxml:datastream>
<foxml:datastream ID="DC" STATE="A" CONTROL_GROUP="X" VERSIONABLE="true">
    <foxml:datastreamVersion ID="DC.0" LABEL="DC Record" CREATED="2013-11-08T12:49:34.259Z" MIMETYPE="text/xml" SIZE="1276">
    <foxml:contentDigest TYPE="SHA-1" DIGEST="ad931b32519134be6074e22da8f332f79f268585"/>
    <foxml:xmlContent>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
        <dc:title>"City of Dover" : bought by Penetang group</dc:title>
        <dc:subject>Toronto Telegram</dc:subject>
        <dc:description>Image of small ship at dock in ice; large ship is at dock in background; probably Master Feeds in distance</dc:description>
        <dc:description>Box 1 CD 1B</dc:description>
        <dc:publisher>Toronto Telegram</dc:publisher>
        <dc:contributor>Toronto Telegram (Publisher)</dc:contributor>
        <dc:type>StillImage</dc:type>
        <dc:type>Documentary Photography</dc:type>
        <dc:format>1 photograph : b&amp;amp;w negative ; 10 x 13 cm</dc:format>
        <dc:format>nonprojected graphic</dc:format>
        <dc:identifier>yul:89067</dc:identifier>
        <dc:identifier>ASC02928</dc:identifier>
        <dc:identifier>http://hdl.handle.net/10315/1849</dc:identifier>
        <dc:identifier/>
        <dc:language>eng</dc:language>
        <dc:relation>Toronto Telegram fonds, F0433</dc:relation>
        <dc:rights>For further copyright information contact : ascproj@yorku.ca</dc:rights>
    </oai_dc:dc>
</foxml:xmlContent>
</foxml:datastreamVersion>
</foxml:datastream>
<foxml:datastream ID="OBJ" STATE="A" CONTROL_GROUP="M" VERSIONABLE="true">
    <foxml:datastreamVersion ID="OBJ.0" LABEL="OBJ Datastream" CREATED="2013-11-08T12:49:34.259Z" MIMETYPE="image/tiff" SIZE="16129943">
    <foxml:contentDigest TYPE="SHA-1" DIGEST="ca62dacfbe4ec9ea85c140295102cefabbb72b4d"/>
    <foxml:contentLocation TYPE="INTERNAL_ID" REF="yul:89067+OBJ+OBJ.0"/>
</foxml:datastreamVersion>
</foxml:datastream>
<foxml:datastream ID="TECHMD_FITS" STATE="A" CONTROL_GROUP="M" VERSIONABLE="true">
    <foxml:datastreamVersion ID="TECHMD_FITS.0" LABEL="TECHMD_FITS" CREATED="2013-11-08T12:49:38.223Z" MIMETYPE="text/xml" SIZE="8659">
    <foxml:contentDigest TYPE="SHA-1" DIGEST="9f777fbf664f54ec0e8df96a7e37a3179a96d9be"/>
    <foxml:contentLocation TYPE="INTERNAL_ID" REF="yul:89067+TECHMD_FITS+TECHMD_FITS.0"/>
</foxml:datastreamVersion>
</foxml:datastream>
<foxml:datastream ID="TN" STATE="A" CONTROL_GROUP="M" VERSIONABLE="true">
    <foxml:datastreamVersion ID="TN.0" LABEL="Thumbnail" CREATED="2013-11-08T12:49:38.531Z" MIMETYPE="image/jpeg" SIZE="31133">
    <foxml:contentDigest TYPE="SHA-1" DIGEST="dfdf5f8fdf9bc741a423cef6f9c7de3d925d4094"/>
    <foxml:contentLocation TYPE="INTERNAL_ID" REF="yul:89067+TN+TN.0"/>
</foxml:datastreamVersion>
</foxml:datastream>
<foxml:datastream ID="JPG" STATE="A" CONTROL_GROUP="M" VERSIONABLE="true">
    <foxml:datastreamVersion ID="JPG.0" LABEL="Medium sized JPEG" CREATED="2013-11-08T12:49:38.851Z" MIMETYPE="image/jpeg" SIZE="237956">
    <foxml:contentDigest TYPE="SHA-1" DIGEST="9045e6ff00de22cd33b271dfeed65df51a733a80"/>
    <foxml:contentLocation TYPE="INTERNAL_ID" REF="yul:89067+JPG+JPG.0"/>
</foxml:datastreamVersion>
</foxml:datastream>
<foxml:datastream ID="JP2" STATE="A" CONTROL_GROUP="M" VERSIONABLE="true">
    <foxml:datastreamVersion ID="JP2.0" LABEL="JPEG 2000" CREATED="2013-11-08T12:49:39.306Z" MIMETYPE="image/jp2" SIZE="336316">
    <foxml:contentDigest TYPE="SHA-1" DIGEST="3309e618a40456a72970d966a0697c2790e705ee"/>
    <foxml:contentLocation TYPE="INTERNAL_ID" REF="yul:89067+JP2+JP2.0"/>
</foxml:datastreamVersion>
</foxml:datastream>
</foxml:digitalObject>

ruebot commented 10 years ago

I take that back, I can't get anything from datastreamVersion to print either. Just contentDigest. Progress?

mjordan commented 10 years ago

Haven't started on this one yet today. Will merge in branch mentioned in #14 first.

ruebot commented 10 years ago

I pushed pretty xml. If that messes up your merge, let me know, and I'll roll back.

mjordan commented 10 years ago

Yeah, please do. Just tried to merge and got conflicts. Have reset --hard so I'm back on c3ede93f6d6f237f4570d5c5096d840f0e8b2e7e.

ruebot commented 10 years ago

done

mjordan commented 10 years ago

Pulled but still have conflicts. Let me take a look.

mjordan commented 10 years ago

This is weird - when I git pull I go back to c3ede93f6d6f237f4570d5c5096d840f0e8b2e7e and it tells me I'm up to date, but when I visit https://github.com/ruebot/islandora_premis/commits/7.x it tells me that 9836700920 is the latest. Same in two different browsers so it's not a cache issue. Any idea why the discrepancy?

ruebot commented 10 years ago

c3ede93 was the pretty print commit. I got rid of that in origin. HEAD should be at 9836700 now, which is your last commit, and where you said not to do anything.... which I violated.

mjordan commented 10 years ago

So should I revert back to 9836700920b245dabc78b9026bb650da6e13d759 in my local copy?

ruebot commented 10 years ago

yeah!

git reset --hard HEAD~1 on your 7.x branch should do it.

mjordan commented 10 years ago

OK, thanks, back at 9836700920b245dabc78b9026bb650da6e13d759. Let me try my merge again.

mjordan commented 10 years ago

OK, success, will push if you think it's OK.

ruebot commented 10 years ago

PUSH!

I'll continue hacking on my end... but pretty print first. If you're a vim user, this is killer: :%!xmllint --format - (pretty-prints and checks well-formedness!)

mjordan commented 10 years ago

Pushed.

I, sir, am a vim user. Thanks!

ruebot commented 10 years ago

nmap <Leader>xml :%!xmllint --format -<CR>

ruebot commented 10 years ago

/foxml:digitalObject/foxml:datastream/foxml:datastreamVersion/foxml:contentLocation/@REF should do it, right?

This is what I am getting here:

[Screenshot from 2014-01-08 11:47:54]

mjordan commented 10 years ago

You can try it. But foxml:datastreamVersion is the context node, so I don't see why foxml:contentLocation/@REF won't work. Similar context-node queries work elsewhere.

BTW, saxon does give me some output using the current foxml_to_premis.xsl. Could be a stupid obscure PHP or libxslt bug. But it did work in earlier versions of the stylesheet. That's what I don't understand.
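
For reference, the two query shapes discussed here can be compared outside XSLT. The following is a minimal Python sketch using only the standard library; the FOXML is a trimmed copy of the JPG and TN datastreams posted above (the trimming, and the variable names, are assumptions made for brevity). It only illustrates that the relative, context-node path and a descendant search select the same attributes when foxml:contentLocation elements are present in the input; it does not reproduce libxslt's behaviour.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the FOXML posted in this thread.
NS = {"foxml": "info:fedora/fedora-system:def/foxml#"}

# Trimmed FOXML: two Managed datastreams as shown in the Web Administrator.
FOXML = """<foxml:digitalObject xmlns:foxml="info:fedora/fedora-system:def/foxml#" PID="yul:89067">
  <foxml:datastream ID="TN" STATE="A" CONTROL_GROUP="M" VERSIONABLE="true">
    <foxml:datastreamVersion ID="TN.0" MIMETYPE="image/jpeg">
      <foxml:contentLocation TYPE="INTERNAL_ID" REF="yul:89067+TN+TN.0"/>
    </foxml:datastreamVersion>
  </foxml:datastream>
  <foxml:datastream ID="JPG" STATE="A" CONTROL_GROUP="M" VERSIONABLE="true">
    <foxml:datastreamVersion ID="JPG.0" MIMETYPE="image/jpeg">
      <foxml:contentLocation TYPE="INTERNAL_ID" REF="yul:89067+JPG+JPG.0"/>
    </foxml:datastreamVersion>
  </foxml:datastream>
</foxml:digitalObject>"""

root = ET.fromstring(FOXML)

# Relative query: each datastreamVersion acts as the context node,
# mirroring foxml:contentLocation/@REF inside a template.
relative = [
    loc.get("REF")
    for version in root.findall("foxml:datastream/foxml:datastreamVersion", NS)
    for loc in version.findall("foxml:contentLocation", NS)
]

# Descendant query from the document element, mirroring //foxml:contentLocation/@REF.
anywhere = [loc.get("REF") for loc in root.findall(".//foxml:contentLocation", NS)]

print(relative)
print(anywhere)
```

If both lists match on a given input, the path itself is sound, and the input document becomes the next thing to check.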

dmoses commented 10 years ago

Really strange ... I poked at this a little bit. It is a puzzle. It seems to be the xpath, but the path is right. It works if you change the content_location variable to this: <xsl:variable name="content_location" select="//foxml:contentLocation/@REF"/> The variable is declared further down as well ... is it working there?

ruebot commented 10 years ago

This is my current version of the file, and it doesn't seem to be working here.

dmoses commented 10 years ago

Nick ... I ran the foxml you provided previously through the xslt you linked to and it gets transformed by Oxygen and the elements get populated. See it here: https://gist.github.com/dmoses/8322788 The transformer I'm using is saxon 6.5.5.

mjordan commented 10 years ago

The plot thickens...

I am getting expected values in contentLocationValue when I run the following CLI PHP script, which BTW is essentially the same code as we're running in the module. Also, foxml_to_premis.xsl is the same one that is not working for me in the module:

<?php

// Load the stylesheet under test.
$xsl_doc = new DOMDocument();
$xsl_doc->load("foxml_to_premis.xsl");

// Load the FOXML document to transform.
$xml_doc = new DOMDocument();
$xml_doc->load("changeme_15.fox.xml");

// Apply the stylesheet with PHP's XSLTProcessor.
$xslt_proc = new XSLTProcessor();
$xslt_proc->importStylesheet($xsl_doc);

$output = $xslt_proc->transformToXML($xml_doc);

print $output;
?>

When I grep the output, I get:

       <contentLocationValue/>
        <contentLocationValue/>
        <contentLocationValue/>
        <contentLocationValue/>
        <contentLocationValue>changeme:15+OBJ+OBJ.0</contentLocationValue>
        <contentLocationValue>changeme:15+TECHMD+TECHMD.0</contentLocationValue>
        <contentLocationValue>changeme:15+TN+TN.0</contentLocationValue>
        <contentLocationValue>changeme:15+MEDIUM_SIZE+MEDIUM_SIZE.0</contentLocationValue>

but changeme:15+OBJ+OBJ.0, etc. do not appear in the version generated via the module. Same stylesheet. This has to be a PHP issue.

mjordan commented 10 years ago

This is a shameful hack, but what if we added XML parsing code to the islandora_premis_run_xsl_transform() function to grab the value of foxml:contentLocation/@REF and passed it into the stylesheet as a parameter? We may never figure out what is causing this truly cruel bug.

ruebot commented 10 years ago

Let's try it, and see what happens!

@dmoses good to know! @edf shared a perl transformer that worked as well. So, there is definitely something super wonky here that we're not seeing.

mjordan commented 10 years ago

I can try it tonight, but feel free to take a stab sooner if you have time.

mjordan commented 10 years ago

Found the problem... this is frigging hilarious.

When exported (which is what we do in this module), FOXML doesn't contain any foxml:contentLocation elements for Managed datastreams. Instead, the content for those datastreams is embedded within the XML itself as base64 strings in foxml:binaryContent tags. See https://gist.github.com/mjordan/8329267 for an example of the FOXML we are applying our stylesheet to. Go to line 1081 to see the OBJ datastream file embedded in the XML.

So, we aren't getting any matches for foxml:contentLocation/@REF for Managed datastreams because there aren't any of those elements in exported FOXML that we're running through the stylesheet.

Reason I wrote that particular XSL to match on foxml:contentLocation/@REF is that within the Fedora Web Administrator, the "Object XML" does contain foxml:contentLocation elements. For example, the FOXML snippet for the OBJ datastream that I referred you to in the gist linked above is:

(FOXML snippet missing)

So, that's the problem. No weird PHP or libxslt bugs like I thought. Just bugs crawling all over me, every day, all day, even in the shower, like that character in A Scanner Darkly. My success with the CLI script was due to the fact that I copied the FOXML from the Web Administrator.

When I get my head together I'll try to think of where we can get a sensible value for retrieving the datastream. Maybe just the DSID, since that's really what you use when you want to retrieve the datastream anyway? We could include a comment saying something to that effect.
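
The difference between the two FOXML shapes can be checked mechanically. Below is a minimal Python sketch (standard library only) under stated assumptions: both datastreamVersion fragments are hand-written illustrations, the REF value is borrowed from the CLI output earlier in this thread, the base64 payload is a placeholder, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"foxml": "info:fedora/fedora-system:def/foxml#"}
XMLNS = 'xmlns:foxml="info:fedora/fedora-system:def/foxml#"'

# Illustrative shape of a Managed datastream as seen in the Web Administrator.
ADMIN_VIEW = f"""<foxml:datastreamVersion {XMLNS} ID="OBJ.0">
  <foxml:contentLocation TYPE="INTERNAL_ID" REF="changeme:15+OBJ+OBJ.0"/>
</foxml:datastreamVersion>"""

# Illustrative shape of the same datastream in the exported FOXML:
# the content is embedded as base64 (placeholder payload here).
EXPORTED = f"""<foxml:datastreamVersion {XMLNS} ID="OBJ.0">
  <foxml:binaryContent>aGVsbG8=</foxml:binaryContent>
</foxml:datastreamVersion>"""


def content_location_refs(version_xml):
    """Return the REF attribute of every foxml:contentLocation child."""
    version = ET.fromstring(version_xml)
    return [loc.get("REF") for loc in version.findall("foxml:contentLocation", NS)]


def has_binary_content(version_xml):
    """Report whether the version embeds its content as foxml:binaryContent."""
    version = ET.fromstring(version_xml)
    return version.find("foxml:binaryContent", NS) is not None


for label, xml in (("admin view", ADMIN_VIEW), ("exported", EXPORTED)):
    print(label, content_location_refs(xml), has_binary_content(xml))
```

A check like this on the FOXML the module actually receives would show an empty list of REF values, which is consistent with the empty contentLocationValue elements.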

edf commented 10 years ago

Really glad you figured out the weirdness!

Just a thought, instead of 'archive' could 'migrate' be used to reduce the file size of the binary stuff on line 110 in utilities.inc ?

mjordan commented 10 years ago

Awesome suggestion - using the 'migrate' format FOXML you get usable contentLocation values:

<foxml:contentLocation TYPE="INTERNAL_ID" REF="http://localhost:8080/fedora/get/changeme:20/TECHMD/2013-12-18T06:52:46.100Z"/>

As long as we're happy with this sort of URL (not sure what relevance 'localhost' has in the context of a PREMIS XML file), we're good to go with a one-line change to utilities.inc.
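
To make the 'archive' versus 'migrate' switch concrete, here is a minimal Python sketch that builds an export request URL with the context as a parameter. It assumes the Fedora 3 REST API's object export endpoint with its format and context query parameters; the base URL and PID are the example values from this thread, and the function name is hypothetical. The module itself goes through its own PHP code in utilities.inc, so this is an illustration of the parameter, not the module's implementation.

```python
from urllib.parse import quote, urlencode

# Format URI for FOXML 1.1 exports.
FOXML_FORMAT = "info:fedora/fedora-system:FOXML-1.1"

# Export contexts accepted by the Fedora 3 export call.
CONTEXTS = ("public", "migrate", "archive")


def export_url(base_url, pid, context="migrate"):
    """Build an object export URL (hypothetical helper for illustration)."""
    if context not in CONTEXTS:
        raise ValueError(f"unknown export context: {context!r}")
    query = urlencode({"format": FOXML_FORMAT, "context": context})
    # Keep the colon in the PID readable; escape anything else unsafe.
    return f"{base_url.rstrip('/')}/objects/{quote(pid, safe=':')}/export?{query}"


url = export_url("http://localhost:8080/fedora", "changeme:20")
print(url)
```

Only the context value changes between the two behaviours: 'archive' embeds Managed content as base64, while 'migrate' emits contentLocation references like the one shown above.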

ruebot commented 10 years ago

Oh. This makes sense then, because I was getting my FOXML from the web administrator as well.

:+1: on the proposed solution. I think the value presented there is more representative of "where" it actually is.

dmoses commented 10 years ago

Nice catch and the contentLocation values look more meaningful to me. I'm guessing that the localhost:8080 is what's defined in the fedora.fcfg file?

ruebot commented 10 years ago

Looks good!

mjordan commented 10 years ago

@dmoses I think the hostname is the one configured in the main Islandora module's admin settings, since that's where Islandora deposited the content. One 'feature' of the PREMIS module is that since it is generated on the fly, it's a snapshot of what Islandora thinks is the current location for a datastream at the time PREMIS is viewed/downloaded. Not sure if that answers your question though.