DCLP / dclpxsltbox

Sandbox for development, testing, and review of XSLT for DCLP
http://dclp.github.io/dclpxsltbox/

Mapping doesn’t produce expected xpaths #360

Closed Edelweiss closed 6 years ago

Edelweiss commented 6 years ago

Compare the XPaths from different RDF files that refer to the same TM number:

XPath as expected by SoSOL’s interface to the numbers server:

/rdf:RDF/
    rdf:Description[@rdf:about='http://papyri.info/dclp/129793/source']/
        dcterms:relation/
            @rdf:resource[not(. =//dcterms:replaces/@rdf:resource)]

RDF » Description » relation » @resource
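
To make the expected XPath concrete, here is a minimal Python sketch of what it selects: the `rdf:resource` of every `dcterms:relation` directly under the matching top-level `rdf:Description`, minus any resource that also appears on a `dcterms:replaces` anywhere in the document. The sample document and the `example.org` URIs are hypothetical, and the function name `related_resources` is made up for illustration. This is not SoSOL's actual code, which is Ruby.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
}
RDF_ABOUT = "{%s}about" % NS["rdf"]
RDF_RESOURCE = "{%s}resource" % NS["rdf"]

# Hypothetical sample, shaped the way the XPath expects (flat, top-level Description).
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="http://papyri.info/dclp/129793/source">
    <dcterms:relation rdf:resource="http://example.org/kept"/>
    <dcterms:relation rdf:resource="http://example.org/replaced"/>
    <dcterms:replaces rdf:resource="http://example.org/replaced"/>
  </rdf:Description>
</rdf:RDF>"""


def related_resources(rdf_xml, about):
    """Mimic the XPath: relation resources of `about` that are not replaced."""
    root = ET.fromstring(rdf_xml)
    # Equivalent of //dcterms:replaces/@rdf:resource (anywhere in the document).
    replaced = {
        el.get(RDF_RESOURCE)
        for el in root.iter("{%s}replaces" % NS["dcterms"])
        if el.get(RDF_RESOURCE) is not None
    }
    result = []
    # Equivalent of /rdf:RDF/rdf:Description[@rdf:about=...]: direct children only.
    for desc in root.findall("rdf:Description", NS):
        if desc.get(RDF_ABOUT) != about:
            continue
        for rel in desc.findall("dcterms:relation", NS):
            res = rel.get(RDF_RESOURCE)
            if res is not None and res not in replaced:
                result.append(res)
    return result


print(related_resources(SAMPLE, "http://papyri.info/dclp/129793/source"))
```

Because the lookup only inspects direct children of the root, it returns nothing as soon as the serializer nests the same statements differently, which is the failure mode this issue describes.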

Edelweiss commented 6 years ago

step by step analysis to track down error…

[1] Compare papyri and DCLP (navigator, master branch, mapping for HGV)

(files are identical: git diff papyri/master...dclp/master -- pn-mapping/xslt/hgv-rdf.xsl)

git remote -v
dclp    git@github.com:DCLP/navigator.git (fetch)
dclp    git@github.com:DCLP/navigator.git (push)
papyri    git@github.com:papyri/navigator.git (fetch)
papyri    git@github.com:papyri/navigator.git (push)

[2] Run test scenario on HGV file 4760 (idp.data, master branch, HGV file 4760.xml)

(files are identical: git diff papyri/master...dclp/master -- HGV_meta_EpiDoc/HGV5/4760.xml)

git remote -v
dclp    git@github.com:DCLP/idp.data.git (fetch)
dclp    git@github.com:DCLP/idp.data.git (push)
papyri  git@github.com:papyri/idp.data.git (fetch)
papyri  git@github.com:papyri/idp.data.git (push)

[3] Compare output using vimdiff (turn indentation on for test run, omit list of namespaces in output)

<rdf:Description rdf:about="http://papyri.info/hgv/4760/source">
   <dct:identifier>papyri.info/hgv/4760</dct:identifier>
   <dct:identifier>tm:4760</dct:identifier>
   <dct:identifier>
      <rdf:Description rdf:about="http://papyri.info/hgv/BGU_7_1510">
         <dct:identifier rdf:resource="http://papyri.info/hgv/4760/source"/>
      </rdf:Description>
   </dct:identifier>
   <dct:isPartOf>
      <rdf:Description rdf:about="http://papyri.info/hgv/BGU_7">
         <dct:bibliographicCitation>BGU 7</dct:bibliographicCitation>
         <rdf:type rdf:resource="http://purl.org/ontology/bibo/Book"/>
         <dct:isPartOf>
            <rdf:Description rdf:about="http://papyri.info/hgv/BGU">
               <rdf:type rdf:resource="http://purl.org/ontology/bibo/Series"/>
               <dct:bibliographicCitation>BGU</dct:bibliographicCitation>
               <dct:isPartOf rdf:resource="http://papyri.info/hgv"/>
            </rdf:Description>
         </dct:isPartOf>
      </rdf:Description>
   </dct:isPartOf>
   <dct:relation rdf:resource="http://www.trismegistos.org/text/4760"/>
   <dct:relation>
      <rdf:Description rdf:about="http://papyri.info/trismegistos/4760">
         <dct:relation rdf:resource="http://papyri.info/hgv/4760/source"/>
      </rdf:Description>
   </dct:relation>
   <dct:source>
      <rdf:Description rdf:about="http://papyri.info/hgv/4760/work">
         <dct:bibliographicCitation>BGU 7, 1510</dct:bibliographicCitation>
      </rdf:Description>
   </dct:source>
   <rdfs:label>BGU</rdfs:label>
   <foaf:page>
      <rdf:Description rdf:about="http://papyri.info/hgv/4760">
         <foaf:topic rdf:resource="http://papyri.info/hgv/4760/source"/>
      </rdf:Description>
   </foaf:page>
</rdf:Description>

(files are identical: vimdiff ~/Desktop/papyri_rdf.xml ~/Desktop/dclp_rdf.xml)

[4] Compare RDFs as produced by the numbers server

→ different xpath hierarchy

Edelweiss commented 6 years ago

blocker for #346

Edelweiss commented 6 years ago

Hugh in an e-mail:

It looks to me like there’s no difference in content. My guess is that your more-recent version of Jena is just serializing RDF XML differently, but it’s the same RDF. The takeaway is that we shouldn’t be using XPath to parse RDF because it’s the wrong tool for the job. Honestly, I’m surprised we got away with it as long as we did. I can suggest two possible solutions:

1) (quick and dirty) Just rewrite the XPaths in lib/numbers_rdf.rb to the new format.
2) Use Ruby-RDF to extract the data instead. Rewrite numbers_rdf.rb to do the right things.

I’m inclined to do #2, and probably will. But if you need to just get it working very quickly, there’d be no harm in #1 as an interim solution.
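
Hugh's point that the same RDF can be serialized in different XML shapes can be illustrated with a small sketch. The two documents below are assumptions made for illustration (the issue does not show the actual numbers-server output): one nests the `rdf:Description` for the Trismegistos resource inside `dct:relation`, as in the HGV output above, and the other flattens it. The hand-rolled `triples` walker is hypothetical and covers only this tiny subset of RDF/XML; a real fix would use an RDF library such as Ruby-RDF, as suggested in the e-mail.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDF_ABOUT = "{%s}about" % RDF_NS
RDF_RESOURCE = "{%s}resource" % RDF_NS

HEADER = ('<rdf:RDF xmlns:rdf="%s" xmlns:dct="http://purl.org/dc/terms/">' % RDF_NS)

# Nested form: the related resource is described inline.
NESTED = HEADER + """
  <rdf:Description rdf:about="http://papyri.info/hgv/4760/source">
    <dct:relation>
      <rdf:Description rdf:about="http://papyri.info/trismegistos/4760">
        <dct:relation rdf:resource="http://papyri.info/hgv/4760/source"/>
      </rdf:Description>
    </dct:relation>
  </rdf:Description>
</rdf:RDF>"""

# Flat form (assumed for illustration): same statements, different XML hierarchy.
FLAT = HEADER + """
  <rdf:Description rdf:about="http://papyri.info/hgv/4760/source">
    <dct:relation rdf:resource="http://papyri.info/trismegistos/4760"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://papyri.info/trismegistos/4760">
    <dct:relation rdf:resource="http://papyri.info/hgv/4760/source"/>
  </rdf:Description>
</rdf:RDF>"""


def triples(rdf_xml):
    """Collect (subject, predicate, object) tuples, ignoring how they are nested."""
    found = set()

    def visit(node):
        subject = node.get(RDF_ABOUT)
        for prop in node:
            resource = prop.get(RDF_RESOURCE)
            if resource is not None:
                # Object given by reference.
                found.add((subject, prop.tag, resource))
            elif len(prop) > 0:
                # Object described inline: record the link, then descend.
                for child in prop:
                    found.add((subject, prop.tag, child.get(RDF_ABOUT)))
                    visit(child)
            else:
                # Plain literal value.
                found.add((subject, prop.tag, (prop.text or "").strip()))

    for node in ET.fromstring(rdf_xml):
        visit(node)
    return found


print(triples(NESTED) == triples(FLAT))
```

An XPath written against one of these shapes fails on the other even though both carry the same two statements, whereas a comparison at the triple level is unaffected.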

Edelweiss commented 6 years ago

https://github.com/DCLP/sosol/tree/issue330

Edelweiss commented 6 years ago

»replaces« constraint remains unclear (last part of the xpath)

/rdf:RDF/rdf:Description[@rdf:about='http://#{identifier}/source']/dcterms:relation/
@rdf:resource[not(. =//dcterms:replaces/@rdf:resource)]

I would have expected one of the following two examples to come along with a »replaces« tag as one file replaces the other

http://papyri.info/ddbdp/bgu;1;1/rdf
http://papyri.info/ddbdp/p.louvre;1;4/rdf

as defined in the reprint clause

<ref n="p.louvre;1;4" type="reprint-in">P.Louvre 1.4</ref>

Here is where the »replaces« tag is written

https://github.com/DCLP/navigator/blob/master/pn-mapping/xslt/dclp-rdf.xsl#L78

Edelweiss commented 6 years ago

reprint definition taken from P. Louvre I 4

<body>
<head n="11853" xml:lang="en">
<date>AD -166</date>
<placeName>Soknopaiou Nesos</placeName>
<ref n="bgu;1;1|bgu;1;337|chr.wilck;;92" type="reprint-from">Chrest.Wilck. 92, BGU 1 337 (col 1 only), BGU 1 1 (col 2 only)</ref>
</head>
…
</body>

Edelweiss commented 6 years ago

Example P. Louvre I 4, which is a reprint from BGU I 1 and various other publications (bgu;1;1|bgu;1;337|chr.wilck;;92):

Even though the reprint information is in the EpiDoc file

https://github.com/DCLP/idp.data/blob/master/DDB_EpiDoc_XML/p.louvre/p.louvre.1/p.louvre.1.4.xml#L58

and even though the xslt obviously picks up the information

https://github.com/DCLP/navigator/blob/master/pn-mapping/xslt/dclp-rdf.xsl#L75

the reprint definition doesn’t appear in the final RDF

http://papyri.info/ddbdp/p.louvre;1;4/rdf

In the RDF there’s no connection whatsoever to BGU I 1 and the other publications.

Edelweiss commented 6 years ago

But the »replaces« relations are there in the xml code generated by ddbdp-rdf.xsl

   <dct:replaces rdf:resource="http://papyri.info/ddbdp/bgu;1;1/source"/>
   <dct:replaces rdf:resource="http://papyri.info/ddbdp/bgu;1;337/source"/>
   <dct:replaces rdf:resource="http://papyri.info/ddbdp/chr.wilck;;92/source"/>

I therefore consider the »replaces« constraint in the XPath obsolete and will omit it
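
For reference, the relationship between the `reprint-from` value quoted earlier and the three `dct:replaces` resources can be sketched as follows. This is only an illustration inferred from the two snippets in this thread, not the logic of the XSLT itself; the function name `replaces_uris` and the `base` parameter are hypothetical.

```python
def replaces_uris(ref_n, base="http://papyri.info/ddbdp/"):
    """Split a pipe-separated ref/@n value and build one /source URI per part."""
    return ["%s%s/source" % (base, part) for part in ref_n.split("|") if part]


# The @n value of the reprint-from <ref> in P. Louvre I 4.
for uri in replaces_uris("bgu;1;1|bgu;1;337|chr.wilck;;92"):
    print(uri)
```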