ajs6f / fcrepo3-rdf-extractor

A utility to extract RDF triples from Fedora Commons 3 Akubra-based persistence stores.
Other
0 stars 2 forks source link

Exception Caused by: javax.xml.stream.XMLStreamException: The namespace URI "XXXXXXXXX" has not been bound to a prefix. #5

Open ajs6f opened 6 years ago

ajs6f commented 6 years ago

When RDF/XML includes a namespace without a prefix, a la:

<rdf:RDF xmlns:islandora="http://islandora.ca/ontology/relsint#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="info:fedora/uofm:2211108/JP2">
    <width xmlns="http://islandora.ca/ontology/relsext#">4478</width>
    <height xmlns="http://islandora.ca/ontology/relsext#">6762</height>
  </rdf:Description>
</rdf:RDF>

the triples derived from elements like width are being dropped with a message similar to the title of this ticket.

ajs6f commented 6 years ago

@whikloj, can you send me a generous portion of log? I ask because I can reproduce this error up to the fact that the triple is dropped, but I cannot get any error message or exception. I'm just seeing silent dropping.

ajs6f commented 6 years ago

Well, this is embarrassing. I have to take back my immediately previous comment-- when I adjust my tests correctly, I can see now that I'm not actually seeing the triples dropped. I thought I was, but it was actually the case that I hadn't updated the test rubrics! 🤷‍♂️

So that leaves us with your example. Can you give me some FOXML that shows this happening, and also the logs section?

Additionally, can you please make sure that you are building from the latest source-- I just made a commit that updated the version of xerces in use and that might be quite relevant to why you are seeing this and I am not.

whikloj commented 6 years ago

Okay I take that back, I forgot to switch to master. Which I did and rebuild (mvn clean package) and ran again.

[whikloj@jujo]~% java -jar /opt/fcrepo3-rdf-extractor/target/fcrepo3-rdf-extractor-0.0.1-SNAPSHOT.jar -a /usr/local/fedora/server/config/spring/akubra-llstore.xml -g 'ca.umanitoba.dam.fedora#ri' -o /var/indexes/triples --logback rdf-extractor-logback.xml uofm:2652608
Exception in thread "main" java.lang.NoClassDefFoundError: com.google.common.base.Function
    at org.akubraproject.map.IdMappingBlobStore.openConnection(IdMappingBlobStore.java:61)
    at edu.si.fcrepo.Extract.lambda$init$0(Extract.java:210)
    at edu.si.fcrepo.Extract$$Lambda$20.00000000DC9AC560.call(Unknown Source)
    at edu.si.fcrepo.UnsafeIO.unsafeIO(UnsafeIO.java:44)
    at edu.si.fcrepo.Extract.init(Extract.java:210)
    at edu.si.fcrepo.Extract.main(Extract.java:166)
Caused by: java.lang.ClassNotFoundException: com.google.common.base.Function
    at java.net.URLClassLoader.findClass(URLClassLoader.java:609)
    at java.lang.ClassLoader.loadClassHelper(ClassLoader.java:924)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:869)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:329)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:852)
    ... 6 more

Do you know what am I doing wrong..

ajs6f commented 6 years ago

Yep, I know exactly what you did wrong-- nothing.

I goofed up with that last commit changing dependencies-- I did more than I should, or less than I should, depending on your POV. I just added a commit that fixes my mistake, so can you (SORRY!) please try again?

whikloj commented 6 years ago

Ok that sent me back to the previous (now deleted) comment. Essentially the extractor can't find any (I tried 10) objects, it appears like all the hashes are different.

Error log is

INFO 2018-04-02 08:37:53.739 [main] (Extract) Using 4 threads for extraction and a queue size of 1048576.
INFO 2018-04-02 08:37:53.742 [main] (Extract) with Akubra configuration from /usr/local/fedora/server/config/spring/akubra-llstore.xml.
INFO 2018-04-02 08:37:53.916 [main] (Extract) Beginning extraction.
INFO 2018-04-02 08:37:53.918 [pool-2-thread-1] (ObjectProcessor) Operating on object URI: uofm:2652608
INFO 2018-04-02 08:37:53.918 [pool-2-thread-2] (ObjectProcessor) Operating on object URI: uofm:1653091
INFO 2018-04-02 08:37:53.918 [pool-2-thread-3] (ObjectProcessor) Operating on object URI: uofm:2108043
INFO 2018-04-02 08:37:53.919 [pool-2-thread-4] (ObjectProcessor) Operating on object URI: uofm:2457369
ERROR 2018-04-02 08:37:53.925 [pool-2-thread-1] (ObjectProcessor) Error reading from object: uofm:2652608
org.akubraproject.MissingBlobException: (Missing blob with id = 'file:bc/uofm%3A2652608')
        at org.akubraproject.fs.FSBlob.openInputStream(FSBlob.java:100)
        at org.akubraproject.impl.BlobWrapper.openInputStream(BlobWrapper.java:93)
        at edu.si.fcrepo.ObjectProcessor.accept(ObjectProcessor.java:135)
        at edu.si.fcrepo.Extract.lambda$null$3(Extract.java:240)
        at edu.si.fcrepo.Extract$$Lambda$25.00000000289A9230.run(Unknown Source)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:522)
        at java.util.concurrent.FutureTask.run(FutureTask.java:277)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1160)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.lang.Thread.run(Thread.java:811)
ERROR 2018-04-02 08:37:53.925 [pool-2-thread-4] (ObjectProcessor) Error reading from object: uofm:2457369
org.akubraproject.MissingBlobException: (Missing blob with id = 'file:d0/uofm%3A2457369')
        at org.akubraproject.fs.FSBlob.openInputStream(FSBlob.java:100)
        at org.akubraproject.impl.BlobWrapper.openInputStream(BlobWrapper.java:93)
        at edu.si.fcrepo.ObjectProcessor.accept(ObjectProcessor.java:135)
        at edu.si.fcrepo.Extract.lambda$null$3(Extract.java:240)
        at edu.si.fcrepo.Extract$$Lambda$25.00000000289A9230.run(Unknown Source)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:522)
        at java.util.concurrent.FutureTask.run(FutureTask.java:277)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1160)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.lang.Thread.run(Thread.java:811)
INFO 2018-04-02 08:37:53.925 [pool-2-thread-1] (ObjectProcessor) Operating on object URI: uofm:1801635
INFO 2018-04-02 08:37:53.925 [pool-2-thread-4] (ObjectProcessor) Operating on object URI: uofm:1360961
ERROR 2018-04-02 08:37:53.925 [pool-2-thread-2] (ObjectProcessor) Error reading from object: uofm:1653091
org.akubraproject.MissingBlobException: (Missing blob with id = 'file:dc/uofm%3A1653091')
        at org.akubraproject.fs.FSBlob.openInputStream(FSBlob.java:100)
        at org.akubraproject.impl.BlobWrapper.openInputStream(BlobWrapper.java:93)
        at edu.si.fcrepo.ObjectProcessor.accept(ObjectProcessor.java:135)
        at edu.si.fcrepo.Extract.lambda$null$3(Extract.java:240)
        at edu.si.fcrepo.Extract$$Lambda$25.00000000289A9230.run(Unknown Source)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:522)
        at java.util.concurrent.FutureTask.run(FutureTask.java:277)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1160)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.lang.Thread.run(Thread.java:811)
INFO 2018-04-02 08:37:53.925 [pool-2-thread-2] (ObjectProcessor) Operating on object URI: uofm:2797288
ERROR 2018-04-02 08:37:53.926 [pool-2-thread-3] (ObjectProcessor) Error reading from object: uofm:2108043
org.akubraproject.MissingBlobException: (Missing blob with id = 'file:e9/uofm%3A2108043')
        at org.akubraproject.fs.FSBlob.openInputStream(FSBlob.java:100)
        at org.akubraproject.impl.BlobWrapper.openInputStream(BlobWrapper.java:93)
        at edu.si.fcrepo.ObjectProcessor.accept(ObjectProcessor.java:135)
        at edu.si.fcrepo.Extract.lambda$null$3(Extract.java:240)
        at edu.si.fcrepo.Extract$$Lambda$25.00000000289A9230.run(Unknown Source)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:522)
        at java.util.concurrent.FutureTask.run(FutureTask.java:277)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1160)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.lang.Thread.run(Thread.java:811)
INFO 2018-04-02 08:37:53.926 [pool-2-thread-3] (ObjectProcessor) Operating on object URI: uofm:2736579
ERROR 2018-04-02 08:37:53.928 [pool-2-thread-4] (ObjectProcessor) Error reading from object: uofm:1360961
org.akubraproject.MissingBlobException: (Missing blob with id = 'file:dd/uofm%3A1360961')
        at org.akubraproject.fs.FSBlob.openInputStream(FSBlob.java:100)
        at org.akubraproject.impl.BlobWrapper.openInputStream(BlobWrapper.java:93)
        at edu.si.fcrepo.ObjectProcessor.accept(ObjectProcessor.java:135)
        at edu.si.fcrepo.Extract.lambda$null$3(Extract.java:240)
        at edu.si.fcrepo.Extract$$Lambda$25.00000000289A9230.run(Unknown Source)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:522)
        at java.util.concurrent.FutureTask.run(FutureTask.java:277)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1160)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.lang.Thread.run(Thread.java:811)
INFO 2018-04-02 08:37:53.928 [pool-2-thread-4] (ObjectProcessor) Operating on object URI: uofm:2116284
ERROR 2018-04-02 08:37:53.928 [pool-2-thread-1] (ObjectProcessor) Error reading from object: uofm:1801635
org.akubraproject.MissingBlobException: (Missing blob with id = 'file:d8/uofm%3A1801635')
        at org.akubraproject.fs.FSBlob.openInputStream(FSBlob.java:100)
        at org.akubraproject.impl.BlobWrapper.openInputStream(BlobWrapper.java:93)
        at edu.si.fcrepo.ObjectProcessor.accept(ObjectProcessor.java:135)

... repeat errors for all 10 PIDs

INFO 2018-04-02 08:37:53.935 [main] (Extract) Reached 10 objects at end of objects with 0 in-queue after 10 errors.
INFO 2018-04-02 08:37:53.935 [main] (Extract) Finished extraction.

4 quad files with 0 lines in each.

ajs6f commented 6 years ago

Ah, I have a suspicion here-- when you say "all the hashes are different" do you mean that the hashes that this tool are calculating for where things should be stored on the filesystem are different from the hashes that the repo is calculating for the same purpose?

whikloj commented 6 years ago

Ugh I am a moron, I used the PIDs on the CLI as "uofm:2652608" instead of the correct "info:fedora/uofm:2652608". Now it runs and throws the javax.xml.stream.XMLStreamException: The namespace URI "http://islandora.ca/ontology/relsext#" has not been bound to a prefix. error.

whikloj commented 6 years ago

Here is the FOXML for object info:fedora/uofm:2457369 which failed above.

ajs6f commented 6 years ago

Let's establish one thing clearly: you are not a moron. You are simply a Fedora user. It's easy to look like a moron while using Fedora. I do it weekly. It says a lot more about Fedora than about any of its users.

But now we've got a real bite on the problem. It's clearly an XML processing problem. The weird thing that I have seen and read about this kind of message, but only while writing XML. I have never seen it turn up while reading it. In any event, time for me to dig in a bit on this. Further bulletins as events warrant.

ajs6f commented 6 years ago

@whikloj -- can you add here the log for when the tool hits this object? Or gist it somewhere, whatever is convenient. Don't bother excerpting, just paste the whole dang thing, if you please. Thanks!

whikloj commented 6 years ago

The section of the log is quite simply.

ERROR 2018-04-02 09:11:37.195 [pool-2-thread-4] (ObjectProcessor) Error reading from object: info:fedora/uofm:2736579
java.io.IOException: Error parsing foxml:xmlContent
        at com.github.cwilper.fcrepo.dto.foxml.FOXMLReader.readXMLContent(FOXMLReader.java:211)
        at com.github.cwilper.fcrepo.dto.foxml.FOXMLReader.readDatastreamVersion(FOXMLReader.java:149)
        at com.github.cwilper.fcrepo.dto.foxml.FOXMLReader.readDatastream(FOXMLReader.java:119)
        at com.github.cwilper.fcrepo.dto.foxml.FOXMLReader.readObject(FOXMLReader.java:77)
        at com.github.cwilper.fcrepo.dto.foxml.FOXMLReader.readObject(FOXMLReader.java:61)
        at edu.si.fcrepo.ObjectProcessor.accept(ObjectProcessor.java:139)
        at edu.si.fcrepo.Extract.lambda$null$3(Extract.java:240)
        at edu.si.fcrepo.Extract$$Lambda$25.000000002C9A91A0.run(Unknown Source)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:522)
        at java.util.concurrent.FutureTask.run(FutureTask.java:277)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1160)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.lang.Thread.run(Thread.java:811)
Caused by: javax.xml.stream.XMLStreamException: The namespace URI "http://islandora.ca/ontology/relsext#" has not been bound to a prefix.
        at com.ibm.xml.xlxp.api.stax.msg.StAXMessageProvider.throwXMLStreamException(StAXMessageProvider.java:59)
        at com.ibm.xml.xlxp.api.stax.XMLStreamWriterBase.writeStartElement(XMLStreamWriterBase.java:357)
        at com.ibm.xml.xlxp.api.stax.XMLOutputFactoryImpl$XMLStreamWriterProxy.writeStartElement(XMLOutputFactoryImpl.java:120)
        at javanet.staxutils.io.XMLWriterUtils.writeStartElement(XMLWriterUtils.java:1254)
        at javanet.staxutils.XMLStreamEventWriter.sendEvent(XMLStreamEventWriter.java:121)
        at javanet.staxutils.BaseXMLEventWriter.add(BaseXMLEventWriter.java:519)
        at javanet.staxutils.BaseXMLEventWriter.add(BaseXMLEventWriter.java:527)
        at javanet.staxutils.XMLStreamUtils.copy(XMLStreamUtils.java:404)
        at com.github.cwilper.fcrepo.dto.core.io.XMLUtil.copy(XMLUtil.java:233)
        at com.github.cwilper.fcrepo.dto.core.io.XMLUtil.copy(XMLUtil.java:206)
        at com.github.cwilper.fcrepo.dto.foxml.FOXMLReader.readXMLContent(FOXMLReader.java:208)
        ... 12 common frames omitted

I could turn up the logging level if you'd like?

ajs6f commented 6 years ago

No, that's good, that's the info I was looking for. The crucial lines are:

com.ibm.xml.xlxp.api.stax.msg.StAXMessageProvider.throwXMLStreamException(StAXMessageProvider.java:59)

and the lines below it. They tell me that this isn't my code, it's @cwilper's FOXML-reading code from a million years ago.

So that makes me feel better. And it's worse, because now I have to go into that (I hope totally unmaintained, because I'm sure Chris has better things to do at this point in his career) code and find the problem and then find a way of fixing it. Hopefully (crosses-fingers) I can set a system property or do something else to control the behavior of that library's XML-reading code. If not... I get out the mattock and the spade.

ajs6f commented 6 years ago

And, now I know why I've only ever seen this in writing XML but it's showing up here. Looks like @cwilper is copying all the XML around. See around:

at com.github.cwilper.fcrepo.dto.core.io.XMLUtil.copy(XMLUtil.java:233)

Ok, where is my pickaxe?

ajs6f commented 6 years ago

Bad news, @whikloj -- I can't reproduce. Using your FOXML and master, I get a fine run and the quads:

<info:fedora/uofm:2457369/JP2> <http://islandora.ca/ontology/relsext#width> "4619" <#ri> .
<info:fedora/uofm:2457369/JP2> <http://islandora.ca/ontology/relsext#height> "7150" <#ri> .

appear with no problem...

ajs6f commented 6 years ago

Can you tell me a bit about your Java setup? What version, and did you put any extra libraries into play by installing them in your Java home itself? For example, did you install any XML processing libraries there?

whikloj commented 6 years ago

Not that I know of, maybe too old of a version?

[whikloj@jujo]~% java -version
java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 8.0.5.5 - pxa6480sr5fp5-20171114_01(SR5 FP5))
IBM J9 VM (build 2.9, JRE 1.8.0 Linux amd64-64 Compressed References 20171102_369060 (JIT enabled, AOT enabled)
OpenJ9   - 7ade437
OMR      - 1b656cb
IBM      - 59c3d96)
JCL - 20171113_01 based on Oracle jdk8u151-b12
ajs6f commented 6 years ago

No, actually I suspect the fact that it is an IBM JVM. See here. Can you possibly try using another JVM? (E.g. OpenJDK or really, anything else, just to get the comparison?)

whikloj commented 6 years ago

@ajs6f for the win.

[whikloj@jujo]~% java -version
java version "1.8.0_162"
Java(TM) SE Runtime Environment (build 1.8.0_162-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.162-b12, mixed mode)
INFO 2018-04-02 11:23:28.452 [main] (Extract) Using 4 threads for extraction and a queue size of 1048576.
INFO 2018-04-02 11:23:28.454 [main] (Extract) with Akubra configuration from /usr/local/fedora/server/config/spring/akubra-llstore.xml.
INFO 2018-04-02 11:23:28.613 [main] (Extract) Beginning extraction.
INFO 2018-04-02 11:23:28.614 [pool-2-thread-2] (ObjectProcessor) Operating on object URI: info:fedora/uofm:1653091
INFO 2018-04-02 11:23:28.614 [pool-2-thread-1] (ObjectProcessor) Operating on object URI: info:fedora/uofm:2652608
INFO 2018-04-02 11:23:28.629 [pool-2-thread-4] (ObjectProcessor) Operating on object URI: info:fedora/uofm:2457369
INFO 2018-04-02 11:23:28.629 [pool-2-thread-3] (ObjectProcessor) Operating on object URI: info:fedora/uofm:2108043
INFO 2018-04-02 11:23:29.029 [pool-2-thread-2] (ObjectProcessor) Operating on object URI: info:fedora/uofm:1801635
INFO 2018-04-02 11:23:29.032 [pool-2-thread-3] (ObjectProcessor) Operating on object URI: info:fedora/uofm:1360961
INFO 2018-04-02 11:23:29.035 [pool-2-thread-4] (ObjectProcessor) Operating on object URI: info:fedora/uofm:2797288
INFO 2018-04-02 11:23:29.041 [pool-2-thread-1] (ObjectProcessor) Operating on object URI: info:fedora/uofm:2736579
INFO 2018-04-02 11:23:29.101 [pool-2-thread-3] (ObjectProcessor) Operating on object URI: info:fedora/uofm:2116284
INFO 2018-04-02 11:23:29.113 [pool-2-thread-2] (ObjectProcessor) Operating on object URI: info:fedora/uofm:1374689
INFO 2018-04-02 11:23:29.187 [main] (Extract) Reached 10 objects at end of objects with 0 in-queue after 0 errors.
INFO 2018-04-02 11:23:29.187 [main] (Extract) Finished extraction.
ajs6f commented 6 years ago

That was it? If that was it I will try adding a commit that will do what was described in that Stackoverflow post. Then you can run with the JVM you were using.

whikloj commented 6 years ago

That was all I changed, I even had this JVM installed just needed to select it with update-alternatives

ajs6f commented 6 years ago

k, I just added a dependency and setting for using Woodstox. If/when you have time, please try it out again with the IBM JVM. If it works, I reckon we can close this and move on the next whack-a-mole.