ajs6f / fcrepo3-rdf-extractor

A utility to extract RDF triples from Fedora Commons 3 Akubra-based persistence stores.

Doesn't like my config files. #1

Closed whikloj closed 7 years ago

whikloj commented 7 years ago

Tried to use this but am having trouble with my Akubra storage config file.

[whikloj@juno]/opt/fcrepo3-rdf-extractor% java -jar target/fcrepo3-rdf-extractor-0.0.1-SNAPSHOT.jar -a /usr/local/fedora/server/config/spring/akubra-llstore.xml -o /local/dam/staging/jareds_triples/juno_20161121.sparql
INFO 15:50:26.012 (edu.si.fcrepo.Extract) Using 4 threads for extraction and a queue size of 1048576.
INFO 15:50:26.018 (edu.si.fcrepo.Extract) Extracting to /local/dam/staging/jareds_triples/juno_20161121.sparql...
INFO 15:50:26.018 (edu.si.fcrepo.Extract) with Akubra configuration from /usr/local/fedora/server/config/spring/akubra-llstore.xml.
INFO 15:50:26.082 (org.springframework.context.support.FileSystemXmlApplicationContext) Refreshing org.springframework.context.support.FileSystemXmlApplicationContext@97e93f1: startup date [Mon Nov 21 15:50:26 GMT-06:00 2016]; root of context hierarchy
INFO 15:50:26.111 (org.springframework.beans.factory.xml.XmlBeanDefinitionReader) Loading XML bean definitions from URL [file:/usr/local/fedora/server/config/spring/akubra-llstore.xml]
INFO 15:50:26.169 (org.springframework.beans.factory.support.DefaultListableBeanFactory) Pre-instantiating singletons in org.springframework.beans.factory.support.DefaultListableBeanFactory@4fa1c212: defining beans [org.fcrepo.server.storage.lowlevel.ILowlevelStorage,org.fcrepo.server.storage.lowlevel.akubra.AkubraLowlevelStorage,objectStore,fsObjectStore,fsObjectStoreMapper,datastreamStore,fsDatastreamStore,fsDatastreamStoreMapper,fedoraStorageHintProvider]; root of factory hierarchy
INFO 15:50:26.170 (org.springframework.beans.factory.support.DefaultListableBeanFactory) Destroying singletons in org.springframework.beans.factory.support.DefaultListableBeanFactory@4fa1c212: defining beans [org.fcrepo.server.storage.lowlevel.ILowlevelStorage,org.fcrepo.server.storage.lowlevel.akubra.AkubraLowlevelStorage,objectStore,fsObjectStore,fsObjectStoreMapper,datastreamStore,fsDatastreamStore,fsDatastreamStoreMapper,fedoraStorageHintProvider]; root of factory hierarchy
Exception in thread "main" org.springframework.beans.factory.CannotLoadBeanClassException: Cannot find class [org.fcrepo.server.storage.lowlevel.akubra.AkubraLowlevelStorageModule] for bean with name 'org.fcrepo.server.storage.lowlevel.ILowlevelStorage' defined in URL [file:/usr/local/fedora/server/config/spring/akubra-llstore.xml]; nested exception is java.lang.ClassNotFoundException: org.fcrepo.server.storage.lowlevel.akubra.AkubraLowlevelStorageModule
        at org.springframework.beans.factory.support.AbstractBeanFactory.resolveBeanClass(AbstractBeanFactory.java:1261)
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.predictBeanType(AbstractAutowireCapableBeanFactory.java:575)
        at org.springframework.beans.factory.support.AbstractBeanFactory.isFactoryBean(AbstractBeanFactory.java:1330)
        at org.springframework.beans.factory.support.AbstractBeanFactory.isFactoryBean(AbstractBeanFactory.java:896)
        at org.springframework.beans.factory.support.DefaultListableBeanFactory.preInstantiateSingletons(DefaultListableBeanFactory.java:566)
        at org.springframework.context.support.AbstractApplicationContext.finishBeanFactoryInitialization(AbstractApplicationContext.java:895)
        at org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:425)
        at org.springframework.context.support.FileSystemXmlApplicationContext.<init>(FileSystemXmlApplicationContext.java:140)
        at org.springframework.context.support.FileSystemXmlApplicationContext.<init>(FileSystemXmlApplicationContext.java:84)
        at edu.si.fcrepo.Extract.init(Extract.java:194)
        at edu.si.fcrepo.Extract.main(Extract.java:157)
Caused by: java.lang.ClassNotFoundException: org.fcrepo.server.storage.lowlevel.akubra.AkubraLowlevelStorageModule
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at org.springframework.util.ClassUtils.forName(ClassUtils.java:257)
        at org.springframework.beans.factory.support.AbstractBeanDefinition.resolveBeanClass(AbstractBeanDefinition.java:408)
        at org.springframework.beans.factory.support.AbstractBeanFactory.doResolveBeanClass(AbstractBeanFactory.java:1282)
        at org.springframework.beans.factory.support.AbstractBeanFactory.resolveBeanClass(AbstractBeanFactory.java:1253)
        ... 10 more
[whikloj@juno]/opt/fcrepo3-rdf-extractor% 

My akubra-llstore.xml -> https://gist.github.com/whikloj/584dea271c6e872e4b3d574676781bcc

ajs6f commented 7 years ago

@whikloj, this is a semi-known issue. The extractor avoids pulling in the gargantuan Fedora server classpath (and the classpath problems that come with it), so beans that need those server classes cannot be loaded. I know the problem and have a fix, which I will take care of sometime in the next few days (Thanksgiving holiday). In the meantime, there is a workaround that @ruebot knows, or I can help you with it via IRC in the next day or so. It involves removing all of the org.fcrepo.server.storage beans from a copy of your Akubra config. It will end up looking somewhat like this but with different ID mappers.

whikloj commented 7 years ago

Ok, thanks. I'll bug @ruebot about it tomorrow. No rush.

ruebot commented 7 years ago

I got yo back!

/md1200/vol1/fedora_data is your shibboleth.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE beans PUBLIC "-//SPRING//DTD BEAN//EN" "http://www.springframework.org/dtd/spring-beans.dtd">
<beans>

  <bean name="objectStore" class="org.akubraproject.map.IdMappingBlobStore"
    singleton="true">
    <constructor-arg value="urn:example.org:objectStore" />
    <constructor-arg>
      <ref bean="fsObjectStore" />
    </constructor-arg>
    <constructor-arg>
      <ref bean="fsObjectStoreMapper" />
    </constructor-arg>
  </bean>

  <bean name="fsObjectStore" class="org.akubraproject.fs.FSBlobStore"
    singleton="true">
    <constructor-arg value="urn:example.org:fsObjectStore" />
    <constructor-arg value="/md1200/vol1/fedora_data/objectStore"/>
  </bean>

  <bean name="fsObjectStoreMapper"
    class="org.fcrepo.server.storage.lowlevel.akubra.HashPathIdMapper"
    singleton="true">
    <constructor-arg value="##" />
  </bean>

  <bean name="datastreamStore" class="org.akubraproject.map.IdMappingBlobStore"
    singleton="true">
    <constructor-arg value="urn:fedora:datastreamStore" />
    <constructor-arg>
      <ref bean="fsDatastreamStore" />
    </constructor-arg>
    <constructor-arg>
      <ref bean="fsDatastreamStoreMapper" />
    </constructor-arg>
  </bean>

  <bean name="fsDatastreamStore" class="org.akubraproject.fs.FSBlobStore"
    singleton="true">
    <constructor-arg value="urn:example.org:fsDatastreamStore" />
    <constructor-arg value="/md1200/vol1/fedora_data/datastreamStore"/>
  </bean>

  <bean name="fsDatastreamStoreMapper"
    class="org.fcrepo.server.storage.lowlevel.akubra.HashPathIdMapper"
    singleton="true">
    <constructor-arg value="##" />
  </bean>

</beans>

ruebot commented 7 years ago

...and @whikloj

This is what I used:

java -jar fcrepo3-rdf-extractor-0.0.1-SNAPSHOT.jar -a /usr/local/fedora/server/config/spring/akubra-llstore.xml -o /md1200/vol1/backup/yudl_triples.n3 2>&1 | tee ~/hotindexer.log

...and you'll want to use the --graph option as well, and specify a URI for <#ri> which could be something like info:edu.si.fedora#ri according to @ajs6f

ajs6f commented 7 years ago

If you are expecting to use this with the trippi-sparql connector, then the simplest thing to do graphname-wise is exactly what @ruebot writes. I need to document what's going on there better. (Short story: the <#ri> URI that Fedora uses by default is relative-- it's illegal to have a relative URI in that slot, but that wasn't clear years ago.)

ajs6f commented 7 years ago

@whikloj I have a much simpler workaround to try: please add the single attribute default-lazy-init="true" to the top-level beans element in your Akubra config file. This will work fine with both your repo and the hot indexer.
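For readers following along, the attribute change described above could be applied to a copy of the config with a one-line sed. This is only a sketch: the function name add_lazy_init and the output path are made up for illustration, and it assumes the opening tag is written exactly as `<beans>` with no attributes (as in @ruebot's example); a config using XML namespaces would need editing by hand.

```shell
#!/bin/sh
# Hypothetical helper (not part of fcrepo3-rdf-extractor): write a copy of an
# Akubra Spring config whose top-level <beans> element carries
# default-lazy-init="true".
# Assumption: the opening tag is exactly "<beans>" with no attributes.
add_lazy_init() {
  # $1 = original config, $2 = path for the modified copy
  sed 's/<beans>/<beans default-lazy-init="true">/' "$1" > "$2"
}

# Example call (the output path is a placeholder):
# add_lazy_init /usr/local/fedora/server/config/spring/akubra-llstore.xml /tmp/akubra-llstore-lazy.xml
```

The closing `</beans>` tag is left alone because the pattern only matches the opening tag.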

whikloj commented 7 years ago

Cool, I'm just rebuilding some derivatives and then I'll give this a try.

whikloj commented 7 years ago

Ok, so the problem in my akubra-llstore.xml still exists; adding default-lazy-init="true" just hid it. I added a logback.xml with the root logger at DEBUG and got the following log: rdf-extractor.log.

I'm trying @ruebot's file example as I think the <bean name="org.fcrepo.server.storage.lowlevel.ILowlevelStorage" and <bean name="org.fcrepo.server.storage.lowlevel.akubra.AkubraLowlevelStorage" are the problem.

ajs6f commented 7 years ago

@ruebot's example should certainly work, but it is odd that the default-lazy-init="true" thing works for me and not you. Can I see your Akubra file?

ajs6f commented 7 years ago

Wait, I think you are wrong-- it is not failing, because you are getting to here. I think you are fine. You are just seeing warnings, not errors. Are you getting triples?

whikloj commented 7 years ago

@ajs6f TRIPLES!!!!

ajs6f commented 7 years ago

I'll see what I can do to hide those annoying and confusing stacktraces. Meanwhile, enjoy your Usan Thanksgiving triples.

whikloj commented 7 years ago

So this is working with @ruebot's modified akubra-llstore.xml. When it's done I'll try running it against my original file to see if I just needed to wait a little bit more for it to start processing.

whikloj commented 7 years ago

My run finally completed; I will try to start it again using the original akubra-llstore.xml.

My quad file contains 121,819,261 lines (or quads), but a count query over my entire Mulgara returns 125,262,308, which leaves 3,443,047 not accounted for.
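As a quick check of the arithmetic above, the gap can be recomputed in the shell. The function name quad_gap is invented here; the two numbers are the counts quoted in this comment, and the `wc -l` line shows how the file-side count could be obtained (one quad per line is assumed).

```shell
#!/bin/sh
# Hypothetical helper: difference between the Mulgara count and the number of
# quads in the extractor output.
quad_gap() {
  # $1 = quads in the extractor output, $2 = triples counted in Mulgara
  echo $(( $2 - $1 ))
}

# Counting lines in the quad file (path taken from the original command):
# wc -l < /local/dam/staging/jareds_triples/juno_20161121.sparql

quad_gap 121819261 125262308
```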

Is it possible that there are internal triples that would not be persisted on the object in the filesystem?

ajs6f commented 7 years ago

It's not obvious that there would be any such triples. My first guess would be that some objects or datastreams weren't readable at the moment that mattered. Can you check the content of the difference by diffing the output of the hot indexer against a complete NQuads dump of Mulgara (you will need to sort them first)? I appreciate that so doing will take a lot of time and computation, but hopefully not too much? I'd like to know what the actual differences are before theorizing.
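One way to do the comparison suggested here without loading either file into memory is `sort` followed by `comm`, which streams through two sorted inputs. The sketch below is an assumption about how one might do it, not what was actually run in this thread; the function name only_in_second is made up, and `sort -u` also collapses duplicate lines, which may or may not be wanted.

```shell
#!/bin/sh
# Hypothetical helper: print the lines that appear only in the second file.
# Both inputs are sorted in the C locale first so that sort and comm agree
# on ordering; sort spills to temporary files for inputs larger than memory.
only_in_second() {
  # $1 = extractor output (N-Quads), $2 = Mulgara dump (N-Quads)
  ois_a=$(mktemp)
  ois_b=$(mktemp)
  LC_ALL=C sort -u "$1" > "$ois_a"
  LC_ALL=C sort -u "$2" > "$ois_b"
  # -1 drops lines unique to the first file, -3 drops lines common to both
  LC_ALL=C comm -13 "$ois_a" "$ois_b"
  rm -f "$ois_a" "$ois_b"
}

# Example call (file names are placeholders):
# only_in_second extractor.nq mulgara.nq > only_in_mulgara.nq
```

Swapping the two arguments lists the quads that only the extractor produced.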

whikloj commented 7 years ago

I'm not sure I can get NQuads from Mulgara...checking into that.

But you are right: it took a little while, but using default-lazy-init="true" in my normal akubra-llstore.xml did work.

ajs6f commented 7 years ago

Okay, to the latter, good, I will update the README to that effect and it will doubtless help others.

To the former, you can always dump NTriples out of <#ri> and use shell commands to add the fourth field.
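The "add the fourth field" step could look roughly like the sed filter below. It is a sketch under assumptions: the function name ntriples_to_nquads is invented, every statement is assumed to sit on one line ending in " .", and the graph URI must not contain `|` or `&` (both are special to this sed expression).

```shell
#!/bin/sh
# Hypothetical helper: turn N-Triples on stdin into N-Quads on stdout by
# inserting a graph name before the terminating " ." of each line.
ntriples_to_nquads() {
  # $1 = graph URI without angle brackets, e.g. info:edu.si.fedora#ri
  sed "s|[[:space:]]*\.[[:space:]]*\$| <$1> .|"
}

# Example call (file names are placeholders):
# ntriples_to_nquads 'info:edu.si.fedora#ri' < mulgara.nt > mulgara.nq
```

Only the final dot of a line is rewritten, so a literal such as "x." keeps its own punctuation.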

ajs6f commented 7 years ago

I am currently testing the new "avoid piling up URIs in a list" commits and I will let you know as soon as I am confident in them.

whikloj commented 7 years ago

Yeah, there are commands to do a backup of Mulgara, but they require access to the server. I'm looking in the fcrepo3 code, but I don't know whether a) Mulgara is running separately, or b) if it is, whether the server is exposed at all.

I wanted to try this, and I got the client library but I need the host:port to connect to.

I tried doing a query and I have a choice of xml and json. So it would require exporting it all, then transforming it all into n-quads, then sorting, then comparing.

So this might take some time.

ajs6f commented 7 years ago

You should be able to use a query at the /risearch endpoint to do this much more easily, and directly in the right format:

https://wiki.duraspace.org/display/FEDORA38/Resource+Index+Search#ResourceIndexSearch-ResponseFormatsresponse-formats
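As an illustration of the kind of request that page describes, the sketch below builds a /risearch URL asking for every triple as N-Triples. Treat the details as assumptions to check against the linked documentation: the base URL is a placeholder, the function name risearch_url is made up, and the parameter values (type=triples, lang=spo, format=N-Triples, and the query `* * *`) are quoted from memory of the Fedora 3 docs rather than from this thread.

```shell
#!/bin/sh
# Hypothetical helper: build a Resource Index Search URL that requests all
# triples in N-Triples format using the SPO query "* * *".
risearch_url() {
  # $1 = base URL of the Fedora 3 webapp (placeholder: http://localhost:8080/fedora)
  printf '%s/risearch?type=triples&lang=spo&format=N-Triples&query=%s\n' "$1" '*%20*%20*'
}

# Example call (credentials and output file are placeholders):
# curl -u fedoraAdmin -o mulgara.nt "$(risearch_url http://localhost:8080/fedora)"
```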

whikloj commented 7 years ago

@ajs6f++

Why is Fedora's Mulgara documentation better than Mulgara's own?! Crazy, this is working. I'll start it now.

ajs6f commented 7 years ago

Well, Fedora 3 remained under maintenance for years after Mulgara wasn't, so that's probably got somewhat to do with it.

ajs6f commented 7 years ago

Okay, @whikloj , I've committed the new streaming code. Please try it out-- it should get rid of that annoying delay before triples start arriving. Although it won't do anything for your slow storage....

whikloj commented 7 years ago

Ok, it took a bit, but I have an N-Quads file of all my triples from Mulgara. I then sorted both files (at 17GB apiece, that took some time and space).

I couldn't use diff on this machine so I'm going to try moving it to another server and execute it there.

For now I will say it is obvious there is stuff in Mulgara that is not in the rdf-extractor output.

A simple head of each sorted file shows:

[whikloj@juno]/var/indexes/triples% head juno_sorted.nq 
<info:fedora/changeme:13746/DC> <info:fedora/fedora-system:def/model#state> <info:fedora/fedora-system:def/model#Active> <info:ca.umanitoba.fedora#ri> .
<info:fedora/changeme:13746/DC> <info:fedora/fedora-system:def/view#disseminationType> <info:fedora/*/DC> <info:ca.umanitoba.fedora#ri> .
<info:fedora/changeme:13746/DC> <info:fedora/fedora-system:def/view#isVolatile> "false" <info:ca.umanitoba.fedora#ri> .
<info:fedora/changeme:13746/DC> <info:fedora/fedora-system:def/view#lastModifiedDate> "2015-06-11T15:45:32.275Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> <info:ca.umanitoba.fedora#ri> .
<info:fedora/changeme:13746/DC> <info:fedora/fedora-system:def/view#mimeType> "text/xml" <info:ca.umanitoba.fedora#ri> .
<info:fedora/changeme:13746> <http://islandora.ca/ontology/relsext#generate_ocr> "TRUE" <info:ca.umanitoba.fedora#ri> .
<info:fedora/changeme:13746> <http://islandora.ca/ontology/relsext#isPageNumber> "12" <info:ca.umanitoba.fedora#ri> .
<info:fedora/changeme:13746> <http://islandora.ca/ontology/relsext#isPageOf> <info:fedora/changeme:13745> <info:ca.umanitoba.fedora#ri> .
<info:fedora/changeme:13746> <http://islandora.ca/ontology/relsext#isSection> "1" <info:ca.umanitoba.fedora#ri> .
<info:fedora/changeme:13746> <http://islandora.ca/ontology/relsext#isSequenceNumber> "12" <info:ca.umanitoba.fedora#ri> .

versus

[whikloj@juno]/var/indexes/triples% head mulgara_sorted.nq 
<info:fedora/alan:testObject2/DC> <info:fedora/fedora-system:def/model#state> <info:fedora/fedora-system:def/model#Active> <info:ca.umanitoba.fedora#ri> .
<info:fedora/alan:testObject2/DC> <info:fedora/fedora-system:def/view#disseminationType> <info:fedora/*/DC> <info:ca.umanitoba.fedora#ri> .
<info:fedora/alan:testObject2/DC> <info:fedora/fedora-system:def/view#isVolatile> "false" <info:ca.umanitoba.fedora#ri> .
<info:fedora/alan:testObject2/DC> <info:fedora/fedora-system:def/view#lastModifiedDate> "2012-07-26T19:04:56.856Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> <info:ca.umanitoba.fedora#ri> .
<info:fedora/alan:testObject2/DC> <info:fedora/fedora-system:def/view#mimeType> "text/xml" <info:ca.umanitoba.fedora#ri> .
<info:fedora/alan:testObject2> <http://purl.org/dc/elements/1.1/identifier> "alan:testObject2" <info:ca.umanitoba.fedora#ri> .
<info:fedora/alan:testObject2> <http://purl.org/dc/elements/1.1/title> "Alan's Test Object2" <info:ca.umanitoba.fedora#ri> .
<info:fedora/alan:testObject2> <info:fedora/fedora-system:def/model#createdDate> "2012-07-26T19:04:56.856Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> <info:ca.umanitoba.fedora#ri> .
<info:fedora/alan:testObject2> <info:fedora/fedora-system:def/model#hasModel> <info:fedora/fedora-system:FedoraObject-3.0> <info:ca.umanitoba.fedora#ri> .
<info:fedora/alan:testObject2> <info:fedora/fedora-system:def/model#label> "Alan's Test Object2" <info:ca.umanitoba.fedora#ri> .

What is confusing me is where did Mulgara get <info:fedora/alan:testObject2> from? I rebuilt this index from the filesystem using the stock indexer only about 2 weeks ago. Weird.

ajs6f commented 7 years ago

Did you clean out Mulgara before reindexing into it?

ajs6f commented 7 years ago

See https://github.com/fcrepo3/fcrepo/blob/master/fcrepo-server/src/main/java/org/fcrepo/server/resourceIndex/ResourceIndexRebuilder.java#L177

ajs6f commented 7 years ago

Actually, looks like you might be okay using embedded Mulgara in particular: https://github.com/fcrepo3/fcrepo/blob/master/fcrepo-server/src/main/java/org/fcrepo/server/resourceIndex/ResourceIndexRebuilder.java#L191

ajs6f commented 7 years ago

Can you verify that the directory containing Mulgara's data was created at the datetime of your last full rebuild?

whikloj commented 7 years ago

Yeah I remember it says that it is cleaning it out and the directory was created on October 28. I thought it was more recent but that is probably correct.

whikloj commented 7 years ago

I'm scanning the objectStore for a file starting with info%3Afedora%2Falan%3A* to see if anything exists (that is easy to find).
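A scan like the one described could be done with `find`, which walks the hashed subdirectories without needing to know their layout. This is a sketch only: the function name is invented, and it assumes object files are named with the URL-encoded `info:fedora/PID` form mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: list object files for one PID namespace under an
# Akubra objectStore directory.
find_objects_by_namespace() {
  # $1 = objectStore root, $2 = PID namespace (e.g. alan)
  find "$1" -type f -name "info%3Afedora%2F$2%3A*"
}

# Example call (path taken from @ruebot's config, adjust for your own store):
# find_objects_by_namespace /md1200/vol1/fedora_data/objectStore alan
```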

I'm gonna try writing a little Python script to compare the files line by line and create a less memory-intensive (but probably more time-intensive) diff, but first I gotta do some weekend stuff. I'll check back later.

ajs6f commented 7 years ago

Well, is it a problem with the rebuilder or the hot indexer? In other words, are those extra triples actually generated from real objects, or not? E.g. is there an alan:testObject2 in the repo?
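To see which objects the unmatched quads belong to, a list of quads found only in Mulgara could be grouped by PID. The sketch below assumes such a list already exists and that subjects look like `<info:fedora/PID>` or `<info:fedora/PID/DSID>`, as in the head output earlier in the thread; the function name missing_pids is made up.

```shell
#!/bin/sh
# Hypothetical helper: read N-Quads on stdin and print "PID<TAB>count" for
# each object PID appearing in the subject position.
missing_pids() {
  # Keep only the PID part of subjects such as <info:fedora/PID> or
  # <info:fedora/PID/DSID>, then count occurrences of each PID.
  sed -n 's|^<info:fedora/\([^/>]*\)[/>].*|\1|p' \
    | LC_ALL=C sort \
    | uniq -c \
    | awk '{ print $2 "\t" $1 }'
}

# Example call (file name is a placeholder):
# missing_pids < only_in_mulgara.nq
```

Each PID in the result can then be checked against the repository to see whether a real object exists for it.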

whikloj commented 7 years ago

Yes apparently there is. Seemingly we have some very old vendor test objects in our repository.

So I should remove them, but that doesn't explain why the hot indexer didn't seem to find them.

ajs6f commented 7 years ago

No, you are right about that. I know it must be a large file, but can I get access to the log of your hot indexer run somewhere?

ajs6f commented 7 years ago

Actually, @whikloj , can you close this ticket (because we got the prob with your conf file resolved, at least to first order) and open a new one specifically about the missed objects?

whikloj commented 7 years ago

Absolutely