ajs6f / fcrepo3-rdf-extractor

A utility to extract RDF triples from Fedora Commons 3 Akubra-based persistence stores.

Missed triples #3

Closed whikloj closed 6 years ago

whikloj commented 7 years ago

A run of the rdf extractor seems to miss some triples. A full run of ours collected 121,819,261 but a count of the existing triple store showed 125,262,308.

whikloj commented 7 years ago

A preliminary view of the output from Mulgara shows extra triples from very old and small test objects. These seem to not exist on the filesystem, but I'll need to check on Monday to get a better idea what the whole record looks like.

whikloj commented 7 years ago

At first I thought these files did not exist on the filesystem, but that is incorrect. I was able to locate these files in the objectStore.

whikloj commented 7 years ago

Here is one of my objects that was not indexed:

<?xml version="1.0" encoding="UTF-8"?>
<foxml:digitalObject VERSION="1.1" PID="alan:testObject"
xmlns:foxml="info:fedora/fedora-system:def/foxml#"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="info:fedora/fedora-system:def/foxml# http://www.fedora.info/definitions/1/0/foxml1-1.xsd">
<foxml:objectProperties>
<foxml:property NAME="info:fedora/fedora-system:def/model#state" VALUE="Active"/>
<foxml:property NAME="info:fedora/fedora-system:def/model#label" VALUE="Alan Test Object"/>
<foxml:property NAME="info:fedora/fedora-system:def/model#ownerId" VALUE=""/>
<foxml:property NAME="info:fedora/fedora-system:def/model#createdDate" VALUE="2012-07-26T18:31:01.356Z"/>
<foxml:property NAME="info:fedora/fedora-system:def/view#lastModifiedDate" VALUE="2013-02-21T20:17:58.762Z"/>
</foxml:objectProperties>
<foxml:datastream ID="DC" STATE="A" CONTROL_GROUP="X" VERSIONABLE="true">
<foxml:datastreamVersion ID="DC1.0" LABEL="Dublin Core Record for this object" CREATED="2012-07-26T18:31:01.356Z" MIMETYPE="text/xml" FORMAT_URI="http://www.openarchives.org/OAI/2.0/oai_dc/" SIZE="384">
<foxml:xmlContent>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
  <dc:title>Alan Test Object</dc:title>
  <dc:identifier>alan:testObject</dc:identifier>
</oai_dc:dc>
</foxml:xmlContent>
</foxml:datastreamVersion>
</foxml:datastream>
</foxml:digitalObject>

ajs6f commented 7 years ago

Something weird is definitely going on here-- that is not a valid object. It is valid FOXML on the level of XSD Schema, but not on the level of Schematron and not in terms of the model (missing RELS-EXT and especially missing AUDIT is not legal).

Is it possible that these were not created by Fedora, but by someone writing XML onto the filesystem? I cannot see how they could have been created by Fedora code.

whikloj commented 7 years ago

I'm not really sure, but based on the created date and our history with Fedora/Islandora this is probably an original setup type object from a vendor. It might have been written directly on the filesystem, but I can't see why. It definitely would have been a Fedora 3.7 object (possibly 3.6).

ajs6f commented 7 years ago

Well, it's pretty weird, because if you try creating a test object right now (however you like, via import or POSTing to objects/new or whatever) you will find that it doesn't look like that. Can you check another one or two of the objects that the hot indexer doesn't see and see if they are also missing required datastreams?
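
A minimal sketch of that check, assuming the FOXML files can be read straight from the objectStore. The function names and the `EXPECTED_DATASTREAMS` set are illustrative and are not part of the extractor; the set simply mirrors the datastreams named earlier in this thread.

```python
import xml.etree.ElementTree as ET

FOXML_NS = "info:fedora/fedora-system:def/foxml#"

# Datastream IDs treated as required here, following the comment above.
# This set is an assumption; adjust it to match your own repository.
EXPECTED_DATASTREAMS = {"DC", "RELS-EXT", "AUDIT"}


def datastream_ids(foxml_bytes):
    """Return the set of datastream IDs declared in a FOXML document."""
    root = ET.fromstring(foxml_bytes)
    return {ds.get("ID") for ds in root.iter(f"{{{FOXML_NS}}}datastream")}


def missing_datastreams(foxml_bytes, expected=EXPECTED_DATASTREAMS):
    """Return the expected datastream IDs that the object does not declare."""
    return set(expected) - datastream_ids(foxml_bytes)
```

Reading each file in binary mode (`open(path, "rb").read()`) and passing the bytes in lets the parser honour the encoding declared in the XML prolog.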

whikloj commented 7 years ago

Do you want the entire output? I ask because I wrote my script and am generating what should be a list of the missing triples from the two files, which should be a more manageable size.

ajs6f commented 7 years ago

A diff would be more useful, yes, and also, can you tell if it is the case that the hot indexer output is missing triples from the stock output, or if they are both missing triples from each other?

whikloj commented 7 years ago

I'm having trouble with UTF-8 encoding (apparently I may have exported from Mulgara without it) and date issues (also from Mulgara). This has made getting an accurate diff very hard. I'm still looking into it, but at this point it appears that when triples are missing, it is not one or two from an object but the entire object. So I might look at a strict subject comparison to see what it looks like.
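
A strict subject comparison could be sketched as follows, assuming both outputs are line-oriented N-Triples or N-Quads files. The function names are hypothetical. Because only the first term of each statement is read, differences in literal encoding or date formatting do not affect the result.

```python
def subjects(lines):
    """Collect the subject term of each N-Triples/N-Quads statement."""
    found = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # The subject is the first whitespace-delimited term: an IRI or a
        # blank node label, neither of which can contain whitespace.
        found.add(line.split(None, 1)[0])
    return found


def compare_subjects(extractor_lines, triplestore_lines):
    """Return (only_in_extractor, only_in_triplestore) as sets of subjects."""
    from_extractor = subjects(extractor_lines)
    from_triplestore = subjects(triplestore_lines)
    return from_extractor - from_triplestore, from_triplestore - from_extractor
```

Open file objects can be passed directly as the `lines` arguments. Both subject sets are held in memory, which should be far smaller than the full statement files.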

ajs6f commented 7 years ago

Okay, that's what I figured (that some objects aren't getting done at all). Thanks for staying with this!

whikloj commented 7 years ago

I haven't forgotten about this; I was tweaking my diff script to fix my own mistakes (somehow my pull from Mulgara produced backslash-escaped Unicode, but the RDF extractor used nicely encoded Unicode).
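
One way to put both files on the same footing before diffing is to decode the `\uXXXX` and `\UXXXXXXXX` escapes in the escaped file. The sketch below is for comparison purposes only and assumes those are the only escapes that differ between the two exports; the function name is hypothetical.

```python
import re

# An escaped backslash is matched first so that a literal backslash
# followed by the text "u0041" is left alone rather than decoded.
_ESCAPE = re.compile(r"\\\\|\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_unicode_escapes(line):
    """Replace \\uXXXX and \\UXXXXXXXX escapes with the characters they name."""
    def replace(match):
        hex_digits = match.group(1) or match.group(2)
        if hex_digits is None:
            return match.group(0)  # escaped backslash: keep as written
        return chr(int(hex_digits, 16))

    return _ESCAPE.sub(replace, line)
```

Other escapes such as `\"` and `\n` are deliberately untouched, so the statement structure of each line stays intact.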

Anyway, I am also deduplicating, as I have noticed I am hitting duplicates in the RDF extractor triples.

<info:fedora/islandora:12369> <http://purl.org/dc/elements/1.1/coverage> "" <info:ca.umanitoba.fedora#ri> .
<info:fedora/islandora:12369> <http://purl.org/dc/elements/1.1/coverage> "" <info:ca.umanitoba.fedora#ri> .

Once I can filter and log those, I'll let you know how many of those there are. In the end this might not be of much help.

ajs6f commented 7 years ago

Awesome, @whikloj -- I was just about to ping you on this. The duplicates are normal-- they mean that your object (looks like probably your DC datastream) has two empty coverage elements. The extractor does not deduplicate triples-- that would be crazy expensive, and adding a triple to a graph is idempotent for that graph (it may cause events to be triggered in some triplestores, but that's a separate question). You will get the same result. I wouldn't bother deduplicating, or rather, if you are going to do that, look at the original object, not the extracted RDF.

whikloj commented 7 years ago

Seeing as I was using a simple wc -l of the triple file as a count, I may have fewer triples than I thought, and the gap between what Mulgara was holding and what the extractor found may have increased.
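
The difference between a raw line count and a distinct-statement count can be sketched like this. The function name is hypothetical, and the approach assumes the file, or the sample being checked, fits in memory; for a file with over a hundred million lines, `sort -u` piped to `wc -l` is likely more practical.

```python
from collections import Counter


def count_statements(lines):
    """Return (total, distinct, repeats) for an iterable of statement lines.

    repeats maps each statement that occurs more than once to its count.
    """
    counts = Counter(line.strip() for line in lines if line.strip())
    total = sum(counts.values())
    repeats = {stmt: n for stmt, n in counts.items() if n > 1}
    return total, len(counts), repeats
```

Here `total` corresponds to what `wc -l` reports (ignoring blank lines), while `distinct` is the number of statements a triplestore would actually hold after loading the file.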

ajs6f commented 7 years ago

Dan's notes say:

The traditional Fedora rebuilder will not remove existing triples in Fuseki/JENA as part of its execution. It will add new triples. If you need a clean dataset you need to remove the existing triples (or dataset) prior to rebuilding.

Are you sure we aren't running into that?

whikloj commented 7 years ago

I think the difference is going to come down to issues with our NFS... but I'll keep going for what it might be worth.

whikloj commented 7 years ago

I'm really not sure, but if so then I should end up with triples in my triplestore for resources not in Fedora anymore, right?

ajs6f commented 7 years ago

Ish. If the objects were deleted by Fedora, the triples should have been deleted as well. So we're looking for the following sequence: 1) You have some objects. 2) You shut down the repo. 3) Some of the files go away (NFS problems? Accidental admin action?) 4) You do a SQL and 3store rebuild.

So you have those objects that were not deleted but are gone, and you end up with triples left behind. Make sense?

whikloj commented 7 years ago

Ok, this is no good. I finally got a diff and it's wrong. I think in transferring and sorting the triple files I must have lost some data.

Best case scenario is for me to try and run it again (but slow it down to allow for our NFS storage) and push that new triple file to a separate Blazegraph instance. Then figure out a way to traverse both graphs looking for differences.

ajs6f commented 7 years ago

Okay, I'm sorry this is being annoying for you. To slow it down, throttle the number of threads in use and tighten the queue.

whikloj commented 7 years ago

No worries, it's not your fault. I think I must have screwed up when I sorted the triple file and never noticed that I lost some data. Luckily it was after I had added it all to the new triplestore 😄

whikloj commented 6 years ago

So I am finally getting around to running this against our system again. Unfortunately, I think we are in the process of ingesting some more newspapers, so the numbers might be hard to match up, but hopefully they will be closer than last time. It is moving a lot quicker than in the past, but campus IT has increased the NFS head since my last attempt.

I'll keep you informed.

ajs6f commented 6 years ago

Rad!

ajs6f commented 6 years ago

@whikloj Is this still a live thread?

whikloj commented 6 years ago

Not sure, maybe the test indicated in https://github.com/ajs6f/fcrepo3-rdf-extractor/issues/4#issuecomment-377249785 will make it clear if stuff is being dropped.

ajs6f commented 6 years ago

Okay, cool. It could even be the same problem as we are discussing at https://github.com/ajs6f/fcrepo3-rdf-extractor/issues/5.

whikloj commented 6 years ago

Ok, now that #5 seems resolved, I am able to run this from our new server on our production data but without bothering the production Fedora. So maybe I can help resolve some things now.

It is running and I just saw this whip by in the log.

INFO 2018-04-02 15:26:22.962 [pool-2-thread-4] (Extract) Reached 901120 objects at info:fedora/uofm:1847167 with 305944 in-queue after 446 errors.
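
A progress line like this can be turned into numbers with a small parser. This is a sketch whose pattern is inferred only from the single line quoted above, so other versions of the tool may log a different shape; the names `PROGRESS` and `parse_progress` are hypothetical.

```python
import re

# Pattern inferred from the one progress line quoted in this thread.
PROGRESS = re.compile(
    r"Reached (?P<objects>\d+) objects at (?P<pid>\S+) "
    r"with (?P<queued>\d+) in-queue after (?P<errors>\d+) errors"
)


def parse_progress(line):
    """Return the counters from a progress line, or None if it does not match."""
    match = PROGRESS.search(line)
    if match is None:
        return None
    objects = int(match["objects"])
    errors = int(match["errors"])
    return {
        "objects": objects,
        "pid": match["pid"],
        "queued": int(match["queued"]),
        "errors": errors,
        # Errors per object reached so far; 0.0 if nothing has been reached.
        "error_rate": errors / objects if objects else 0.0,
    }
```

Tracking `error_rate` across successive progress lines gives a rough sense of whether errors cluster in one part of the store or are spread evenly.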

Of the errors I saw this one.

**ERROR 2018-04-02 15:26:23.680 [pool-2-thread-1] (ObjectProcessor) Couldn't extract triples from datastream [NO DS ID] from object info:fedora/uofm:1236828! Caused by:
java.lang.NullPointerException: null**

Going to try and find that file to see if there is something obvious.

whikloj commented 6 years ago

New alternate error also occurred. I'm going to throw them all here; some might not be a problem with the program so much as with our setup. But if you think they are big or distinct enough, we can open separate tickets.

ERROR 2018-04-02 15:26:32.514 [pool-2-thread-1] (ObjectProcessor) Couldn't find datastream DC from object info:fedora/uofm:2939588! Caused by:
org.akubraproject.MissingBlobException: (Missing blob with id = 'file:0f/uofm%3A2939588%2BDC%2BDC.0')
    at org.akubraproject.fs.FSBlob.openInputStream(FSBlob.java:100)
    at org.akubraproject.impl.BlobWrapper.openInputStream(BlobWrapper.java:93)
    at edu.si.fcrepo.ObjectProcessor.getDatastreamContent(ObjectProcessor.java:205)
    at edu.si.fcrepo.ObjectProcessor.consume(ObjectProcessor.java:180)
    at edu.si.fcrepo.ObjectProcessor.accept(ObjectProcessor.java:152)
    at edu.si.fcrepo.Extract.lambda$null$3(Extract.java:240)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
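
When hunting for the file behind a `MissingBlobException`, it can help to percent-decode the blob id from the message. The sketch below assumes ids shaped like the one above, a `file:` prefix followed by a directory and a percent-encoded name; the function name is hypothetical.

```python
from urllib.parse import unquote


def decode_blob_id(blob_id):
    """Split a file blob id into its directory prefix and decoded name."""
    # Drop the "file:" scheme if present.
    path = blob_id[len("file:"):] if blob_id.startswith("file:") else blob_id
    directory, _, name = path.rpartition("/")
    # unquote() decodes %XX sequences and, unlike unquote_plus(),
    # leaves literal "+" characters alone.
    return directory, unquote(name)
```

For the id in the trace above this yields the directory `0f` and the name `uofm:2939588+DC+DC.0`, which appears to be the PID, datastream ID and version ID joined by `+`.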

ajs6f commented 6 years ago

Can we break these guys apart? I don't think they are caused by the same problems. My big question for this ticket is what the percentage is. I.e., out of how many quads you should get, how many you did get. If they are less than 1k-ish apart, I'm inclined to close this ticket, with the understanding that it was basically the #5 problem, and open two fresh ones for these two issues you are finding now.

(Unless, of course, there turns out to be some bobble in the FOXML that caused them.)

whikloj commented 6 years ago

I'm going to close this... I don't think, short of starting with a completely fresh repository, there is a way to really get a comparison.

I did a sample of 104 PIDs from my production repository.

I ended up with 12,232 triples when querying the current triplestore versus 12,136 produced by this tool (difference of 96).
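
The percentage asked about earlier in the thread follows directly from these counts. The sketch below treats the triplestore count as the expected value, which is an assumption, since the triplestore may itself hold stale or extra triples; the function name is hypothetical.

```python
def shortfall(expected, actual):
    """Return (missing, percent_missing) of actual relative to expected."""
    missing = expected - actual
    return missing, 100.0 * missing / expected


# Sample of 104 PIDs: triplestore count versus this tool's count.
sample_missing, sample_pct = shortfall(12_232, 12_136)

# First full run reported at the top of this issue.
full_missing, full_pct = shortfall(125_262_308, 121_819_261)
```

With these inputs the sample differs by 96 triples, a little under 0.8%, while the first full run differed by 3,443,047 triples, roughly 2.75%.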

Of these, most seemed to be multiple <info:fedora/fedora-system:def/view#lastModifiedDate> triples from the triplestore, which seems to be its failure, but as I migrated this same content from Mulgara to Blazegraph it could be historical.

Additionally, in the triplestore several empty-string triples become one, whereas with this tool you either get them all or none.

Lastly, there were 1 or 2 triples missing from the triplestore.

All of this adds up to say that (in this limited sample) this tool seemed to produce a better set of triples.

As this next migration is to a new box, I am going to try just using the output of this tool and see if anything blows up.

ajs6f commented 6 years ago

Okay, cool. I'm sorry it wasn't more helpful on this go-around. I hope the migration goes well.

About empty-literal triples, did you use the --skipEmptyLiterals flag?

whikloj commented 6 years ago

Yes I did see that flag and I am going to try using that in this run. 👍