jhercher opened this issue 9 years ago
How much RAM is on the machine in total? The issue is that starting the Fusepool platform with too much RAM assigned to the JVM takes memory away from TDB, which provides the data via memory-mapped files. So you most probably have to reduce the amount of RAM given to the JVM to get TDB to work properly; at least that is what I would try first.
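For example (just a sketch; the heap size and the launcher jar name depend on your setup), starting the launcher with a smaller heap leaves more of the machine's memory to the OS page cache that backs TDB's memory-mapped files:

# assumption: launcher-0.1-SNAPSHOT.jar is whatever jar you normally start the platform with
java -Xmx2G -XX:MaxPermSize=512M -jar launcher-0.1-SNAPSHOT.jar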
Ah, and another thing: what are you trying to do? We are now getting close with the Fusepool P3 platform, which is the EU FP7 successor project. It might be that this is a better fit for some of your use cases. If you describe what you are trying to do, I can tell you whether it would make sense to use P3 instead, which uses Stanbol as an enhancer engine in the background.
It is great to hear that you support Stanbol as an enhancer engine in the background of Fusepool P3. Currently I am trying to load the GND authority file [1], plus resources from other LOD datasets that are related to it. I'll send you a mail.
The machine has 16 GB of RAM, so there should be enough space. I tried to increase the JVM heap to 6 GB, but it still failed on startup. Then I tried it with a new instance and a smaller file. I discovered that the graph I loaded the data into with tdbloader (version 1.1.1) was not complete: I can SPARQL the existing entities, but the added ones did not appear (not even with tdbquery on the graph). The enhancer still works, but only with the old data. I googled a bit and found that this might be related to the Jena/TDB versions used by Clerezza [2].
So the conclusion is: even if the big file goes through, the dictionary annotator will not be able to use it...
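For reference, this is roughly how I checked what actually ended up in the store (the TDB directory path is a placeholder, using the standard Jena command line tools):

# count triples per named graph directly on the TDB directory
tdbquery --loc=/path/to/fusepool-tdb 'SELECT ?g (COUNT(*) AS ?n) WHERE { GRAPH ?g { ?s ?p ?o } } GROUP BY ?g'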
[1] http://datendienst.dnb.de/cgi-bin/mabit.pl?userID=opendata&pass=opendata&cmd=login
[2] http://mail-archives.apache.org/mod_mbox/stanbol-dev/201211.mbox/%3C50942B3F.8010600@apache.org%3E
@retog, do you have an idea what could be wrong with the data in TDB? The link from @jhercher looks pretty old, so I doubt Clerezza is still using such an old Jena version?
@jhercher, could you point me to the data you have loaded so far? I will try to reproduce it here as well.
Hi @ktk,
just load the GND concept identifiers into TDB and query them with:
PREFIX fn: <http://www.w3.org/2005/xpath-functions#>
PREFIX gndo: <http://d-nb.info/standards/elementset/gnd#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX : <http://d-nb.info/gnd/>
CONSTRUCT {
  ?s skos:prefLabel ?uniqname ;
     skos:altLabel ?libname ;
     skos:altLabel ?ln ;
     skos:altLabel ?fullname ;
     a gndo:DifferentiatedPerson .
}
WHERE {
  ?s a gndo:DifferentiatedPerson ;
     gndo:preferredNameEntityForThePerson ?bnode1 .
  ?bnode1 gndo:forename ?fn ;
          gndo:surname ?ln .
  BIND (CONCAT(STR(?fn), ' ', STR(?ln)) AS ?fullname)
  BIND (CONCAT(STR(?ln), ', ', STR(?fn)) AS ?libname)
  BIND (CONCAT(STR(?ln), '_', SUBSTR(STR(?s), 22, 33)) AS ?uniqname)
}
You will get ~16 million triples for about 3 million persons from the German National Library.
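To reproduce the subset, you can run the query above with tdbquery against the TDB directory of the full dump, roughly like this (paths and file names are placeholders, and the output serialization may need adjusting):

# persons.rq contains the CONSTRUCT query above
tdbquery --loc=/path/to/gnd-tdb --query=persons.rq > persons_subset.ttl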
I finally tried to load this into a store; how many triples are in this file? I gave up after 60 million due to lack of space on my laptop. Also, you want to enhance against this dataset after it is loaded, right? If so, I think we have to change the strategy; I am not sure this will work properly within TDB.
The recent GND dump has 8.37.921.683 triples (as of Feb. 3rd). However, I used the last dump from Nov. 2014. Loading it into Fuseki 1.1.1 takes ~14 GB on my disk. Then I used tdbquery to extract a subset of "differentiated" persons (cf. the query above), which is ~1 GB (~16,000,000 triples) for around 3 million persons.
I can query this dataset with tdbquery, and it should work within Fuseki as well (though I have not tried that yet). But what I am trying to achieve is a kind of gazetteer service using the Stanbol/Fusepool enhancer service [1] and Fusepool's dictionary annotator as the enhancement engine. The service should deliver a GND identifier if a name (e.g. "Johann Wolfgang von Goethe") is provided. This currently works for smaller datasets like subject vocabularies, but not for bigger datasets, because I am not able to load them into the platform via the graph upload form or tdbloader. I can see that the DBpedia Spotlight annotation engine accesses a remote service to do a similar thing with DBpedia entities; this might be the way to do it... Also, "Entityhub Linking" [2,3] might be a solution, but I have not dived into the Stanbol details yet.
[1] https://stanbol.apache.org/docs/trunk/components/enhancer/
[2] https://stanbol.apache.org/docs/trunk/components/enhancer/engines/entityhublinking
[3] https://stanbol.apache.org/docs/trunk/components/entityhub/
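Just to illustrate the intended lookup (the endpoint URL is an assumption based on a default Stanbol enhancer setup, cf. [1]; the Fusepool endpoint may differ):

# send a plain-text name to the enhancer and ask for a Turtle response
curl -X POST -H "Content-Type: text/plain" -H "Accept: text/turtle" --data "Johann Wolfgang von Goethe" http://localhost:8080/enhancer

The enhancement results should then contain an entity annotation whose fise:entity-reference points to the matching GND resource.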
Hi Gabor,
I loaded an .nt file into an existing graph (the file has ~16,000,000 triples, 1.84 GB):

tdbloader --loc=urn%3Ax-localinstance%3A%2Fpersons.graph /Data/out/persons.nt
Then I started Fusepool with a 4 GB heap:
java -Xmx4G -XX:MaxPermSize=512M -Xss512k -XX:+UseCompressedOops -jar launcher-0.1-SNAPSHOT.jar
The startup took ~1 hour and I saw some garbage collector errors (cf. log below). Fusepool acted very slowly and the enhancer was not available.

*ERROR* [FelixStartLevel] eu.fusepool.enhancer.engines.dictionaryannotator [eu.fusepool.enhancer.engines.dictionaryannotator.DictionaryAnnotatorEnhancerEngine(12)] The activate method has thrown an exception (java.lang.OutOfMemoryError: GC overhead limit exceeded)
java.lang.OutOfMemoryError: GC overhead limit exceeded
How much RAM do you suggest to load a file of this size? Also, it looks as if the file is loaded on each startup; is there a way to prevent this?
Cheers, Johannes