fusepoolP3 / p3-dictionary-matcher-transformer

Dictionary Matcher is a P3 transformer for SKOS-based entity extraction.
Apache License 2.0

cache for the dictionary #1

Closed jhercher closed 8 years ago

jhercher commented 9 years ago

Hi,

I played around with the dictionary matcher, and I see that it loads the dictionary before each request/annotation task, right? Is there a way to store the dictionary persistently? I have a very large file (see also: https://github.com/fusepool/fusepool-sma/issues/6) and it would not be feasible to load it before each request.

BR, Johannes

gaborremenyi commented 9 years ago

Hi Johannes,

Yes, it loads the dictionary in the same request, before annotating the input text. Dictionaries are cached on first use, keyed by the URI of the dictionary, so follow-up requests just use the cached instance. The cache lives as long as the transformer is running. What might be a problem is the size of the file: your initial request might time out, since we are talking about a synchronous transformer. I hope that is not going to be the case. Anyway, let me know how it goes.

Best, Gabor
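
The per-URI caching described above can be sketched roughly as follows. This is a minimal illustration, not the transformer's actual code: the class `DictionaryCache`, the `loader` function, and the example URI are hypothetical names, and the real implementation may differ.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch, not the transformer's real code.
class DictionaryCache<D> {
    // One cached dictionary per dictionary URI, kept for the life of the JVM.
    private final Map<String, D> cache = new ConcurrentHashMap<>();
    // Assumed loader: fetches and parses the dictionary behind a URI.
    private final Function<String, D> loader;

    DictionaryCache(Function<String, D> loader) {
        this.loader = loader;
    }

    // Runs the loader on the first request for a URI; later requests
    // for the same URI reuse the cached instance.
    D get(String uri) {
        return cache.computeIfAbsent(uri, loader);
    }

    public static void main(String[] args) {
        DictionaryCache<String> cache =
                new DictionaryCache<>(uri -> "parsed dictionary from " + uri);
        String first = cache.get("http://example.org/dictionary.rdf");
        String second = cache.get("http://example.org/dictionary.rdf");
        // Both requests are served by the same cached object.
        System.out.println(first == second);
    }
}
```

Because the map lives in the JVM heap, such a cache is lost when the process stops, and the first request for each URI still pays the full loading cost.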

jhercher commented 9 years ago

Hi Gabor, thanks for the quick reply! I tried to load a 3 GB RDF file but ran out of memory on my machine with 8 GB (cf. below). I'll retry this on a bigger machine by the end of the week:


(1) java -Xmx4096M -jar dictionary-matcher-transformer-v1.0.0-20150219-jar-with-dependencies.jar
=>  
[qtp2039270567-13] WARN org.eclipse.jetty.servlet.ServletHandler - Error for /
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.Arrays.copyOfRange(Arrays.java:2694)
    at java.lang.String.<init>(String.java:203)
    at org.apache.jena.iri.impl.LexerPath.yytext(LexerPath.java:420)
    at org.apache.jena.iri.impl.AbsLexer.rule(AbsLexer.java:81)
    at org.apache.jena.iri.impl.LexerPath.yylex(LexerPath.java:689)
    at org.apache.jena.iri.impl.AbsLexer.analyse(AbsLexer.java:52)
    at org.apache.jena.iri.impl.Parser.<init>(Parser.java:108)
    at org.apache.jena.iri.impl.IRIImpl.<init>(IRIImpl.java:65)
    at org.apache.jena.iri.impl.AbsIRIFactoryImpl.create(AbsIRIFactoryImpl.java:38)
    at org.apache.jena.iri.impl.IRIFactoryImpl.create(IRIFactoryImpl.java:262)
    at com.hp.hpl.jena.rdf.arp.impl.XMLHandler.checkNamespaceURI(XMLHandler.java:442)
    at com.hp.hpl.jena.rdf.arp.impl.XMLHandler.startPrefixMapping(XMLHandler.java:93)
    at org.apache.xerces.parsers.AbstractSAXParser.startNamespaceMapping(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.startElement(Unknown Source)
    at org.apache.xerces.impl.XMLNamespaceBinder.handleStartElement(Unknown Source)
    at org.apache.xerces.impl.XMLNamespaceBinder.startElement(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanStartElement(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
    at org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at com.hp.hpl.jena.rdf.arp.impl.RDFXMLParser.parse(RDFXMLParser.java:151)
    at com.hp.hpl.jena.rdf.arp.JenaReader.read(JenaReader.java:168)
    at com.hp.hpl.jena.rdf.arp.JenaReader.read(JenaReader.java:155)
    at com.hp.hpl.jena.rdf.arp.JenaReader.read(JenaReader.java:226)
    at com.hp.hpl.jena.rdf.model.impl.ModelCom.read(ModelCom.java:274)
    at org.apache.clerezza.rdf.jena.parser.JenaParserProvider.parse(JenaParserProvider.java:68)
    at org.apache.clerezza.rdf.core.serializedform.Parser.parse(Parser.java:240)
    at org.apache.clerezza.rdf.core.serializedform.Parser.parse(Parser.java:193)
    at eu.fusepool.p3.dictionarymatcher.Reader.readDictionary(Reader.java:36)

(2) java -Xmx3096M -XX:+UseG1GC -jar dictionary-matcher-transformer-v1.0.0-20150219-jar-with-dependencies.jar
=> 
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Scheduler-386981384"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "qtp390868901-10"
Exception in thread "qtp390868901-17" [qtp390868901-15] WARN org.eclipse.jetty.servlet.ServletHandler - Error for /
java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space

I'm glad that the file needs to be loaded only once at startup. Unfortunately, the last used/loaded dictionary overwrites the previous one. I can imagine using multiple instances of the matcher to allow annotation with more than one dictionary, but that would be limited by RAM. So, do you plan to introduce some kind of persistence layer to store the tries permanently? Maybe there is already a way, using a database or a SPARQL endpoint?

Cheers, Johannes
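
One way to keep several dictionaries loaded at once while still bounding memory is a small least-recently-used cache keyed by dictionary URI. The sketch below is an assumption about how this could look, not an existing feature of the transformer: `LruDictionaryCache` and its `maxEntries` parameter are hypothetical, and the bound counts dictionaries rather than bytes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: keeps at most maxEntries dictionaries in memory
// and evicts the least recently used one when the limit is exceeded.
class LruDictionaryCache<D> extends LinkedHashMap<String, D> {
    private final int maxEntries;

    LruDictionaryCache(int maxEntries) {
        // accessOrder = true: iteration order follows recency of access.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after each insertion; returning true
    // removes the least recently accessed entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<String, D> eldest) {
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        LruDictionaryCache<String> cache = new LruDictionaryCache<>(2);
        cache.put("http://example.org/a.rdf", "dictionary A");
        cache.put("http://example.org/b.rdf", "dictionary B");
        cache.get("http://example.org/a.rdf");                 // A is now most recent
        cache.put("http://example.org/c.rdf", "dictionary C"); // evicts B
        System.out.println(cache.keySet());
    }
}
```

`LinkedHashMap` is not thread-safe, so in a server handling concurrent requests it would need external synchronization, for example via `Collections.synchronizedMap`.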

retog commented 9 years ago

@gaborremenyi: what do you think, could we have an option to use a persistent TDB store?
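
As a rough illustration of the persistence idea, the sketch below writes an already-built dictionary to disk once and reads it back on a later start, so the RDF source does not have to be parsed again. It uses plain Java serialization rather than TDB, and `DictionaryStore` and the label-to-URI `HashMap` are hypothetical stand-ins for the matcher's real data structures. The reloaded dictionary still has to fit in the heap, so this alone would not solve the out-of-memory errors above.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;

// Hypothetical sketch: persists a label -> concept URI map between runs.
class DictionaryStore {
    // Writes the whole dictionary to a single file.
    static void save(HashMap<String, String> labelsToUris, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeObject(labelsToUris);
        }
    }

    // Reads back a dictionary previously written by save().
    @SuppressWarnings("unchecked")
    static HashMap<String, String> load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            return (HashMap<String, String>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("dictionary", ".bin");
        HashMap<String, String> dictionary = new HashMap<>();
        dictionary.put("Berlin", "http://example.org/concept/berlin");
        save(dictionary, file);
        System.out.println(load(file).equals(dictionary));
        Files.delete(file);
    }
}
```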

retog commented 8 years ago

Closing for lack of activity