OOM during dependency parsing #17

Closed 8 years ago

thvitt commented 8 years ago

Running the wrapper against Zesen,-Philipp-von_Simson.txt with this config file causes an OutOfMemoryError even with a 6 GB heap; see below for the command line and output.

A bisection session shows that the error was introduced by commit 5037c32a447cf6cec46c2b2f5ecb9ee83a46ceef.

% java -Xmx6G -Xms6G -jar ddw-0.4.1.jar -input ../../romankorpus/Zesen,-Philipp-von_Simson.txt -output target
log4j:WARN No appenders could be found for logger (org.apache.commons.configuration.PropertiesConfiguration).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Input: ../../romankorpus/Zesen,-Philipp-von_Simson.txt
Output: target
Config: configs/default.properties, configs/default_en.properties
Language: en
Reader: Text
Start Quote: "„»
Paragraph Single Line Break: false
Segmenter: true
Segmenter: class de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter
POS-Tagger: true
POS-Tagger: class de.tudarmstadt.ukp.dkpro.core.treetagger.TreeTaggerPosTagger
POS-Tagger: executablePath, /home/tv/git/dkpro-exp/treetagger/bin/tree-tagger, modelLocation, /home/tv/git/dkpro-exp/treetagger/german-par-linux-3.2-utf8.bin, modelEncoding, utf-8
Lemmatizer: false
Lemmatizer: class de.tudarmstadt.ukp.dkpro.core.matetools.MateLemmatizer
Chunker: true
Chunker: class de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpChunker
Morphology Tagging: false
Morphology Tagging: class de.tudarmstadt.ukp.dkpro.core.matetools.MateMorphTagger
Named Entity Recognition: true
Named Entity Recognition: class de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordNamedEntityRecognizer
Dependency Parsing: true
Dependency Parsing: class de.tudarmstadt.ukp.dkpro.core.matetools.MateParser
Dependency Parsing: writeConstituency, false
Constituency Parsing: false
Constituency Parsing: class de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordParser
Constituency Parsing: writeDependency, false
Semantic Role Labeling: false
Semantic Role Labeling: class de.tudarmstadt.ukp.dkpro.core.matetools.MateSemanticRoleLabeler
Process 1 files

Start running the pipeline (this may take a while)...
Process file: Zesen,-Philipp-von_Simson.txt
26.42.38   is2.parser.Parser -1:readModel ->           Reading data started
26.42.60   is2.data.Cluster -1:<init> ->               Read cluster with 0 words 
26.53.110  is2.parser.ParametersFloat -1:read ->       read parameters 134217727 not zero 19957525
26.53.111  is2.parser.Parser -1:readModel ->           parsing -- li size 134217727
26.53.114  is2.parser.Parser -1:readModel ->           Stacking false
26.53.114  is2.parser.Extractor -1:initStat ->         mult  (d4) 
Used parser   class is2.parser.Parser
Creation date 2012.11.02 14:33:53
Training data CoNLL2009-ST-English-ALL.txt.crossannotated
Iterations    10 Used sentences 10000000
Cluster       null
26.53.116  is2.parser.Parser -1:readModel ->           Reading data finnished
26.53.117  is2.parser.Extractor -1:initStat ->         mult  (d4) 
Out of Memory at file: /home/tv/DARIAH/ddw/dariah-dkpro-wrapper-0.4.3/../../romankorpus/Zesen,-Philipp-von_Simson.txt
---- DONE -----

thvitt commented 8 years ago

The incomplete sentence has strange sentence numbering.

Present only in 0.4.2-Zesen,-Philipp-von_Simson.txt.csv: sentence IDs 36 to 38, with sentence ID = 1 in between

thvitt commented 8 years ago

Stack trace of the OOM:

java.lang.OutOfMemoryError: Java heap space
        at is2.data.DataFES.<init>(Unknown Source)
        at is2.parser.Pipe.fillVector(Unknown Source)
        at is2.parser.Parser.parse(Unknown Source)
        at is2.parser.Parser.apply(Unknown Source)
        at de.tudarmstadt.ukp.dkpro.core.matetools.MateParser.process(MateParser.java:226)
        at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
        at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:385)
        at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:308)
        at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.processUntilNextOutputCas(ASB_impl.java:570)
        at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.<init>(ASB_impl.java:412)
        at org.apache.uima.analysis_engine.asb.impl.ASB_impl.process(ASB_impl.java:344)
        at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.processAndOutputNewCASes(AggregateAnalysisEngine_impl.java:265)
        at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:269)
        at org.apache.uima.fit.pipeline.SimplePipeline.runPipeline(SimplePipeline.java:150)
        at de.tudarmstadt.ukp.dariah.pipeline.RunPipeline.main(RunPipeline.java:635)

thvitt commented 8 years ago

In the master branch, the first sentence returned from JCasUtil.select(jcas, Sentence.class) in this minimal example ranges from offset 0 to 3378 (that's the whole first paragraph), so it covers far too much text. Older versions correctly return only the span up to the first full stop.
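
A quick way to spot this kind of mis-segmentation before the parser runs is to flag sentence spans that are implausibly long. The following is only a sketch in plain Java (16+, for records): `Span` is a hypothetical stand-in for the begin/end offsets of each `Sentence` annotation, the 1000-character threshold is an arbitrary assumption, and the second span in `main` is made-up illustrative data. Only the 0 to 3378 span comes from the observation above.

```java
import java.util.ArrayList;
import java.util.List;

public class SentenceSpanCheck {

    /** Hypothetical stand-in for the begin/end offsets of a Sentence annotation. */
    record Span(int begin, int end) {
        int length() {
            return end - begin;
        }
    }

    /** Returns every span that is longer than maxChars characters. */
    static List<Span> oversized(List<Span> sentences, int maxChars) {
        List<Span> result = new ArrayList<>();
        for (Span s : sentences) {
            if (s.length() > maxChars) {
                result.add(s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // First span: the offsets reported above. Second span: illustrative only.
        List<Span> sentences = List.of(new Span(0, 3378), new Span(3379, 3450));

        // The threshold of 1000 characters is an assumption, not a value from the project.
        for (Span s : oversized(sentences, 1000)) {
            System.out.println("Suspicious sentence span: " + s.begin() + "-" + s.end());
        }
    }
}
```

In the real pipeline the spans would presumably be built from the annotations returned by JCasUtil.select(jcas, Sentence.class), which would show whether the segmenter output is already wrong before MateParser is invoked.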