DARIAH-DE / DARIAH-DKPro-Wrapper

Wrapper for DKPro Core to extract linguistic information from books.
http://dariah-de.github.io/DARIAH-DKPro-Wrapper
Apache License 2.0

OOM during dependency parsing #17

Closed by thvitt 8 years ago

thvitt commented 8 years ago

Running the wrapper against Zesen,-Philipp-von_Simson.txt using this config file causes a java.lang.OutOfMemoryError, even with a 6 GB heap. See below for the command line and output.

A bisection session shows the error was introduced by commit 5037c32a447cf6cec46c2b2f5ecb9ee83a46ceef.

% java -Xmx6G -Xms6G -jar ddw-0.4.1.jar -input ../../romankorpus/Zesen,-Philipp-von_Simson.txt -output target
log4j:WARN No appenders could be found for logger (org.apache.commons.configuration.PropertiesConfiguration).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Input: ../../romankorpus/Zesen,-Philipp-von_Simson.txt
Output: target
Config: configs/default.properties, configs/default_en.properties
Language: en
Reader: Text
Start Quote: "„»
Paragraph Single Line Break: false
Segmenter: true
Segmenter: class de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter
POS-Tagger: true
POS-Tagger: class de.tudarmstadt.ukp.dkpro.core.treetagger.TreeTaggerPosTagger
POS-Tagger: executablePath, /home/tv/git/dkpro-exp/treetagger/bin/tree-tagger, modelLocation, /home/tv/git/dkpro-exp/treetagger/german-par-linux-3.2-utf8.bin, modelEncoding, utf-8
Lemmatizer: false
Lemmatizer: class de.tudarmstadt.ukp.dkpro.core.matetools.MateLemmatizer
Chunker: true
Chunker: class de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpChunker
Morphology Tagging: false
Morphology Tagging: class de.tudarmstadt.ukp.dkpro.core.matetools.MateMorphTagger
Named Entity Recognition: true
Named Entity Recognition: class de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordNamedEntityRecognizer
Dependency Parsing: true
Dependency Parsing: class de.tudarmstadt.ukp.dkpro.core.matetools.MateParser
Dependency Parsing: writeConstituency, false
Constituency Parsing: false
Constituency Parsing: class de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordParser
Constituency Parsing: writeDependency, false
Semantic Role Labeling: false
Semantic Role Labeling: class de.tudarmstadt.ukp.dkpro.core.matetools.MateSemanticRoleLabeler
Process 1 files

Start running the pipeline (this may take a while)...
Process file: Zesen,-Philipp-von_Simson.txt
26.42.38   is2.parser.Parser -1:readModel ->           Reading data started
26.42.60   is2.data.Cluster -1:<init> ->               Read cluster with 0 words 
26.53.110  is2.parser.ParametersFloat -1:read ->       read parameters 134217727 not zero 19957525
26.53.111  is2.parser.Parser -1:readModel ->           parsing -- li size 134217727
26.53.114  is2.parser.Parser -1:readModel ->           Stacking false
26.53.114  is2.parser.Extractor -1:initStat ->         mult  (d4) 
Used parser   class is2.parser.Parser
Creation date 2012.11.02 14:33:53
Training data CoNLL2009-ST-English-ALL.txt.crossannotated
Iterations    10 Used sentences 10000000
Cluster       null
26.53.116  is2.parser.Parser -1:readModel ->           Reading data finnished
26.53.117  is2.parser.Extractor -1:initStat ->         mult  (d4) 
Out of Memory at file: /home/tv/DARIAH/ddw/dariah-dkpro-wrapper-0.4.3/../../romankorpus/Zesen,-Philipp-von_Simson.txt
---- DONE -----

thvitt commented 8 years ago

The incomplete sentence has strange sentence numbering.

Present only in 0.4.2-Zesen,-Philipp-von_Simson.txt.csv: sentence IDs 36 to 38, with sentence ID = 1 in between

_   2   36  2131    7236    7243    Hiermit hiermit ADV PAV _   _   _   _   _   0   _   _   _
_   2   36  2132    7244    7247    mus mus V   VVFIN   _   _   _   _   _   0   _   _   _
_   2   36  2133    7248    7251    ich ich PR  PPER    _   _   _   _   _   0   _   _   _
_   2   36  2134    7252    7260    schlüßen  schlüßen  NN  NN  _   _   _   _   _   0   _   _   _
_   2   36  2135    7261    7262    /   /   PUNC    $(  _   _   _   _   _   0   _   _   _
_   2   36  2136    7263    7266    und und CONJ    KON _   _   _   _   _   0   _   _   _
_   2   36  2137    7267    7273    darbei  darbei  ADV ADV _   _   _   _   _   0   _   _   _
_   2   36  2138    7274    7277    den die ART ART _   _   _   _   _   0   _   _   _
_   2   36  2139    7278    7291    guhthertzigen   guhthertzigen   ADJ ADJA    _   _   _   _   _   0   _   _   _
_   2   36  2140    7292    7297    Leser   Leser   NN  NN  _   _   _   _   _   0   _   _   _
_   2   36  2141    7298    7306    ersuchen    ersuchen    V   VVFIN   _   _   _   _   _   0   _   _   _
_   2   36  2142    7307    7311    mich    ich PR  PRF _   _   _   _   _   0   _   _   _
_   2   36  2143    7312    7314    in  in  PP  APPR    _   _   _   _   _   0   _   _   _
_   2   36  2144    7315    7319    sein    sein    PR  PPOSAT  _   _   _   _   _   0   _   _   _
_   2   36  2145    7320    7326    Gebäht Gebäht NN  NN  _   _   _   _   _   0   _   _   _
_   2   36  2146    7327    7328    /   /   PUNC    $(  _   _   _   _   _   0   _   _   _
_   2   36  2147    7329    7331    zu  zu  PP  APPR    _   _   _   _   _   0   _   _   _
_   2   36  2148    7332    7347    wiedererlangung Wiedererlangung NN  NN  _   _   _   _   _   0   _   _   _
_   2   36  2149    7348    7354    meiner  mein    PR  PPOSAT  _   _   _   _   _   0   _   _   _
_   2   36  2150    7355    7365    Gesundheit  Gesundheit  NN  NN  _   _   _   _   _   0   _   _   _
_   2   36  2151    7366    7367    /   /   PUNC    $(  _   _   _   _   _   0   _   _   _
_   2   36  2152    7368    7384    miteinzuschlüßen  miteinzuschlüßen  V   VVFIN   _   _   _   _   _   0   _   _   _
_   2   36  2153    7384    7385    .   .   PUNC    $.  _   _   _   _   _   0   _   _   _
_   2   1   1098    7386    7391    Dafür  dafür  ADV PAV _   _   _   _   _   0   _   _   _
_   2   1   1099    7392    7395    ihm er  PR  PPER    _   _   _   _   _   0   _   _   _
_   2   1   1100    7396    7399    dan dan V   VVFIN   _   _   _   _   _   0   _   _   _
_   2   1   1101    7400    7404    alle    alle    PR  PIAT    _   _   _   _   _   0   _   _   _
_   2   1   1102    7405    7410    meine   mein    PR  PPOSAT  _   _   _   _   _   0   _   _   _
_   2   1   1103    7411    7424    Geflissenheit   Geflissenheit   NN  NN  _   _   _   _   _   0   _   _   _
_   2   1   1104    7425    7434    gewiedmet   gewiedmet   V   VVPP    _   _   _   _   _   0   _   _   _
_   2   1   1105    7435    7439    sein    sein    V   VAINF   _   _   _   _   _   0   _   _   _
_   2   1   1106    7440    7443    sol Sol NN  NN  _   _   _   _   _   0   _   _   _
_   2   1   1107    7444    7445    /   /   PUNC    $(  _   _   _   _   _   0   _   _   _
_   2   1   1108    7446    7448    so  so  ADV ADV _   _   _   _   _   0   _   _   _
_   2   1   1109    7449    7454    lange   lange   ADV ADV _   _   _   _   _   0   _   _   _
_   2   1   1110    7455    7456    /   /   PUNC    $(  _   _   _   _   _   0   _   _   _
_   2   1   1111    7457    7460    als als CONJ    KOKOM   _   _   _   _   _   0   _   _   _
_   2   1   1112    7461    7464    ich ich PR  PPER    _   _   _   _   _   0   _   _   _
_   2   1   1113    7465    7468    bin sein    V   VAFIN   _   _   _   _   _   0   _   _   _
_   2   1   1114    7469    7472    und und CONJ    KON _   _   _   _   _   0   _   _   _
_   2   1   1115    7473    7479    heisse  heißen V   VVFIN   _   _   _   _   _   0   _   _   _
_   3   37  2154    7524    7527    Das die ART ART _   nom|sg|neut _   _   _   0   _   _   _
_   3   37  2155    7528    7533    erste   erst    ADJ ADJA    _   nom|sg|neut|pos _   _   _   0   _   _   _
_   3   37  2156    7534    7538    Buch    Buch    NN  NN  _   nom|sg|neut _   _   _   0   _   _   _
_   3   37  2157    7538    7539    .   .   PUNC    $.  _   _   _   _   _   0   _   _   _
_   3   38  2158    7541    7544    Die die ART ART _   _   _   _   _   0   _   _   _
_   3   38  2159    7545    7546    (   (   PUNC    $(  _   _   _   _   _   0   _   _   _
_   3   38  2160    7546    7547    1   1   CARD    CARD    _   _   _   _   _   0   _   _   _
_   3   38  2161    7547    7548    )   )   PUNC    $(  _   _   _   _   _   0   _   _   _
_   3   38  2162    7549    7559    Einteilung  Einteilung  NN  NN  _   dat|sg|fem  _   _   _   0   _   _   _
_   3   38  2163    7559    7560    .   .   PUNC    $.  _   _   _   _   _   0   _   _   _
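
The jump from sentence ID 36 back to 1 and then on to 37 can be detected mechanically. The following is a minimal sketch, assuming whitespace-separated rows without a header line in which the third column holds the sentence ID, as in the excerpt above. The class and method names are hypothetical and not part of the wrapper; the sample rows are abbreviated to their first eight columns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SentenceIdCheck {

    /**
     * Returns the 0-based indices of rows whose sentence ID is lower than
     * the sentence ID of the preceding row.
     * Assumption: the third column (index 2) holds the sentence ID.
     */
    public static List<Integer> findBackwardJumps(List<String> rows) {
        List<Integer> jumps = new ArrayList<>();
        int previous = Integer.MIN_VALUE;
        for (int i = 0; i < rows.size(); i++) {
            String[] cols = rows.get(i).trim().split("\\s+");
            if (cols.length < 3) {
                continue; // skip blank or truncated rows
            }
            int sentenceId = Integer.parseInt(cols[2]);
            if (sentenceId < previous) {
                jumps.add(i);
            }
            previous = sentenceId;
        }
        return jumps;
    }

    public static void main(String[] args) {
        // Three rows taken from the excerpt, cut after the lemma column.
        List<String> rows = Arrays.asList(
            "_   2   36  2153    7384    7385    .   .",
            "_   2   1   1098    7386    7391    Dafür  dafür",
            "_   3   37  2154    7524    7527    Das die");
        System.out.println(findBackwardJumps(rows)); // prints [1]
    }
}
```

Only backward jumps are reported, so the later step from 1 to 37 is not flagged; the single reported row marks where the numbering was reset.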

thvitt commented 8 years ago

Stack trace of the OOM:

java.lang.OutOfMemoryError: Java heap space
        at is2.data.DataFES.<init>(Unknown Source)
        at is2.parser.Pipe.fillVector(Unknown Source)
        at is2.parser.Parser.parse(Unknown Source)
        at is2.parser.Parser.apply(Unknown Source)
        at de.tudarmstadt.ukp.dkpro.core.matetools.MateParser.process(MateParser.java:226)
        at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
        at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:385)
        at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:308)
        at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.processUntilNextOutputCas(ASB_impl.java:570)
        at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.<init>(ASB_impl.java:412)
        at org.apache.uima.analysis_engine.asb.impl.ASB_impl.process(ASB_impl.java:344)
        at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.processAndOutputNewCASes(AggregateAnalysisEngine_impl.java:265)
        at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:269)
        at org.apache.uima.fit.pipeline.SimplePipeline.runPipeline(SimplePipeline.java:150)
        at de.tudarmstadt.ukp.dariah.pipeline.RunPipeline.main(RunPipeline.java:635)
thvitt commented 8 years ago

In the master branch, the first sentence returned by JCasUtil.select(jcas, Sentence.class) in this minimal example spans offsets 0 to 3378 (the whole first paragraph), which covers far too much. Older versions correctly return only the span up to the first full stop.
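
A segmentation regression like this could be caught before the parser runs by checking sentence lengths. The following is a minimal sketch, assuming the begin and end offsets have already been read from the Sentence annotations (for example via getBegin() and getEnd()). The class, the method, and the 1000-character threshold are hypothetical and chosen only for illustration; whether over-long sentences are the actual cause of the OOM is suggested by this thread but not confirmed.

```java
import java.util.ArrayList;
import java.util.List;

public class SentenceSpanCheck {

    /**
     * Returns the indices of all spans {begin, end} whose length in
     * characters exceeds maxChars.
     */
    public static List<Integer> findOverlongSpans(int[][] spans, int maxChars) {
        List<Integer> overlong = new ArrayList<>();
        for (int i = 0; i < spans.length; i++) {
            int length = spans[i][1] - spans[i][0];
            if (length > maxChars) {
                overlong.add(i);
            }
        }
        return overlong;
    }

    public static void main(String[] args) {
        // The first span is the 0..3378 range reported above; the second
        // is a made-up short span added for contrast.
        int[][] spans = { {0, 3378}, {3379, 3420} };
        System.out.println(findOverlongSpans(spans, 1000)); // prints [0]
    }
}
```

A check of this kind only reports suspicious spans; deciding whether to skip, split, or still parse them is left to the caller.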