dkpro / dkpro-core

Collection of software components for natural language processing (NLP) based on the Apache UIMA framework.
https://dkpro.github.io/dkpro-core

MateSemanticRoleLabeler exception #1044

Closed triducnghiem closed 5 years ago

triducnghiem commented 7 years ago

An exception occurs when I run the MateSemanticRoleLabeler on certain sentences. One of them is shown in the following test:

        // This sentence causes an exception
        String text = "If a recipe calls for 2 1/2 cups of flour and you want to make five times the recipe, how much flour do you need?";

        // Segmenter
        AnalysisEngineDescription seg = createEngineDescription(StanfordSegmenter.class,
                StanfordSegmenter.PARAM_LANGUAGE, "en");
        // POS tagger and lemmatizer
        AnalysisEngineDescription posTagger = createEngineDescription(MatePosTagger.class);
        AnalysisEngineDescription lemmatizer = createEngineDescription(MateLemmatizer.class);
        // Parser
        AnalysisEngineDescription parser = createEngineDescription(MateParser.class);
        // SRL
        AnalysisEngineDescription SRL = createEngineDescription(MateSemanticRoleLabeler.class);
        // CoNLL 2009 writer
        AnalysisEngineDescription cc = createEngineDescription(Conll2009Writer.class);
        AnalysisEngine engine = createEngine(createEngineDescription(seg, lemmatizer, posTagger, parser, SRL, cc));
        JCas jcas = engine.newJCas();
        jcas.setDocumentLanguage("en");
        jcas.setDocumentText(text);
        DocumentMetaData metaData = new DocumentMetaData(jcas);
        metaData.setDocumentId("test this!");
        metaData.addToIndexes();
        engine.process(jcas);

The error message is:

org.apache.uima.analysis_engine.AnalysisEngineProcessException: Annotator processing failed.    
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:401)
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:308)
    at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.processUntilNextOutputCas(ASB_impl.java:570)
    at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.<init>(ASB_impl.java:412)
    at org.apache.uima.analysis_engine.asb.impl.ASB_impl.process(ASB_impl.java:344)
    at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.processAndOutputNewCASes(AggregateAnalysisEngine_impl.java:265)
    at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:269)
    at org.apache.uima.fit.pipeline.SimplePipeline.runPipeline(SimplePipeline.java:150)
    at de.idatase.kaggle.preprocessing.PreprocessingPipeline.SRL(PreprocessingPipeline.java:93)
    at de.idatase.kaggle.preprocessing.PreprocessingPipeline.main(PreprocessingPipeline.java:131)
Caused by: java.lang.NumberFormatException: For input string: "_"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:580)
    at java.lang.Integer.parseInt(Integer.java:615)
    at se.lth.cs.srl.corpus.Word.<init>(Word.java:108)
    at se.lth.cs.srl.corpus.Sentence.newDepsOnlySentence(Sentence.java:141)
    at de.tudarmstadt.ukp.dkpro.core.matetools.MateSemanticRoleLabeler.process(MateSemanticRoleLabeler.java:183)
    at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:385)
    ... 9 more

Apr 21, 2017 10:47:14 AM org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl processAndOutputNewCASes(273)
SEVERE: Exception occurred

Besides, as far as I know, the pipeline terminates whenever an exception occurs while processing a CAS. Is there any way to keep it running and log the unprocessed CAS and the error messages somewhere (a log file or stderr)?

reckart commented 7 years ago

The uimaFIT SimplePipeline doesn't support continuing after an error and logging it. You could clone SimplePipeline and just catch exceptions and carry on processing. The code is - well - simple ;)
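
A minimal sketch of the catch-and-continue idea, using only the Java standard library. The names TolerantRunner, Step and runAll are hypothetical and are not part of uimaFIT; in a real pipeline the step would be something like the engine.process(jcas) call from the test above, and the documents would come from a reader.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper (not uimaFIT API): processes documents and keeps going on failure. */
class TolerantRunner {
    private static final Logger LOG = Logger.getLogger(TolerantRunner.class.getName());

    /** Stand-in for a processing step such as engine.process(jcas); it may throw. */
    interface Step<T> {
        void process(T document) throws Exception;
    }

    /** Runs the step on every document, logs failures, and returns the failed documents. */
    static <T> List<T> runAll(Iterable<T> documents, Step<T> step) {
        List<T> failed = new ArrayList<>();
        for (T doc : documents) {
            try {
                step.process(doc);
            } catch (Exception e) {
                // Log the error with its stack trace and continue with the next document.
                LOG.log(Level.SEVERE, "Skipping document: " + doc, e);
                failed.add(doc);
            }
        }
        return failed;
    }
}
```

The returned list makes it possible to re-run or inspect the documents that could not be processed once the rest of the collection is done.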

triducnghiem commented 7 years ago

Thanks Richard, I found it easy too :-). By the way, I found the problem. Sometimes the Stanford tokenizer returns a token that contains whitespace, in this case "2 1/2", and that breaks the WHITESPACE_PATTERN.split(...) call in Sentence.java (in matetools). Matetools doesn't provide a tokenizer itself, so I used the Stanford tokenizer provided by StanfordSegmenter (in DKPro), which by default doesn't support setting tokenizer parameters such as normalizeSpace (as far as I can see in the source code).
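
The failure mode can be reproduced in isolation. The sketch below is an illustration only: the class WhitespaceSplitDemo, its methods and the CoNLL-2009-style column layout are assumptions, not the actual matetools code. It shows how a token containing a space shifts the columns after a whitespace split, so that a "_" placeholder lands where the head index is expected.

```java
import java.util.regex.Pattern;

/** Hypothetical demo (not matetools code) of a whitespace split on a token with a space. */
class WhitespaceSplitDemo {
    static final Pattern WHITESPACE_PATTERN = Pattern.compile("\\s+");

    // Assumed CoNLL-2009-style columns: ID FORM LEMMA PLEMMA POS PPOS FEAT PFEAT HEAD
    static final int HEAD_COLUMN = 8;

    /** Builds one tab-separated row; unknown columns are filled with "_". */
    static String row(int id, String form, String pos, int head) {
        return String.join("\t", Integer.toString(id), form, "_", "_",
                pos, pos, "_", "_", Integer.toString(head));
    }

    /** Splits the row on whitespace and parses the head column. */
    static int parseHead(String row) {
        String[] cols = WHITESPACE_PATTERN.split(row);
        return Integer.parseInt(cols[HEAD_COLUMN]);
    }

    /** Possible workaround: replace whitespace inside a token with U+00A0,
     *  which Java's default \s does not match. */
    static String normalizeForm(String form) {
        return WHITESPACE_PATTERN.matcher(form).replaceAll("\u00A0");
    }
}
```

With these assumptions, parseHead(row(6, "2 1/2", "CD", 7)) throws a NumberFormatException for the input string "_", while parseHead(row(6, normalizeForm("2 1/2"), "CD", 7)) parses the head as expected.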

reckart commented 7 years ago

That's right. These parameters depend on which tokenizer is used internally, and that in turn depends on the document language / language parameter. The current code already contains an additionalOptions field in the StanfordSegmenter class, which seems to be a step towards allowing users to provide such parameters. However, it is presently unused... should be fixed...

reckart commented 5 years ago

I think this can be closed.