TreebankGoldAnnotator does not recognize document text in default view

GoogleCodeExporter commented 9 years ago

There appears to be a minor bug on line 106 of 
org.cleartk.syntax.constituent.TreebankGoldAnnotator. The fix is to replace 
that line with the following: 
String docText = docView.getDocumentText();

jCas.getDocumentText(), when invoked here, seems to always return null, even 
when the default view has been populated with text.

What steps will reproduce the problem?
1. Create a pipeline that populates the default view with text prior to calling 
TreebankGoldAnnotator.process()

What is the expected output? What do you see instead?
The docText variable should be set to the document text in the default view on 
line 106, but it is instead always set to null, which causes a call to 
TreebankFormatParser.inferPlainText() on line 115.

What version of the product are you using? On what operating system?
trunk, OSX

Original issue reported on code.google.com by bill.bau...@gmail.com on 11 Mar 2011 at 12:03

GoogleCodeExporter commented 9 years ago

I agree this looks like a bug, but I could not figure out how to write a test 
that provokes the error you're seeing. Could you propose an additional test for 
TreebankGoldReaderAndAnnotatorTest?

Original comment by steven.b...@gmail.com on 11 Mar 2011 at 8:18

GoogleCodeExporter commented 9 years ago

This may be less of a bug and more of a gap in my understanding of how uimaFIT 
works with views. I'll let you be the judge. The following modified version of 
your test case reproduces the error I observed. (Hopefully this formats nicely)

{{{
@Test
    public void testWhenDefaultViewDocumentTextIsSet() throws Exception {
        String treebankParse = "( (X (NP (NP (NML (NN Complex ) (NN trait )) (NN analysis )) (PP (IN of ) (NP (DT the ) (NN mouse ) (NN striatum )))) (: : ) (S (NP-SBJ (JJ independent ) (NNS QTLs )) (VP (VBP modulate ) (NP (NP (NN volume )) (CC and ) (NP (NN neuron ) (NN number)))))) )";
//      String expectedText = "Complex trait analysis of the mouse striatum: independent QTLs modulate volume and neuron number";
        String expectedText = "Complex  trait  analysis  of  the  mouse  striatum  :  independent  QTLs  modulate  volume  and  neuron  number";

        /* set the document text for the default view as it might be set by a collection reader, e.g. {@link FilesCollectionReader} */
        JCas view = ViewCreatorAnnotator.createViewSafely(jCas, CAS.NAME_DEFAULT_SOFA);
        view.setSofaDataString(expectedText, "text/plain");

        AnalysisEngine engine = AnalysisEngineFactory.createPrimitive(TreebankGoldAnnotator.class,
                typeSystemDescription);
        TreebankGoldAnnotator treebankGoldAnnotator = new TreebankGoldAnnotator();
        treebankGoldAnnotator.initialize(engine.getUimaContext());

        JCas tbView = jCas.createView(TreebankConstants.TREEBANK_VIEW);
        tbView.setDocumentText(treebankParse);

//      treebankGoldAnnotator.process(jCas);
        engine.process(jCas);

        JCas goldView = jCas.getView(CAS.NAME_DEFAULT_SOFA);

        FSIndex<Annotation> sentenceIndex = goldView.getAnnotationIndex(Sentence.type);
        assertEquals(1, sentenceIndex.size());

        Sentence firstSentence = JCasUtil.selectByIndex(goldView, Sentence.class, 0);
        assertEquals(expectedText, firstSentence.getCoveredText());
    }
}}}

Note the changes to expectedText (I've simply added extra spaces to make it 
different from what TreebankFormatParser.inferPlainText() produces) and the 
setting of the document text for the default view. The real difference, 
however, is the commenting out of
treebankGoldAnnotator.process(jCas);
and the addition of
engine.process(jCas);

Using treebankGoldAnnotator.process(jCas), the test passes and all is fine.
Using engine.process(jCas), which is what I was using when I ran into this 
issue, results in an exception (org.apache.uima.cas.CASRuntimeException: Data 
for Sofa feature setLocalSofaData() has already been set.)

The suggested fix I mentioned in my initial posting resolves this issue when 
using engine.process(jCas). I'm now wondering if this is not necessarily a bug, 
but a lack of understanding on my part.

Can you perhaps shed some light on why the separate initialization of a 
TreebankGoldAnnotator (lines 63 and 64 in TreebankGoldReaderAndAnnotatorTest) 
is necessary, and what those lines do that 
AnalysisEngineFactory.createPrimitive(TreebankGoldAnnotator.class, 
typeSystemDescription) does not?

Thanks,
Bill

Original comment by bill.bau...@gmail.com on 11 Mar 2011 at 6:37

GoogleCodeExporter commented 9 years ago

Bill, yep, this is a bug. Thanks for pointing it out and providing the test. 
This is somewhat confusing because the default CAS is used for the docView, 
so you might expect that calling jCas.getDocumentText() would work anyway. 
See the javadoc for the SofaCapability annotation definition for an 
explanation.

I have fixed this in r2794.
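
For readers following the thread, here is a toy model of the failure mode discussed above. It is a sketch under assumptions, not UIMA or ClearTK code: ToyBaseCas, ToyView, annotateBuggy and annotateFixed are hypothetical names invented for illustration. It assumes, based on the reports in this thread, that the object handed to the annotator by the engine has no document text of its own, and that a view's text can only be set once.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model only: none of these types are UIMA or ClearTK classes. */
final class SofaAwareSketch {

    static final String DEFAULT_VIEW = "_InitialView";

    /** Stand-in for a view whose text may be set only once. */
    static final class ToyView {
        private String text;

        String getDocumentText() {
            return text;
        }

        void setDocumentText(String newText) {
            if (text != null) {
                throw new IllegalStateException("text has already been set");
            }
            text = newText;
        }
    }

    /** Stand-in for the object passed to process(): it has no text of its own. */
    static final class ToyBaseCas {
        private final Map<String, ToyView> views = new HashMap<>();

        String getDocumentText() {
            return null; // assumption: text lives only in named views
        }

        ToyView getView(String name) {
            // Simplification: creates the view on first access.
            return views.computeIfAbsent(name, n -> new ToyView());
        }
    }

    /** Mirrors the reported bug: reads text from the wrong object, then overwrites. */
    static String annotateBuggy(ToyBaseCas cas, String inferredText) {
        ToyView docView = cas.getView(DEFAULT_VIEW);
        String docText = cas.getDocumentText(); // null in this model
        if (docText == null) {
            docText = inferredText;
            docView.setDocumentText(docText); // throws if the view already has text
        }
        return docText;
    }

    /** Mirrors the suggested fix: reads text from the explicitly requested view. */
    static String annotateFixed(ToyBaseCas cas, String inferredText) {
        ToyView docView = cas.getView(DEFAULT_VIEW);
        String docText = docView.getDocumentText();
        if (docText == null) {
            docText = inferredText;
            docView.setDocumentText(docText);
        }
        return docText;
    }

    public static void main(String[] args) {
        ToyBaseCas cas = new ToyBaseCas();
        // A collection reader would populate the default view like this.
        cas.getView(DEFAULT_VIEW).setDocumentText("text set by the reader");

        System.out.println("fixed: " + annotateFixed(cas, "inferred text"));
        try {
            annotateBuggy(cas, "inferred text");
        } catch (IllegalStateException e) {
            System.out.println("buggy: " + e.getMessage());
        }
    }
}
```

In this model the fixed variant leaves the reader's text alone, while the buggy variant falls through to the inferred text and fails when it tries to set text on a view that is already populated, which matches the shape of the exception reported earlier in the thread.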

Original comment by phi...@ogren.info on 13 Mar 2011 at 3:53

Changed state: Fixed

TreebankGoldAnnotator does not recognize document text in default view #236