Which CAS should CoveredTextExtractor get the text from? #296

fangfangli / cleartk

Automatically exported from code.google.com/p/cleartk

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago

Right now, CoveredTextExtractor essentially does:

  public List<Feature> extract(JCas jCas, Annotation focusAnnotation) {
    return Collections.singletonList(new Feature(focusAnnotation.getCoveredText()));
  }

This means we always get the text from the CAS that the focusAnnotation is 
associated with. But if the Annotation does not come from the jCas (e.g. it 
comes from a different view), is this the right behavior? We have two options:

(1) Get the text from the Annotation's CAS (current behavior)
(2) Get the text from the jCas

An argument for (1) is simplicity: it naturally corresponds to the 
.getCoveredText() method.

An argument for (2) is consistency with other extractors: if you pass them a 
different JCas, they'll give you different results. However, (2) would also 
require that we manually re-implement getCoveredText(), since there is no 
getCoveredText(JCas).
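
To make the difference between (1) and (2) concrete, here is a minimal sketch. It does not use the real UIMA types: `View` and `Span` are hypothetical stand-ins for a CAS view and an Annotation, and `coveredText(View, Span)` is an assumed shape for the manual re-implementation, not what ClearTK actually does. In real code, (2) would presumably take a substring of the jCas document text using the annotation's begin and end offsets.

```java
public class CoveredTextSketch {

  /** Hypothetical stand-in for a CAS view: only holds the (possibly null) document text. */
  public static final class View {
    public final String documentText;

    public View(String documentText) {
      this.documentText = documentText;
    }
  }

  /** Hypothetical stand-in for an Annotation: offsets plus the view it was created in. */
  public static final class Span {
    public final View owner;
    public final int begin;
    public final int end;

    public Span(View owner, int begin, int end) {
      this.owner = owner;
      this.begin = begin;
      this.end = end;
    }

    /** Option (1): read the text from the span's own view. */
    public String getCoveredText() {
      return coveredText(owner, this);
    }
  }

  /** Option (2): read the text from whichever view the caller passes in. */
  public static String coveredText(View view, Span span) {
    if (view.documentText == null) {
      return null; // assumption: a view without text yields null rather than throwing
    }
    return view.documentText.substring(span.begin, span.end);
  }

  public static void main(String[] args) {
    View original = new View("Hello World");
    View lowercased = new View("hello world");
    Span span = new Span(original, 6, 11); // created in the original view

    System.out.println(span.getCoveredText());         // option (1): prints "World"
    System.out.println(coveredText(lowercased, span)); // option (2): prints "world"
  }
}
```

With (1) the view argument is ignored entirely; with (2) the same offsets are applied to a different text, which is only meaningful if the two views are aligned character for character.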

Original issue reported on code.google.com by steven.b...@gmail.com on 28 Mar 2012 at 10:29

GoogleCodeExporter commented 9 years ago

Interesting. Is there a use case where we would actually want (2)? It seems 
pretty unlikely to me that you would have two views with different texts, where 
an annotation from one specifies a range of text you want to obtain from the 
other. I guess it's possible: perhaps the 2nd view is a lowercased copy of the 
1st, or something like that.

To be correct, I think we should either implement (2) or update (1) so that it 
throws an exception if the jCas passed in is not the same as that of the focus 
annotation. This may avoid some debugging pain.
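
The fail-fast variant of (1) could look roughly like the sketch below. Again, `View` and `Span` are hypothetical stand-ins for the UIMA view and Annotation types, and `extractStrict` is an invented name; the choice of `IllegalArgumentException` is also an assumption, not something settled in this thread.

```java
public class ViewGuardSketch {

  /** Hypothetical stand-in for a CAS view holding a document text. */
  public static final class View {
    public final String documentText;

    public View(String documentText) {
      this.documentText = documentText;
    }
  }

  /** Hypothetical stand-in for an Annotation: offsets plus the view it belongs to. */
  public static final class Span {
    public final View owner;
    public final int begin;
    public final int end;

    public Span(View owner, int begin, int end) {
      this.owner = owner;
      this.begin = begin;
      this.end = end;
    }
  }

  /**
   * Option (1) with a guard: refuse to extract when the caller passes a view
   * other than the one the span was created in (checked by object identity).
   */
  public static String extractStrict(View view, Span span) {
    if (span.owner != view) {
      throw new IllegalArgumentException(
          "focus annotation does not belong to the view that was passed in");
    }
    return view.documentText.substring(span.begin, span.end);
  }

  public static void main(String[] args) {
    View original = new View("Hello World");
    View other = new View("hello world");
    Span span = new Span(original, 6, 11);

    System.out.println(extractStrict(original, span)); // prints "World"
    try {
      extractStrict(other, span);
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

The guard turns a silent mismatch into an immediate error, at the cost of ruling out the cross-view use case altogether.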

However, my feeling is that this issue can be closed by simply documenting the 
method well enough to explain what it does. If that turns out to be 
insufficient for some actual use case, then we can change it.

Original comment by phi...@ogren.info on 30 Apr 2012 at 3:41

GoogleCodeExporter commented 9 years ago

The more likely use case for (2) is not that the text is different but that 
it's null in one CAS and not in the other. We actually ran into this issue with 
the relation extractor project.

Original comment by steven.b...@gmail.com on 30 Apr 2012 at 8:43

GoogleCodeExporter commented 9 years ago

Original comment by steven.b...@gmail.com on 24 Jul 2012 at 5:34

Added labels: Component-ml, Milestone-1.2

GoogleCodeExporter commented 9 years ago

This issue was closed by revision r3929.

Original comment by steven.b...@gmail.com on 25 Jul 2012 at 7:07

Changed state: Fixed