Sequential classifier wrapper for non-sequential classifiers

GoogleCodeExporter commented 9 years ago

I think we have a problem in how we are thinking about using non-sequential
classifiers for sequential tagging tasks.  For starters, the notion that
you are going to do separate feature extraction for the sequential case vs.
the non-sequential case is just silly.  The only difference between the two
is that in the sequential case you want to add features based on the
classifications of the previous instances.  It would be easy enough to
provide a generic wrapper for all non-sequential classifiers so that we
aren't constantly breaking our code up into "sequential" and
"non-sequential" mode.  

Equally silly, though, is the notion of doing a sequential tagging task
with a non-sequential classifier without a Viterbi beam search of some
kind.  This is pretty basic.  

I suggest that we create a sequential classifier that wraps any
non-sequential classifier so that it:
- does something sensible with converting previous classifications into
features
- provides a mechanism to specify a Viterbi-style search of some kind.  We
should be able to come up with a handful (one per sequential classifier?) of
these to cover all of our sequential classifiers.
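
As a rough illustration of the two points above, here is a minimal beam-search
sketch.  Everything in it is hypothetical: ScoringClassifier,
BeamSearchSequenceTagger and the "PreviousOutcome=" feature name are stand-ins
invented for this example, not ClearTK classes, and it assumes the wrapped
classifier returns additive scores (e.g. log-probabilities) for at least one
outcome per instance.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a non-sequential classifier that scores every outcome. */
interface ScoringClassifier {
    Map<String, Double> score(List<String> features);
}

/** Hypothetical wrapper that tags a sequence with a non-sequential classifier. */
class BeamSearchSequenceTagger {
    private final ScoringClassifier delegate;
    private final int beamSize;

    BeamSearchSequenceTagger(ScoringClassifier delegate, int beamSize) {
        this.delegate = delegate;
        this.beamSize = beamSize;
    }

    /** One partial labeling of the sequence plus its accumulated score. */
    private static final class Hypothesis {
        final List<String> outcomes;
        final double score;

        Hypothesis(List<String> outcomes, double score) {
            this.outcomes = outcomes;
            this.score = score;
        }
    }

    /** Each element of sequence is the feature list of one instance. */
    List<String> classifySequence(List<List<String>> sequence) {
        List<Hypothesis> beam = new ArrayList<>();
        beam.add(new Hypothesis(new ArrayList<String>(), 0.0));
        for (List<String> instanceFeatures : sequence) {
            List<Hypothesis> expanded = new ArrayList<>();
            for (Hypothesis h : beam) {
                // Convert the previous classification of this hypothesis into a feature.
                List<String> features = new ArrayList<>(instanceFeatures);
                String previous = h.outcomes.isEmpty()
                        ? "<START>" : h.outcomes.get(h.outcomes.size() - 1);
                features.add("PreviousOutcome=" + previous);
                // Extend the hypothesis with every outcome the classifier scores.
                for (Map.Entry<String, Double> e : delegate.score(features).entrySet()) {
                    List<String> extended = new ArrayList<>(h.outcomes);
                    extended.add(e.getKey());
                    expanded.add(new Hypothesis(extended, h.score + e.getValue()));
                }
            }
            // Keep only the beamSize best partial labelings.
            expanded.sort(Comparator.comparingDouble((Hypothesis hyp) -> hyp.score).reversed());
            beam = new ArrayList<>(expanded.subList(0, Math.min(beamSize, expanded.size())));
        }
        return beam.get(0).outcomes;
    }
}
```

With a beam size of 1 this degenerates to greedy left-to-right tagging; a
larger beam lets a later instance overturn an early decision.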

Original issue reported on code.google.com by pvogren@gmail.com on 21 Feb 2009 at 12:36

GoogleCodeExporter commented 9 years ago

The more I think about this, the more obvious it is to me.  I have been doing a
bunch of sequential style tagging - and it makes no sense to be doing the
acrobatics that I do to make the feature extraction work correctly for both
scenarios.  This is going to simplify my code in a number of places.  (duh!)

Original comment by pvogren@gmail.com on 21 Feb 2009 at 12:53

GoogleCodeExporter commented 9 years ago

my working name for this package is:

org.cleartk.classifier.nonsequential

and in it I have classes such as:

SequentialDataWriter
SequentialClassifier
SequentialBuilder

This does not seem to be a perfect naming scheme - but I do think it
communicates well what it is.  If you have suggestions please post them.

Original comment by pvogren@gmail.com on 27 Feb 2009 at 5:22

GoogleCodeExporter commented 9 years ago

It definitely doesn't communicate it to me. I had to go to the ticket and read
the whole discussion before I could figure out what that meant. I would have
called it:

org.cleartk.classifier.sequential

and then called the classes

NonsequentialWrappingDataWriter
NonsequentialWrappingClassifier
NonsequentialWrappingBuilder

The final classifier is a sequential classifier, right? If so, calling the
package nonsequential seems misleading to me...

Original comment by steven.b...@gmail.com on 27 Feb 2009 at 5:52

GoogleCodeExporter commented 9 years ago

Yes.  The final classifier is a sequential classifier.  That's why I liked the
name SequentialClassifier.  Putting it in a package called 'nonsequential' was
supposed to evoke the notion that we are creating sequential classifiers from
nonsequential ones.  But I wasn't sure that it did, which is why I asked.  I
don't think your naming scheme does any better at achieving this though, and it
has essentially the same problem - calling a sequential classifier a
Nonsequential...Classifier seems misleading.

Philipp - do you have a suggestion?  If not, would you be willing to vote on
one of the above?  For my part I think they are about equal (equally bad that
is).

Original comment by pvogren@gmail.com on 27 Feb 2009 at 6:49

GoogleCodeExporter commented 9 years ago

Here is an update of what I have done on this ticket thus far.  I initially
decided to create a new "NonsequentialSequential" wrapper data writer and
classifier.  At this point I am only concentrating on making use of previous
outcomes as features.  My strategy was to have the data writer inherit from
DelegatingDataWriter such that consumeAll is overridden - where an
OutcomeFeatureExtractor can be run on the sequence of instances.  However, it
turns out that DelegatingDataWriter.consumeAll (and consume - ticket needed for
this) is never called - so this strategy was not as straightforward as I had
initially hoped.

Instead, I decided to modify DataWriter_ImplBase and Classifier_ImplBase
directly.  I made the following changes:

 * DataWriter_ImplBase: has a member variable called OutcomeFeatureExtractor - this
is initialized in the initialize method via a parameter that gives the class
name of the extractor.  The extractor gets all of the outcomes that are
available from the instances that come before in the sequence and turns these
outcomes into features as it sees fit.  The method consumeAll adds features
generated by the OutcomeFeatureExtractor to each instance.  The outcome feature
extractor is serialized to the output directory in collectionProcessComplete.
 * BuildJar: writes the serialized outcome feature extractor to the jar file.
 * Classifier_ImplBase: has a member variable called OutcomeFeatureExtractor - this
is initialized from the jar file if the jar file contains one.  If not, then the
extractor will be an instance of OutcomeFeatureExtractor which returns an empty
list (i.e. does nothing - does not add new features) - this means that existing
models will not need to be rebuilt.  Classifier_ImplBase.classifySequence does
exactly the same feature extraction done in DataWriter_ImplBase.consumeAll
immediately before calling classify() on each list of features.
 * ClassifierAnnotator always calls the classifySequence method of its classifier
rather than classify on each instance passed in.
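
To make the classifySequence behaviour in the list above concrete, here is a
small sketch.  It is only an approximation: PreviousOutcomeFeatures and
GreedySequenceClassifier are hypothetical names, features are plain strings,
and the wrapped non-sequential classify() is modelled as a
java.util.function.Function.  The point it shows is that at classification time
the "previous outcomes" are the classifier's own earlier predictions, and that
a no-op extractor leaves existing models untouched.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

/** Hypothetical, simplified stand-in for the outcome feature extractor. */
interface PreviousOutcomeFeatures {
    /** Does nothing; the fallback for models built without an extractor. */
    PreviousOutcomeFeatures NO_OP = previousOutcomes -> Collections.emptyList();

    List<String> extractFeatures(List<String> previousOutcomes);
}

/** Hypothetical wrapper around a non-sequential classify() call. */
class GreedySequenceClassifier {
    private final Function<List<String>, String> classify;
    private final PreviousOutcomeFeatures extractor;

    GreedySequenceClassifier(Function<List<String>, String> classify,
                             PreviousOutcomeFeatures extractor) {
        this.classify = classify;
        this.extractor = extractor;
    }

    /** Each element of sequence is the feature list of one instance. */
    List<String> classifySequence(List<List<String>> sequence) {
        List<String> outcomes = new ArrayList<>();
        for (List<String> instanceFeatures : sequence) {
            // Same extraction as at training time, but over predicted outcomes.
            List<String> features = new ArrayList<>(instanceFeatures);
            features.addAll(extractor.extractFeatures(Collections.unmodifiableList(outcomes)));
            outcomes.add(classify.apply(features));
        }
        return outcomes;
    }
}
```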

I have created a new feature extractor at
src/org/cleartk/classifier/feature/extractor/outcome/OutcomeFeatureExtractor.java
which does nothing.  I have also created a subclass called
DefaultOutcomeFeatureExtractor which creates what I think to be a reasonable
default set of features from available previous outcomes (i.e. the previous
outcome, the outcome previous to that, n-gram features from previous outcomes -
all configurable).  Absent any configuration information, the
DataWriter_ImplBase will use DefaultOutcomeFeatureExtractor.  If the handler
does not want any additional features, then it will have to specify to use
OutcomeFeatureExtractor.
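
For readers who want a picture of what such default features might look like,
here is a sketch.  It is not the actual DefaultOutcomeFeatureExtractor:
the class name, the two constructor parameters and the feature-name format are
all assumptions made for this example, and features are plain strings rather
than ClearTK Feature objects.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a configurable "previous outcomes" feature extractor. */
class PreviousOutcomeFeatureSketch {
    private final int window;    // how many previous outcomes become individual features
    private final int ngramSize; // length of the n-gram over the latest outcomes (< 2 disables it)

    PreviousOutcomeFeatureSketch(int window, int ngramSize) {
        this.window = window;
        this.ngramSize = ngramSize;
    }

    List<String> extractFeatures(List<?> previousOutcomes) {
        List<String> features = new ArrayList<>();
        int n = previousOutcomes.size();
        // One feature per previous outcome, nearest first:
        // PreviousOutcome_1 is the outcome immediately before the current instance.
        for (int i = 1; i <= window && i <= n; i++) {
            features.add("PreviousOutcome_" + i + "=" + previousOutcomes.get(n - i));
        }
        // One n-gram feature joining the last ngramSize outcomes in sequence order.
        if (ngramSize > 1 && n >= ngramSize) {
            StringBuilder ngram = new StringBuilder();
            for (int i = n - ngramSize; i < n; i++) {
                if (i > n - ngramSize) {
                    ngram.append('_');
                }
                ngram.append(previousOutcomes.get(i));
            }
            features.add("PreviousOutcomes_" + ngramSize + "gram=" + ngram);
        }
        return features;
    }
}
```

At the start of a sequence the outcome list is empty, so the sketch simply
returns no features for the first instance.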

I have unit tested the changes to DataWriter_ImplBase and
DefaultOutcomeFeatureExtractor and none of the existing tests broke.  I don't
know how to test the changes to Classifier_ImplBase - suggestions are welcome.
I will commit changes if I don't get any negative feedback.

Original comment by pvogren@gmail.com on 2 Mar 2009 at 8:59

GoogleCodeExporter commented 9 years ago

This sounds really complicated, and I don't fully understand what it is you've
done. Can you try explaining it again? In particular:

* If I'm writing a new DataWriter, what do I need to do differently?
* If I'm writing a new AnnotationHandler, what do I need to do differently?
* Before, if you tried to call consume() instead of consumeAll() on a sequential
classifier, you'd get an exception explaining your mistake. Is this still true?

Also, if you haven't, talk to 3P about this because it seems like a bunch of
changes to our encoder stuff (and I'm not sure he reads the issues).

Original comment by steven.b...@gmail.com on 3 Mar 2009 at 1:34

GoogleCodeExporter commented 9 years ago

Darn!  I really simplified my approach and it really is pretty straightforward
- so apologies for the obtuse description.  Here are the basic assumptions - if
you call Classifier_ImplBase.classifySequence or DataWriter_ImplBase.consumeAll,
then we are going to assume that you are performing a sequential tagging task.
If you are calling classify and consume, then you are performing a
non-sequential tagging task.  Therefore, if you are an annotation handler
performing a sequential tagging task, then you should call consumeAll.  This is
the main driver for this change - I was previously calling either method in my
sequential tagging annotation handlers depending on what classifier I was
using.  This was creating unnecessarily confusing code.

The basic idea is this.  If your annotation handler is calling consumeAll with a
nonsequential classifier, then you want something sensible done with previous
outcomes as features for the current instance.  Instead of making each
annotation handler worry about this, Classifier_ImplBase and
DataWriter_ImplBase do this for you.  If you don't want them to do anything for
you, then you can tell them not to.  Otherwise, they will run a "previous
outcome" feature extractor (which you specify - or use the default one) before
passing each instance to the classify and consume methods.  Here is what
DataWriter_ImplBase.consumeAll looks like now:

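{{{
// Requires java.util.ArrayList and java.util.List.
// Adds "previous outcome" features to each instance before passing it to consume().
public List<INPUTOUTCOME_TYPE> consumeAll(List<Instance<INPUTOUTCOME_TYPE>> instances) {
    // Outcomes of the instances already seen in this sequence.
    List<Object> outcomes = new ArrayList<Object>();
    for (Instance<INPUTOUTCOME_TYPE> instance : instances) {
        // Features are derived only from the outcomes that precede this instance.
        List<Feature> instanceFeatures = instance.getFeatures();
        instanceFeatures.addAll(outcomeFeatureExtractor.extractFeatures(outcomes));
        outcomes.add(instance.getOutcome());
        consume(instance);
    }
    return null;
}
}}}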

In general, this will have very little effect on new DataWriters.  If, like
most of our DataWriters, you don't override consumeAll, then the "previous
outcome" feature extractor will run.  If you decide to override this method
(why?), then it won't.

This does not affect sequential classifiers at all, so you will get the same
exception as you have always had.

Apologies again for my poor description.  I think the code is actually pretty
simple and there aren't that many new lines of code.  It would be easy enough
to rip out if I commit it and there are objections to it.

Original comment by pvogren@gmail.com on 3 Mar 2009 at 5:28

GoogleCodeExporter commented 9 years ago

How do you make sure that the DataWriter and the ClassifierAnnotator use the
same OutcomeFeatureExtractor? Is it saved with the model?

Original comment by steven.b...@gmail.com on 3 Mar 2009 at 6:22

GoogleCodeExporter commented 9 years ago

Yes.  The outcome feature extractor is serialized by the data writer, the jar
builder adds it to model.jar, and Classifier_ImplBase instantiates it from the
jar file.

Original comment by pvogren@gmail.com on 4 Mar 2009 at 4:14

GoogleCodeExporter commented 9 years ago

Hmm. That's a little weird - we don't serialize any other feature extractors,
only feature encoders. If 3P's okay with it though, I'm okay with it.

Original comment by steven.b...@gmail.com on 4 Mar 2009 at 5:06

GoogleCodeExporter commented 9 years ago

The way I understood it when we talked about it: While the outcome feature
extractor does extract features from the outcomes, it's not a feature extractor
in the same sense as all the other ones (i.e. it doesn't extract features from
a CAS, and it's not used from within the annotation handler). Since we don't
have a "FeatureExtractor" interface (or base class) this relationship (or
non-relationship) is not captured in code.

That said, and given Steve's understandable confusion, I'd consider naming it
something other than a feature extractor, to make that distinction clearer.

Original comment by phwetz...@gmail.com on 4 Mar 2009 at 6:12

GoogleCodeExporter commented 9 years ago

Yeah, I was definitely confused by the fact that it's called a
FeatureExtractor, and all our previous references to the term FeatureExtractor
meant extracting features from a JCas, while this means extracting features
from a list of Outcomes.

Maybe we should distinguish JCasFeatureExtractors from
OutcomeListFeatureExtractors?

I also wonder about there being only a single OutcomeFeatureExtractor. I can
imagine cases where I would want more, for example, in semantic role chunking,
I might want both the last tag, e.g. I-ARG2, and the list of completed args,
e.g. ARG0 and ARG1. This can be done with a single extractor if we also
introduce something like CombinedExtractor for OutcomeListFeatureExtractors.
But maybe it makes more sense to support multiple OutcomeListFeatureExtractors
from the start?
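
A rough sketch of that semantic role chunking example, to make the idea
concrete.  The interface and the three classes are hypothetical and use plain
string features; the sketch also assumes BIO-style tags of the form "B-ARG0",
"I-ARG0" and "O", and counts an argument as completed once a different chunk or
an O tag follows it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical interface for extractors that work on a list of previous outcomes. */
interface OutcomeListFeatureExtractor {
    List<String> extractFeatures(List<String> previousOutcomes);
}

/** Emits the most recent tag, e.g. LastTag=I-ARG2. */
class LastTagExtractor implements OutcomeListFeatureExtractor {
    public List<String> extractFeatures(List<String> previousOutcomes) {
        List<String> features = new ArrayList<>();
        if (!previousOutcomes.isEmpty()) {
            features.add("LastTag=" + previousOutcomes.get(previousOutcomes.size() - 1));
        }
        return features;
    }
}

/** Emits one feature per argument whose chunk has already ended. */
class CompletedArgsExtractor implements OutcomeListFeatureExtractor {
    public List<String> extractFeatures(List<String> previousOutcomes) {
        Set<String> completed = new LinkedHashSet<>();
        String open = null; // label of the chunk currently being built, e.g. "ARG2"
        for (String tag : previousOutcomes) {
            String label = tag.equals("O") ? null : tag.substring(2); // strip "B-" or "I-"
            boolean continues = tag.startsWith("I-") && label.equals(open);
            if (!continues) {
                if (open != null) {
                    completed.add(open); // the previous chunk just ended
                }
                open = label;
            }
        }
        List<String> features = new ArrayList<>();
        for (String label : completed) {
            features.add("CompletedArg=" + label);
        }
        return features;
    }
}

/** Runs several extractors and concatenates their features in order. */
class CombinedOutcomeExtractor implements OutcomeListFeatureExtractor {
    private final List<OutcomeListFeatureExtractor> extractors;

    CombinedOutcomeExtractor(OutcomeListFeatureExtractor... extractors) {
        this.extractors = Arrays.asList(extractors);
    }

    public List<String> extractFeatures(List<String> previousOutcomes) {
        List<String> features = new ArrayList<>();
        for (OutcomeListFeatureExtractor extractor : extractors) {
            features.addAll(extractor.extractFeatures(previousOutcomes));
        }
        return features;
    }
}
```

Holding a list of extractors directly in the data writer and classifier would
give the same result as the combined wrapper without the extra class.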

Original comment by steven.b...@gmail.com on 4 Mar 2009 at 8:25

GoogleCodeExporter commented 9 years ago

Ok.  I just talked with Philipp about the name.  One reason not to call it a
feature extractor is because it widens our definition of a feature extractor to
anything that creates features - rather than simply extracting data out of the
CAS for features which is what all of the other feature extractors do.  On the
other hand, not calling it a feature extractor may cause just as much confusion
because we have to come up with a name that implies feature creation (maybe
FeatureCreator?) without calling it the same thing as all the other things that
create features.  What do you think of OutcomeFeatureCreator?

That's a nice example.  It is easy enough to allow multiple
OutcomeFeatureCreators.  So, I agree.

How about I make this change and commit it?  It doesn't really affect existing
code very much and will be easy enough to rip out if we don't like it.

Original comment by pvogren@gmail.com on 4 Mar 2009 at 9:10

GoogleCodeExporter commented 9 years ago

I think OutcomeFeatureCreator isn't any better than OutcomeFeatureExtractor -
they still sound too much the same. So I think leave it as
OutcomeFeatureExtractor, and we just need to make sure to document this
clearly.

After the change to allow for multiple OutcomeFeatureExtractors, I'm fine with
you committing it.

Original comment by steven.b...@gmail.com on 4 Mar 2009 at 9:22

GoogleCodeExporter commented 9 years ago

Ok - I committed code that handles the OutcomeFeatureExtractors.  You can
specify as many outcome feature extractors as you want, and they are
initialized into an array of extractors.  I added some unit tests which test
the data writer side of things.

If we like this code, then I am inclined to close this issue and start a new
one for doing Viterbi search.

Original comment by pvogren@gmail.com on 6 Mar 2009 at 5:27

GoogleCodeExporter commented 9 years ago

I will open a separate issue for the Viterbi search piece that remains from
this issue.  I am satisfied with the outcome feature extraction bit.

Original comment by pvogren@gmail.com on 13 Mar 2009 at 8:10

Changed state: Fixed

Tilakkumar / cleartk

Sequential classifier wrapper for non-sequential classifiers #57