Closed GoogleCodeExporter closed 8 years ago
[deleted comment]
[deleted comment]
The more I think about this, the more obvious it is to me. I have been doing a
bunch
of sequential style tagging - and it makes no sense to be doing the acrobatics
that I
do to make the feature extraction work correctly for both scenarios. This is
going
to simplify my code in a number of places. (duh!)
Original comment by pvogren@gmail.com
on 21 Feb 2009 at 12:53
my working name for this package is:
org.cleartk.classifier.nonsequential
and in it I have classes such as:
SequentialDataWriter
SequentialClassifier
SequentialBuilder
This does not seem to be a perfect naming scheme - but I do think it
communicates
well what it is. If you have suggestions please post them.
Original comment by pvogren@gmail.com
on 27 Feb 2009 at 5:22
It definitely doesn't communicate it to me. I had to go to the ticket and read
the
whole discussion before to figure out what that meant. I would have called it:
org.cleartk.classifier.sequential
and then called the classes
NonsequentialWrappingDataWriter
NonsequentialWrappingClassifier
NonsequentialWrappingBuilder
The final classifier is a sequential classifier right? If so, calling the
package
nonsequential seems misleading to me...
Original comment by steven.b...@gmail.com
on 27 Feb 2009 at 5:52
Yes. The final classifier is a sequential classifier. That's why I liked the
name
SequentialClassifier. Putting it in a package called 'nonsequential' was
supposed to
evoke the notion we are creating sequential classfiers from nonsequential ones.
But
I wasn't sure that was the case which is why I asked. I don't think your naming
scheme does any better at achieving this though and has essentially the same
problem
- calling a sequenial classifier a Nonsequential...Classifier seems misleading.
Philipp - do you have a suggestion? If not, would you be willing to vote on
one of
the above. For my part I think they are about equal (equally bad that is).
Original comment by pvogren@gmail.com
on 27 Feb 2009 at 6:49
Here is an update of what I have done on this ticket thus far. I initially
decided
to create a new "NonsequentialSequential" wrapper data writer and classifier.
At
this point I am only concentrating on making use of previous outcomes as
features.
My strategy was to have the data writer inherit from DelegatingDataWriter such
that
consumeAll is overridden - where an OutcomeFeatureExtractor can be run on the
sequence of instances. However, it turns out that
DelegatingDataWriter.consumeAll
(and consume - ticket needed for this) is never called - so this strategy was
not as
straightforward as I had initially hoped.
Instead, I decided to directly modify DataWriter_ImplBase and
Classifier_ImplBase
directly. I made the following changes:
* DataWriter_ImplBase: has a member variable called OutcomeFeatureExtractor - this
is initialized in the initialize method via a parameter that gives the class
name of
the extractor. The extractor gets all of the outcomes that are available from
the
instances that come before in the sequence and turns these outcomes into
features as
it sees fit. The method consumeAll adds features generated by the
OutcomeFeatureExtractor to each instance. The outcome feature extractor is
serialized to the output directory in collectionProcessComplete.
* BuildJar: writes the serialized outcome feature extractor to the jar file
* Classifier_ImplBase: has a member variable called OutcomeFeatureExtractor - this
is initialized from the jar file if the jar file contains one. If not, then the
extractor will be an instance of OutcomeFeatureExtractor which returns an empty
list
(i.e. does nothing - does not add new features) - this means that existing
models
will not need to be rebuilt. Classifier_ImplBase.classifySequence does exactly
the
same feature extraction done in DataWriter_ImplBase.consumeAll immediately
before
calling classify() on each list of features.
* ClassifierAnnotator always calls the classifySequence method of its classifier
rather than classify on each instance passed in.
I have created a new feature extractor at
src/org/cleartk/classifier/feature/extractor/outcome/OutcomeFeatureExtractor.jav
a
which does nothing. I have also created a subclass called
DefaultOutcomeFeatureExtractor which creates what I think to be a reasonable
default
set of features from available previous outcomes (i.e. the previous outcome, the
outcome previous to that, n-gram features from previous outcomes - all
configurable).
Absent any configuration information, the DataWriter_ImplBase will use
DefaultOutcomeFeatureExtractor. If the handler does not want any additional
features, then it will have to specify to use OutcomeFeatureExtractor.
I have unit tested the changes to DataWriter_ImplBase and
DefaultOutcomeFeatureExtractor and none of the existing tests broke. I don't
know
how to test the changes to Classifier_ImplBase - suggestions are welcome. I
will
commit changes if I don't get any negative feedback.
Original comment by pvogren@gmail.com
on 2 Mar 2009 at 8:59
This sounds really complicated, and I don't fully understand what it is you've
done.
Can you try explaining it again? In particular:
* If I'm writing a new DataWriter, what do I need to do different?
* If I'm writing a new AnnotationHandler, what do I need to do different?
* Before, if you tried to call consume() instead of consumeAll() on a sequential
classifier, you'd get an exception explaining your mistake. Is this still true?
Also, if you haven't, talk to 3P about this because it seems like a bunch of
changes
to our encoder stuff (and I'm not sure he reads the issues).
Original comment by steven.b...@gmail.com
on 3 Mar 2009 at 1:34
Darn! I really simplified my approach and it really is pretty straightforward
- so
apologies for the obtuse description. Here's the basic assumptions - if you
call
Classifier_ImplBase.classifySequence or DataWriter_ImplBase.consumeAll, then we
are
going to assume that you are performing a sequential tagging task. If you are
calling classify and consume, then you are performing a non-sequential tagging
task.
Therefore, if you are an annotation handler performing a sequential tagging task,
then you should call consumeAll. This is the main driver for this change - I
was
previously calling either method in my sequential tagging annotation handlers
depending on what classifier I was using. This was creating unnecessarily
confusing
code.
The basic idea is this. If your annotation handler is calling consumeAll with a
nonsequential classifier, then you want something sensible done with previous
outcomes as features for the current instance. Instead of making each
annotation
handler worry about this, Classifier_ImplBase and DataWriter_ImplBase does this
for
you. If you don't want it to do anything for you, then you can tell it not to.
Otherwise, it will run a "previous outcome" feature extractor (which you
specify - or
use the default one) before passing it to the classify and consume methods.
Here is
what Classifier_ImplBase.consumeAll looks like now:
{{{
public List<INPUTOUTCOME_TYPE> consumeAll(List<Instance<INPUTOUTCOME_TYPE>>
instances) {
List<Object> outcomes = new ArrayList<Object>();
for(Instance<INPUTOUTCOME_TYPE> instance : instances) {
List<Feature> instanceFeatures = instance.getFeatures();
instanceFeatures.addAll(outcomeFeatureExtractor.extractFeatures(outcomes));
outcomes.add(instance.getOutcome());
consume(instance);
}
return null;
}
}}}
In general, this will have very little effect on new DataWriters. If, like
most of
our DataWriters, you don't override consumeAll, then the "previous outcome"
feature
extractor will run. If you decide to override this method (why?), then it
won't.
This does not effect sequential classifiers at all, so you will get the same
exception as you have always had.
Apologies again for my poor description. I think the code is actually pretty
simple
and there aren't that many new lines of code. It would be easy enough to rip
out if
I commit it and there are objections to it.
Original comment by pvogren@gmail.com
on 3 Mar 2009 at 5:28
How do you make sure that the DataWriter and the ClassifierAnnotator use the
same
OutcomeFeatureExtractor? Is it saved with the model?
Original comment by steven.b...@gmail.com
on 3 Mar 2009 at 6:22
yes. the outcome feature extractor is serialized by the data writer, the jar
builder
adds it to model.jar, and Classifier_ImplBase instantiates it from the jar
file.
Original comment by pvogren@gmail.com
on 4 Mar 2009 at 4:14
Hmm. That's a little weird - we don't serialize any other feature extractors,
only
feature encoders. If 3P's okay with it though, I'm okay with it.
Original comment by steven.b...@gmail.com
on 4 Mar 2009 at 5:06
The way I understood it when we talked about it: While the outcome feature
extractor does extract features from
the outcomes, it's not a feature extractor in the same sense as all the other
ones (i.e. it doesn't extract features
from a CAS, and it's not used from within the annotation handler). Since we
don't have a "FeatureExtractor"
interface (or base class) this relationship (or non-relationship) is not
captured in code.
That said, and given Steve's understandable confusion, I'd consider naming it
something other than a feature
extractor, to make that distinction clearer.
Original comment by phwetz...@gmail.com
on 4 Mar 2009 at 6:12
Yeah, I was definitely confused by the fact that it's called a
FeatureExtractor, and
all our previous references to the term FeatureExtractor meant extracting
features
from a JCas, while this means extracting features from a list of Outcomes.
Maybe we should distinguish JCasFeatureExtractors from
OutcomeListFeatureExtractors?
I also wonder about there being only a single OutcomeFeatureExtractor. I can
imagine
cases where I would want more, for example, in semantic role chunking, I might
want
both the last tag, e.g. I-ARG2, and the list of completed args, e.g. ARG0 and
ARG1.
This can be done with a single extractor if we also introduce something like
CombinedExtractor for OutcomeListFeatureExtractors. But maybe it makes more
sense to
support multiple OutcomeListFeatureExtractors from the start?
Original comment by steven.b...@gmail.com
on 4 Mar 2009 at 8:25
Ok. I just talked with Philipp about the name. One reason not to call it a
feature
extractor is because it widens our definition of a feature extractor to
anything that
creates features - rather than simply extracting data out of the CAS for
features
which is what all of the other feature extractors do. On the other hand, not
calling
it a feature extractor may cause just as much confusion because we have to come
with
a name that implies feature creation (maybe FeatureCreator?) without calling it
the
same thing as all the other things that create features. What do you think of
OutcomeFeatureCreator?
That's a nice example. It is easy enough to allow multiple
OutcomeFeatureCreators.
So, I agree.
How about I make this change and commit it? It doesn't really affect existing
code
very much and will be easy enough to rip out if we don't like it.
Original comment by pvogren@gmail.com
on 4 Mar 2009 at 9:10
I think OutcomeFeatureCreator isn't any better than OutcomeFeatureExtractor -
they
still sound too much the same. So I think leave it as OutcomeFeatureExtractor,
and we
just need to make sure to document this clearly.
After the change to allow for multiple OutcomeFeatureExtractors, I'm fine with
you
committing it.
Original comment by steven.b...@gmail.com
on 4 Mar 2009 at 9:22
ok - I committed code that handles the OutcomeFeatureExtractors. You can
specify as
many outcome feature extractors as you want which are initialized into an array
of
extractors. I added some unit tests which test things on the data writer side
of
things.
If we like this code, then I am inclined to close this issue and start a new
one for
doing viterbi search.
Original comment by pvogren@gmail.com
on 6 Mar 2009 at 5:27
I will open a separate issue for the viterbi search piece remaining with this
issue.
I am satisfied with the outcome feature extraction bit.
Original comment by pvogren@gmail.com
on 13 Mar 2009 at 8:10
Original issue reported on code.google.com by
pvogren@gmail.com
on 21 Feb 2009 at 12:36