It has occurred to me a number of times recently that the evaluation code
should sit in another module/project. It seems that most evaluation code we
create could be re-used for inter-annotator agreement and in that context there
probably isn't any ML involved. So, I would vote for cleartk-eval.
Is your first concern related to performance?
I agree that the version of the eval code shouldn't be the same as cleartk-ml.
That said, I actually have 2 implementations of the interfaces - 1 for
part-of-speech tagging (surprise!) and another for sentence segmentation. I
also have code that needs to be refactored for evaluating coordination
structures. So, I have a bit more confidence that the interfaces are usable as
they stand, and I do have a fair bit of code lying around that will be added soon.
I'm actually working on this now, sort of.
I agree that a FilesCorpusReader makes sense to have - but you might also
consider using XmiCorpusFactory which I think is a better approach.
Original comment by phi...@ogren.info
on 7 Feb 2011 at 9:44
My concern is partly performance and partly ease of implementing the API. I
feel like in most cases people have a single pipeline, and they just switch
their cleartk annotator to isTraining=true during training and their cleartk
annotator to isTraining=false during testing. I think we could accomplish this
by working with a single pipeline directly and having the user implement
methods like:
configureForDataWriting(AnalysisEngine pipeline);
I also feel like the EvaluationFactory is not an ideal API. For every
evaluation I can imagine, I just want to implement a method like:
public void evaluation(JCas goldView, JCas systemView);
and maybe write results to stdout and/or a directory in
collectionProcessComplete(). This should work exactly the same way for a cross
validation - the evaluation infrastructure should stream the correct CASes from
each fold through this one single annotator (not create the same evaluation
annotator many times).
Original comment by steven.b...@gmail.com
on 8 Feb 2011 at 7:23
For comparison, here's the code that I ended up writing for cross-validation,
which never re-creates an AnalysisEngine, and only reconfigures the
AnalysisEngines that have to be reconfigured.
public static void crossValidate(int totalFolds, EvaluationPipeline pipeline) throws Exception {
  // get the readers and engines
  CollectionReader reader = pipeline.getCollectionReader();
  AnalysisEngine[] trainEngines = pipeline.getTrainAnalysisEngines();
  AnalysisEngine[] testEngines = pipeline.getTestAnalysisEngines();
  // collect common metadata
  List<ResourceMetaData> metaData = new ArrayList<ResourceMetaData>();
  metaData.add(pipeline.getCollectionReader().getMetaData());
  for (Resource resource : pipeline.getTrainAnalysisEngines()) {
    metaData.add(resource.getMetaData());
  }
  for (Resource resource : pipeline.getTestAnalysisEngines()) {
    metaData.add(resource.getMetaData());
  }
  // repeat for each cross-validation fold...
  for (int fold = 0; fold < totalFolds; ++fold) {
    pipeline.configureForTraining(fold, totalFolds);
    // run through the training files
    CAS cas = CasCreationUtils.createCas(metaData);
    while (reader.hasNext()) {
      cas.reset();
      reader.getNext(cas);
      for (AnalysisEngine engine : trainEngines) {
        engine.process(cas);
      }
    }
    // close the training engines
    for (AnalysisEngine engine : trainEngines) {
      engine.collectionProcessComplete();
    }
    // train the model
    pipeline.train(fold, totalFolds);
    // configure for testing
    pipeline.configureForTesting(fold, totalFolds);
    // run through the testing files
    cas = CasCreationUtils.createCas(metaData);
    while (reader.hasNext()) {
      cas.reset();
      reader.getNext(cas);
      for (AnalysisEngine engine : testEngines) {
        engine.process(cas);
      }
    }
    // for testing, only a "batch" is done, not the whole collection
    for (AnalysisEngine engine : testEngines) {
      engine.batchProcessComplete();
    }
  }
  // close the test engines after all folds are complete
  for (AnalysisEngine engine : testEngines) {
    engine.collectionProcessComplete();
  }
}
public static interface EvaluationPipeline {
  public CollectionReader getCollectionReader();
  public AnalysisEngine[] getTrainAnalysisEngines();
  public AnalysisEngine[] getTestAnalysisEngines();
  public void configureForTraining(int fold, int totalFolds) throws ResourceConfigurationException;
  public void train(int fold, int totalFolds) throws Exception;
  public void configureForTesting(int fold, int totalFolds) throws ResourceConfigurationException;
}
As an example, here's what my implementation of configureForTesting looks like:
public void configureForTesting(int fold, int totalFolds) throws ResourceConfigurationException {
  // collect the test names for this fold
  List<String> foldTestNamesList = new ArrayList<String>();
  for (int i = 0; i < this.trainNames.size(); ++i) {
    if (i % totalFolds == fold) {
      foldTestNamesList.add(this.trainNames.get(i));
    }
  }
  // configure for testing
  reader.setConfigParameterValue(
      FilesCollectionReader.PARAM_FILE_NAMES,
      foldTestNamesList.toArray(new String[foldTestNamesList.size()]));
  reader.reconfigure();
  goldAnnotator.setConfigParameterValue(StoryEventGoldAnnotator.PARAM_VIEW_NAME, "GoldView");
  goldAnnotator.reconfigure();
  systemAnnotator.setConfigParameterValue(CleartkAnnotator.PARAM_IS_TRAINING, false);
  systemAnnotator.setConfigParameterValue(
      GenericJarClassifierFactory.PARAM_CLASSIFIER_JAR_PATH,
      new File(this.getFoldDirectory(fold, totalFolds), "model.jar").getPath());
  systemAnnotator.reconfigure();
}
Basically I make some setConfigParameterValue calls followed by reconfigure
calls for the annotators that need it. (If you look closely, you'll see I had
to add a parameter CleartkAnnotator.PARAM_IS_TRAINING, but I think that's
probably a useful parameter that we should add anyway.)
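[Editor's note: the fold-selection arithmetic used in configureForTesting above can be sketched independently of UIMA. This is a minimal illustration only; the class and method names below are hypothetical, not part of ClearTK.]

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the "i % totalFolds == fold" split used above.
public class FoldSplitter {

  // Names whose index falls in this fold are held out for testing.
  public static List<String> testNames(List<String> names, int fold, int totalFolds) {
    List<String> result = new ArrayList<String>();
    for (int i = 0; i < names.size(); ++i) {
      if (i % totalFolds == fold) {
        result.add(names.get(i));
      }
    }
    return result;
  }

  // All remaining names are used for training.
  public static List<String> trainNames(List<String> names, int fold, int totalFolds) {
    List<String> result = new ArrayList<String>();
    for (int i = 0; i < names.size(); ++i) {
      if (i % totalFolds != fold) {
        result.add(names.get(i));
      }
    }
    return result;
  }
}
```

Every name lands in exactly one test fold across the totalFolds runs, which is what makes the per-fold batches add up to one pass over the whole corpus.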
Original comment by steven.b...@gmail.com
on 8 Feb 2011 at 4:30
So, here are a few things to consider:
The difference between the pipeline that trains and the one that classifies
might be quite small as you mention. It might also be quite large. You might
have a lot of gold-standard data that you use for training (like tokens,
sentences, POS tags, dependencies, and semantic roles, for example). When
you classify you may not assume any of these things. For my coordination work
I had three different corpora that had gold-standard coordination structures
but differing amounts of other kinds of gold-standard data. One corpus had
nothing but coordination structures (i.e. no tokens, no sentences, no
part-of-speech tags.) And, of course, for the two corpora that had more data I
experimented with using an automatic tokenizer and gold-standard tokens.
Another difference between the two pipelines is that one operated on the gold
view while the other worked on the system view and began with a
ViewTextCopierAnnotator(sp). I also had some pre-computed part-of-speech tags
that were derived from "leave-one-file-out" tagger training across 36 files.
When I ran my training pipeline and I wanted to use these tags the pre-computed
tags were copied over the gold-standard tags for that run. When I ran the
classifying pipeline I would copy the pre-computed tags from the gold view to
the tokens in system view. So, while the two pipelines have a lot of
similarities and they used a lot of the same convenience methods I created for
setting them up - there were enough differences between the two that I was glad
they were created by separate methods and I would not like to see them put back
together.
Larry's group does not use views to separate gold annotations from system
annotations - they do this using the type system. They have a notion of an
annotation set. So, your proposed interface will not work for them. I like
the idea of having a default implementation that has something like this.
I do not like the idea of pushing all folds through the same annotator. With
the current interfaces it would be trivial to extend
org.cleartk.evaluation.Evaluation (see runCrossValidation) so that you could
run one fold at a time. This is nice for parallelization. It's great when you
can run an entire cross-validation experiment on a single machine in a single
run. But we should make it really easy to run each fold on 10 different
machines - and not force sequential processing where it is not required. Also,
it's nice to run a single fold - for whatever reason.
I also like having each fold have a complete set of evaluation results. I
suppose you can have collectionProcessComplete() called multiple times (?) - but
to me it feels right that a single fold is a complete pipeline in and of
itself.
Original comment by phi...@ogren.info
on 8 Feb 2011 at 4:56
So I'm not arguing that the interface I ended up with is the right one for
every task. What I'm saying is that the current evaluation interfaces make it
*impossible* to do what I've done. I think the evaluation interfaces should
make both my use case and your use cases possible. So the goal here should be
to develop some interface that meets both of our needs.
A few specific comments:
"Larry's group does not use views to separate gold annotations from system
annotations - they do this using the type system. They have a notion of an
annotation set. So, your proposed interface will not work for them" I think
that's not true. There's nothing about the EvaluationPipeline interface I
proposed in Comment 3 that would keep them from doing that. In fact, with the
EvaluationPipeline interface, there's no such thing as a special
EvaluationFactory interface - you just create a normal AnalysisEngine and it
does whatever you want. In my case, it compared two views, but in Larry's case,
it would compare the differently typed annotations.
"I do not like the idea of pushing all folds through the same annotator." I
like pushing them all through the same annotator because it simplifies the
evaluation code - it can all be in the same plain old AnalysisEngine. Our APIs
should support both use cases.
"I also like having each fold have a complete set of evaluation results." This
is absolutely possible (and I'm doing it right now) with my proposed API - each
fold is considered a "batch" and batchProcessComplete is called after each
fold. So for me, fold = batch and cv = collection, while for you fold =
collection and cv = multiple collections.
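[Editor's note: the "fold = batch" bookkeeping described here can be illustrated without any UIMA dependencies. In this hypothetical sketch, process() stands in for the evaluation annotator's per-CAS work, batchProcessComplete() closes a fold, and collectionProcessComplete() reports the overall result.]

```java
// Hypothetical evaluator illustrating "fold = batch, cross-validation = collection".
public class FoldAwareEvaluator {
  private int batchCorrect = 0, batchTotal = 0;
  private int totalCorrect = 0, totalTotal = 0;

  // Called once per CAS: record whether the system agreed with the gold standard.
  public void process(boolean correct) {
    batchTotal++;
    totalTotal++;
    if (correct) {
      batchCorrect++;
      totalCorrect++;
    }
  }

  // Called at the end of each fold: report fold accuracy and reset the batch counts.
  public double batchProcessComplete() {
    double accuracy = batchTotal == 0 ? 0.0 : (double) batchCorrect / batchTotal;
    batchCorrect = 0;
    batchTotal = 0;
    return accuracy;
  }

  // Called once after all folds: report accuracy over the whole cross-validation.
  public double collectionProcessComplete() {
    return totalTotal == 0 ? 0.0 : (double) totalCorrect / totalTotal;
  }
}
```

A single instance thus yields both per-fold results and one final number, without ever being re-created between folds.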
Original comment by steven.b...@gmail.com
on 8 Feb 2011 at 5:44
Somehow I didn't see the code in comment 3 when I posted comment 4 earlier.
The main, if not only, advantage I see is that when you run
collectionProcessComplete you can write out all of the evaluation results that
you've accumulated for all n folds rather than having to "aggregate" them (i.e.
read results from individual folds from respective directories) with
EvaluationFactory.aggregateEvaluationResults(). I agree that this is nice when
everything can be run sequentially on a single box. But if you get five folds
in and it blows up - can you recover the evaluation results that you need from
the four folds that completed successfully? It doesn't seem very natural.
With my solution, the evaluation engine writes out everything it needs to so
that the results can be aggregated later. I suppose I could still do this.
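[Editor's note: the write-out-then-aggregate approach can be sketched with plain file I/O. The class name and file layout below are hypothetical, assuming each fold records simple correct/total counts in its own file so that completed folds survive a crash.]

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical sketch: each fold writes "correct total" to its own file, so a
// crash after fold five still leaves the results of folds one through four recoverable.
public class FoldResults {

  public static void writeFold(File dir, int fold, int correct, int total) {
    try {
      File file = new File(dir, "fold-" + fold + ".txt");
      Files.write(file.toPath(), (correct + " " + total).getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  // Aggregate by summing the counts across all fold files found in the directory.
  public static double aggregate(File dir) {
    int correct = 0, total = 0;
    try {
      for (File file : dir.listFiles()) {
        if (file.getName().startsWith("fold-")) {
          String[] counts = new String(
              Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8).split(" ");
          correct += Integer.parseInt(counts[0]);
          total += Integer.parseInt(counts[1]);
        }
      }
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
    return total == 0 ? 0.0 : (double) correct / total;
  }
}
```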
I'm not crazy about having the collection reader coupled with the evaluation
pipeline. I generally pre-process my corpus so that I have XMI files.
However, for one corpus I didn't (for some very specific reason I can't
remember now) and it requires special reading and pre-processing. Do I have to
have a new implementation of EvaluationPipeline for this other corpus? It
seems so.
Also, I like the simplicity of Evaluation.buildCorpusModel(). I think that
there should be an obvious way to build *the* model for a corpus.
Similarly, I would like to keep the datawriting or classifying analysis
engine(s) decoupled from the evaluation. Suppose you finish evaluation and you
want to build *the* model and then tag a whole bunch of unlabeled text. With
the current interface you just pass in a new collection reader and get an
aggregate from EngineFactory.createClassifierAggregate() and go ahead and tag a
bunch of text without worrying about an evalation AE blowing up because
gold-standard data isn't available. This may seem out-of-scope for the project
- but it seems so easy to do here since all the pieces are available. But why
rebuild the part of the pipeline generated by your getTestAnalysisEngines()
method somewhere else?
My feedback probably all sounds negative and/or defensive. I actually get what
you are doing now and somewhat like it (the code helps!) I agree that we could
just as easily have e.g. EngineFactory return an AnalysisEngine instead of an
AnalysisEngineDescription. I'm not clear why it needs to be an array if we
have aggregates.
I think the main differences between our approaches are:
- My interface returns an AnalysisEngineDescription and yours returns an
AnalysisEngine[]. I am fine with the latter but would prefer it not be an
array, since aggregates are so easy to build and this simplifies the code of the
caller.
- My interface decouples the collection reader, the training/classifying
aggregates, and the evaluation AEs from one another. Your interface puts
them all together. I am not very happy about this.
- Your interface makes intuitive use of batchProcessComplete and
collectionProcessComplete. I like this a lot - esp. for aggregating the
results of the evaluation AEs. I like my approach too - but I think this is a
point where both approaches ought to be accommodated.
- My approach (see Evaluation) imposes/assumes some structure of the folders
that are produced during cross-validation. In particular, it assumes folders
at all (see nearly every interface method signature, except, ironically
CorpusFactory.) Your approach does not. My solution could be abstracted
better to avoid assuming directories for everything. It makes just as much
sense to have a scenario in which all evaluation for every fold and the
aggregate results are all written to a single file. Your approach facilitates
this.
Hopefully, this is a fairly comprehensive listing of the issues.
Original comment by phi...@ogren.info
on 8 Feb 2011 at 7:32
I'm not committed to AnalysisEngine[] vs. AnalysisEngine. I just used
AnalysisEngine[] because that's what made the user's code the simplest. It's
not much overhead for the user to wrap that in an aggregate.
I have no problem splitting the CollectionReader stuff from the AnalysisEngine
stuff. In my case, I'll probably just have one class implement the two
interfaces, but that's easy enough. I was considering factoring that out
because it would be natural to have an implementation based on a
FilesCollectionReader that does all the file name splitting my code above did.
I also think the core interfaces should not assume a directory. That doesn't
mean the various annotators shouldn't use one - e.g. it would be trivial to add
a PARAM_OUTPUT_DIRECTORY to my SpanEvaluationAnnotator so that it writes
results for each fold (and for the overall evaluation) to the directory. But
the basic evaluation interfaces should not force a directory upon you.
Other specific comments:
"But if you get five folds in and it blows up - can you recover the evaluation
results" - sure, in the same way you do now - add a configuration parameter
that gives you an output directory, and write intermediate results there.
"I like the simplicity of Evaluation.buildCorpusModel()" - In general, I think
we should be trying to make user code simpler, and not worry too much how
complicated our code looks. That said, the EvaluationPipeline equivalent would
look like:
public static void train(EvaluationPipeline pipeline) throws Exception {
  CollectionReader reader = pipeline.getCollectionReader();
  AnalysisEngine[] engines = pipeline.getTrainAnalysisEngines();
  pipeline.configureForTraining();
  SimplePipeline.runPipeline(reader, engines);
  pipeline.train();
}
So that's certainly no more complicated than Evaluation.buildCorpusModel().
"I would like to keep the datawriting or classifying analysis engine(s)
decoupled from the evaluation" - I would say that the datawriting and
classifying analysis engines are 100% decoupled from the evaluation with the
EvaluationPipeline interface - there is *no* assumption that you'll actually
run an evaluation annotator, and no API required for such an annotator. You can
do whatever you want in the evaluation annotator, including not adding one at
all.
"Suppose you finish evaluation and you want to build *the* model and then tag a
whole bunch of unlabeled text... This may seem out-of-scope for the project"
Yes, I think this is out of scope for the evaluation project. How hard is it to
create a separate aggregate for tagging new text when we're already creating a
separate aggregate for training and testing?
Original comment by steven.b...@gmail.com
on 8 Feb 2011 at 11:38
I spent some time trying to refactor my solution to accommodate your use case.
Attached is a diff file for the cleartk-ml project. The javadocs are out of
date here, so don't squint too hard at them. There's a lot to say
point-by-point about what I did here - but I don't have time to comment
extensively just now.
Original comment by phi...@ogren.info
on 9 Feb 2011 at 12:10
Attachments:
Ok, I think we're converging. ;-) I refactored my solution to look a bit more
like yours. Here are the interfaces I ended up with (an Evaluation class and a
sample FilesReaderFactory implemented on top of these are attached):
public interface ReaderFactory {
  public CollectionReader getTrainFoldReader(int fold, int totalFolds) throws UIMAException;
  public CollectionReader getTestFoldReader(int fold, int totalFolds) throws UIMAException;
  public CollectionReader getTrainReader() throws UIMAException;
  public CollectionReader getTestReader() throws UIMAException;
  public CollectionReader getReader() throws UIMAException;
}

public interface EngineFactory {
  public AnalysisEngine[] getTrainFoldDataWriterEngines(int fold, int totalFolds) throws UIMAException;
  public AnalysisEngine[] getTestFoldClassifierEngines(int fold, int totalFolds) throws UIMAException;
  public AnalysisEngine[] getTrainDataWriterEngines() throws UIMAException;
  public AnalysisEngine[] getTestClassifierEngines() throws UIMAException;
  public AnalysisEngine[] getDataWriterEngines() throws UIMAException;
  public void trainFold(int fold, int totalFolds) throws Exception;
  public void train() throws Exception;
}
So what are the remaining differences?
(1) I ended up still returning AnalysisEngine[] instead of a single
AnalysisEngine because as far as I can tell, there's no way with UimaFIT to
create an aggregate from an AnalysisEngine[], only from an
AnalysisEngineDescription[], and for my use case it has to be AnalysisEngines.
I don't know enough about aggregate AnalysisEngines to know whether this is
just not possible in UIMA or if we're just missing the convenience method in
UimaFIT.
(2) I don't have any use for EvaluationFactory, so my Evaluation methods don't
take one. This is easily solved - we should just have two versions of each
method, one that takes an EvaluationFactory and one that doesn't.
(3) My EngineFactory doesn't assume that you'll be writing to a directory. So
the EngineFactory methods parallel the ReaderFactory methods, e.g.
getTrainFoldDataWriterEngines(int, int) parallels getTrainFoldReader(int, int).
In your solution, EngineFactory.createDataWritingAggregate and
EngineFactory.createClassifierAggregate both take a File parameter. (Note also
that these two methods are broken for my use case because they return
AnalysisEngineDescriptions, not AnalysisEngines.)
(4) I don't have a getPreprocessor() method. I don't really understand the
point of this method - if you have a preprocessor, why aren't you just
including that in the aggregates created by createDataWritingAggregate() and
createClassifierAggregate()?
(5) My ReaderFactory doesn't assume that there's a single set number of
cross-validation folds for each corpus. So instead of making a corpus define
CorpusFactory.numberOfFolds(), I just pass the total number of folds to all the
get*FoldReader methods. This allows the user to decide how many folds they want
at the time that they call Evaluation.crossValidate instead of hard-coding it
in the CorpusFactory.
(6) In Evaluation.crossValidate, I still have 1 fold = 1 batch and all folds =
1 collection, while you still have 1 fold = 1 collection and all folds = many
collections. This is the one issue I'm not sure how to resolve. Perhaps we
should add two versions of the crossValidate method, one for each
interpretation.
(7) Our naming schemes are a bit different. These are all minor points compared
with the above, but I'll mention them here anyway. I like ReaderFactory better
than CorpusFactory because it's returning CollectionReader objects, not Corpus
objects, and I like having "Fold" or something like that in the methods like
getTrainFoldReader or getTestFoldClassifierEngines because it makes it clearer
to me that these methods are used for cross validations.
Original comment by steven.b...@gmail.com
on 9 Feb 2011 at 11:19
Attachments:
(1) I am fine with returning an array.
(2) Fine by me. I like the evaluation factory. I think the evaluation
pipeline is decidedly different and separate from the training/classifying
pipeline. My evaluation pipeline for my coordination task has 3 AEs in it. I
also have two approaches for coordination that have a completely different set
of AEs. So, I do not want to deal with cross-product issues either.
(3) No. The EngineFactory was refactored in the patch I attached to comment 8.
The methods do not return AEDs and they do not take a directory.
(4) Preprocessing is specific to the corpus you are dealing with. If you are
using XMI files, then you typically don't need this method because you have
done all your preprocessing ahead of time. However, if you are dealing with
some native file format for some annotated corpus, you will likely want to run
some preprocessing annotators that, for example, parse XML (or PTB, etc.) into
the annotations that are needed downstream. Corpora that ultimately provide
the exact same kinds of annotations can come in many formats....
(5) I knew we would disagree about this one. I think that the number of folds
should be determined by the corpus factory regardless of whether this is
decided at runtime (this could be accomplished by an initialization parameter)
or hard-wired in (for corpora where the folds are fixed.) Not every corpus is
easily split into an arbitrary n number of folds. I don't understand why
EngineFactory needs to know how many folds there are.
(6) In the patch I added to comment 8 I created a "complete" method for a given
fold which is implemented by the EngineFactory. In your solution this method
would call batchProcessComplete there. There is also an aggregateResults
method where you would call collectionProcessComplete. I haven't decided
exactly how I will use these two methods for the case where you write out
evaluation data to be read in later for aggregating. I will say that after
scrutinizing your crossValidate method in comment three - I do not think your
use of batchProcessComplete and collectionProcessComplete is very consistent or
intuitive. Why should a user call collectionProcessComplete for the training
engines for each fold but call batchProcessComplete for the testing engines for
the same folds? This is very confusing I think. This is another reason to
abstract away these calls and put them in the factory as I have done here.
(7) I like "corpus" because it communicates a specific set of data that
generally has a name and specific properties. How about CorpusReaderFactory?
I wasn't consistent with my naming - I used runName and foldName where I meant
to use runName consistently. I'm ambivalent about this. I am confused about
what the name should be because there is no "fold" per se when you run holdout
evaluation.
I am not crazy about "EngineFactory" - I would be happy to hear other
suggestions.
Original comment by phi...@ogren.info
on 9 Feb 2011 at 4:12
(3) Yep, I misread the diff. They don't take a directory any more - they take a
name string. The assumption is that the AnalysisEngine pipeline should never
need to know whether it's being used for regular training or for cross
validation training. I was trying not to make any assumption like that, but I
guess it's not a crazy assumption to make. So I guess I'm okay with this. (Note
that this answers part of your (5) question, why EngineFactory took the fold
number.)
(4) I'm not really sold on the 1 corpus = 1 preprocessor. How I use the corpus
depends on the preprocessing - if I want to use Penn Treebank to train a part
of speech tagger, I may not want to load a bunch of trees. I think the
preprocessing is pipeline dependent, not corpus dependent. But I guess it
doesn't really hurt as long as it gets some better documentation explaining
when you would put something with the corpus reader factory and when you would
put something with the EngineFactory.
(5) But why not just throw an exception if you get a number of folds that is
invalid for your corpus? You have to do that somewhere anyway. I guess I've
just never worked with a corpus before that had set cross-validation folds, so
conceptually it seems like the wrong place to put that.
(6a) "I created a "complete" method for a given fold which is implemented by
the EngineFactory". You mean EvaluationFactory, right? So for my use case, how
does the EvaluationFactory get access to the test engine to call
batchProcessComplete in "complete" and collectionProcessComplete in
"aggregateResults"?
(6b) "Why should a user call collectionProcessComplete for the training engines
for each fold but call batchProcessComplete for the testing engines for the
same folds?" Because you want to train N classifiers but you only want 1 final
evaluation number. So there should be N collectionProcessComplete calls for
training, and 1 collectionProcessComplete call for testing. But we don't have
to agree on this as long as we support both approaches.
(7)
"CorpusReaderFactory" - better for sure, though not amazing. ;-)
"runName" - I think you should just call it "name" or "id" - it's just an
arbitrary identifier made up by Evaluation, so reading anything more into it
would be a mistake.
"EngineFactory" - yeah, I didn't like this either because it has a "train"
method.
Original comment by steven.b...@gmail.com
on 9 Feb 2011 at 5:33
(4) You can imagine having a PTBCorpusReader_ImplBase from which two
implementations derive, each with a different preprocess() implementation.
(5) My preference is to split my corpus up into fixed folds for maximum
repeatability. So, that's what I've always done. It seems that some
FileCorpusReader_ImplBase could be initialized with the number of folds at
runtime and you could push the logic you had in configureForTesting in comment
3 into that class.
(6a) My mistake. Yes - the evaluation factory.
(6b) This is confusing.
(7) Yes - 'name' is better. One candidate might be CleartkAnnotatorFactory -
it seems a bit narrow for what it does (considering it has a train and could
have any number of preprocessing AEs for the target CleartkAnnotator) - but it
communicates better than EngineFactory. Open for suggestions.
Are we still disagreeing on anything? I think I will commit my diff since it
seems to be closer to a solution than the current code and we can go from
there.
Original comment by phi...@ogren.info
on 9 Feb 2011 at 6:17
Yes, please commit your current diff. And if it's not too much work, move
things to cleartk-eval with maybe version 0.7.0 or something like that? I think
we're converging to a good solution here, but we should definitely give it a
bit of time and experience before we mark it 0.9.9.
For (5), how about we have a getNumberOfFolds and setNumberOfFolds? That would
make it feel less like there was a single "correct" number of folds for a given
corpus, and you could put in setNumberOfFolds any code you need for checking
that you've received a valid number of folds.
For (6), I'm still not sure how my EvaluationFactory is supposed to get to the
AnalysisEngines to call collectionProcessComplete and/or batchProcessComplete.
This is the one serious problem remaining for me.
For (7), I'm not convinced yet that CleartkAnnotatorFactory is better than
EngineFactory, so I guess just leave it for the moment. I'll try to brainstorm
for better names.
Original comment by steven.b...@gmail.com
on 9 Feb 2011 at 9:18
(repeated from email I meant to post as a comment)
I committed the code earlier today.
I will also create the new project and move it over.
(5) sounds good.
(6) look at how Evaluation.java calls EvaluationFactory.writeResults() and
EvaluationFactory.aggregateResults(). (I just renamed writeResults in a commit
from a couple minutes ago.)
(7) I renamed EngineFactory to CleartkAnnotatorFactory before I committed the
code. I will leave it that way until we come up with a better name. Given
that everything called "Factory" has no "create" methods anymore, we should stop
calling them factories. I described it to Chris and he suggested "Cache".
Maybe "Provider" would work. I sort of like that better. Here's a straw man:
CorpusReaderProvider
CleartkPipelineProvider
EvaluationProvider
Original comment by phi...@ogren.info
on 9 Feb 2011 at 10:28
(5) Since I have a corpus of tutorial dialogue transcripts, I was thinking that
instead of n-fold cross-validation, I would be doing leave-one-student-out
cross-validation. How does that play with this situation? Is it the same as
when I have pre-specified folds?
Original comment by lee.becker
on 9 Feb 2011 at 10:33
@Philip:
(6) I still don't understand how I can implement
EvaluationFactory.writeResults() so that it calls batchProcessComplete() - the
EvaluationFactory doesn't have any access to the pipeline, only the
EngineFactory has access to the pipeline. Or are you suggesting that I write my
own Evaluation.java that calls batchProcessComplete() instead of
EvaluationFactory.writeResults() and collectionProcessComplete() instead of
EvaluationFactory.aggregateResults()? (Hopefully not - I think we should
arrange the API so that Evaluation.java works for all our use cases.)
(7) +0.5 on all the Provider names - definitely the best I've seen from what we
had so far.
@Lee
(5) If you know you'll only ever do leave-one-out cross-validation on that
corpus, then define getNumberOfFolds() to return the number of transcripts and
setNumberOfFolds() to throw UnsupportedOperationException. If you think you
might do other types of cross-validation, but you want to make it easy to do
leave-one-out cross-validation when you want to, then you might add to that
specific CorpusReaderProvider a method like getNumberOfTranscripts() which you
can then pass to setNumberOfFolds().
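[Editor's note: a minimal sketch of that suggestion, with hypothetical class names and no UIMA dependencies - the fold count is derived from the data, and callers may not override it.]

```java
import java.util.List;

// Hypothetical provider for a corpus with fixed leave-one-out folds.
public class LeaveOneOutCorpusProvider {
  private final List<String> transcriptNames;

  public LeaveOneOutCorpusProvider(List<String> transcriptNames) {
    this.transcriptNames = transcriptNames;
  }

  // One fold per transcript.
  public int getNumberOfFolds() {
    return transcriptNames.size();
  }

  // This corpus only supports leave-one-out, so reject any other fold count.
  public void setNumberOfFolds(int numberOfFolds) {
    if (numberOfFolds != transcriptNames.size()) {
      throw new UnsupportedOperationException(
          "this corpus only supports leave-one-out cross-validation");
    }
  }
}
```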
Original comment by steven.b...@gmail.com
on 9 Feb 2011 at 11:05
As long as you know how many "instances" you have, it should be easy to
specify the number of folds for leave-one-out. Still, it seems that there
should be an ImplBase that does this for you. This should be possible with the
current interface.
Original comment by phi...@ogren.info
on 9 Feb 2011 at 11:07
(5) I should clarify. I am doing dialog move classification on the turns
within a transcript. Also a given student can have multiple transcripts,
consequently my folds will be designed to ensure I'm not training and testing
on the same student's utterances. In my case, leave-one-out is not
leave-one-instance-out but rather leaves all of a student's utterance instances out.
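[Editor's note: the grouping Lee describes can be sketched as follows. The names are hypothetical, assuming each transcript maps to a student identifier; each fold then holds out all transcripts of one student.]

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: group transcripts by student so that no student's
// utterances appear on both the training and the testing side of a fold.
public class StudentFolds {

  // Fold i holds out all transcripts of the i-th distinct student.
  public static List<List<String>> byStudent(Map<String, String> transcriptToStudent) {
    Map<String, List<String>> studentToTranscripts = new LinkedHashMap<String, List<String>>();
    for (Map.Entry<String, String> entry : transcriptToStudent.entrySet()) {
      List<String> transcripts = studentToTranscripts.get(entry.getValue());
      if (transcripts == null) {
        transcripts = new ArrayList<String>();
        studentToTranscripts.put(entry.getValue(), transcripts);
      }
      transcripts.add(entry.getKey());
    }
    return new ArrayList<List<String>>(studentToTranscripts.values());
  }
}
```

The number of folds then falls out of the data (one per student), which fits the fixed-folds getNumberOfFolds() view discussed above.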
Original comment by lee.becker
on 9 Feb 2011 at 11:14
@Steve:
(6) right. I forgot that you are going to put your evaluation AEs in the list
of analysis engines returned by (current working name) CleartkPipelineProvider.
This would work if you would be willing to implement an (current working name)
EvaluationPipelineProvider. The idea is that your implementation of
EvaluationPipelineProvider would call batchProcessComplete in writeResults on
those analysis engines that need to write evaluation data.
Let me fix up the code that I'm working on assuming you would be willing to do
this so that we have a version we can look at with the current names and some
of the additional changes above. If you can't warm up to the notion of using
the EvaluationPipelineProvider, then we will have to figure out how to give you
the hooks that you need.
Original comment by phi...@ogren.info
on 9 Feb 2011 at 11:16
Yep, fix up the code and I'll take a look at it tomorrow.
I was already assuming I would implement an EvaluationPipelineProvider to call
batchProcessComplete and collectionProcessComplete (which I'm fine with
implementing) but I'm still not clear on how that EvaluationPipelineProvider
would get access to the pipeline that was being finished because writeResults
takes only a String parameter, not an AnalysisEngine or AnalysisEngine[]
parameter.
I'm pretty sure that even for the other use cases, your code isn't calling
collectionProcessComplete() for the "classifierAggregate". (It is for the
"dataWritingAggregate" implicitly through SimplePipeline.runPipeline though.)
At first I thought that was because you intended to call
collectionProcessComplete() from evaluationFactory.writeResults() but that's
not possible with the current API.
(Also, we probably shouldn't use JCasIterable or SimplePipeline.runPipeline in
Evaluation.java - these are convenience methods that swallow some UIMA
exceptions and turn them into RuntimeExceptions, which is fine for a simple
main method, but probably shouldn't be used in other code. We could fix
SimplePipeline.runPipeline to not use JCasIterable, and perhaps we should do
that, but there's no real way to fix JCasIterable, so we should just avoid
using it for anything but simple main methods.)
Original comment by steven.b...@gmail.com
on 9 Feb 2011 at 11:29
I have committed all the changes that I have for now. Please scrutinize.
I think I am both confused and confusing. Because EvaluationPipelineProvider
produces the list of analysis engines for a given name, it can call
batchProcessComplete (as it sees fit) on those same analysis engines whenever
it is handed that name again. However, callers of aggregateResults will have to be aware
that the EvaluationPipelineProvider will aggregate results for whatever
pipelines it has served up - i.e. you wouldn't want to run the holdout
evaluation and then cross validation because then the aggregation will include
evaluation data from both. Maybe we should have a reset() method of some sort?
I cleaned up the code a bit so that it isn't calling SimplePipeline any more
and it is clear where I am calling collectionProcessComplete and where I'm not.
That said, the code is wrong and I am not sure what to do. I rather think
that we shouldn't be calling collectionProcessComplete on any of the analysis
engines and that we should send this back to the providers to do as they see
fit. So, this makes me think that CleartkPipelineProvider needs methods
analogous to writeResults() and aggregateResults(). It would be nice if the
methods were the same for both. Maybe pipelineComplete(name) and
evaluationComplete()?
Original comment by phi...@ogren.info
on 10 Feb 2011 at 1:06
I just noticed a potential hole in our CleartkPipelineProvider. The train
method accepts a name and a single set of training arguments. This assumes
that there is a single model being built, or that all the models being built
will take the same training arguments. This is fine in many scenarios (and
I've found myself in both of the situations just described). But I can imagine that for Lee's use case, he is
going to want to train different models using different learners each with
different training arguments to evaluate his system. This is true not just for
CleartkMultiAnnotator but also for cases where a pipeline has multiple
CleartkAnnotators. I suppose since he will be implementing the train method
for his own evaluation infrastructure, he can specify the strings to be whatever
he wants them to be - perhaps using Args4j to parse the arguments.
Original comment by phi...@ogren.info
on 10 Feb 2011 at 1:08
While I would love the flexibility to tweak each classifier, it doesn't seem
practical to be tweaking the learner parameters for 30+ different classifiers.
As it stands now, CleartkMultiAnnotator only takes one learner, and we would
need a more sophisticated mechanism to pair each classifier with a different
learning algorithm.
FWIW, I will probably focus more on experiments (in the near term at least)
where I turn sets of features off globally for all of my classifiers.
Original comment by lee.becker
on 10 Feb 2011 at 1:19
Ok, I see how it works now. EvaluationPipelineProvider.getEvaluationPipeline
doesn't provide the whole testing + results pipeline, just the evaluation bit
at the end.
I tried to update my code to use the new interfaces. Some problems I ran into:
= get*Reader call order =
Evaluation is currently broken for my use case because it calls getTrainReader,
then getTestReader, and then runs pipelines. But since my train and test
readers are the same reader, just reconfigured, the train pipeline actually
gets the test reader files, and the test pipeline gets no CASes at all because
they were used up during training. The rule should be "never call get*Reader
until the moment you're actually going to use it". Basically, we need to get
rid of Evaluation.run, which violates this assumption.
= Fold Numbering =
runCrossValidation assumes that folds are numbered starting with 1. This seems
inconsistent with standard Java conventions. Can we have 0 based indexing here?
= Declared Exceptions =
We need to think about what exceptions all the methods should throw. For
example, for my use case, I'm typically calling
ConfigurableResource.reconfigure() in all the CorpusReaderPipeline methods and
the CleartkPipelineProvider.get*Pipeline methods, and that throws a
ResourceConfigurationException, not a ResourceInitializationException. As
another example, in EvaluationPipelineProvider.writeResults, I'm calling
batchProcessComplete, and that throws an AnalysisEngineProcessException. Perhaps
all the methods should be throwing UIMAException? The one exception would be
org.cleartk.eval.provider.CleartkPipelineProvider.train which should throw
Exception to match JarClassifierBuilder.trainClassifier.
= Evaluation.buildCorpusModel =
I think Evaluation.buildCorpusModel still has the wrong signature. Why is there
a File there? I think it should just be a String id (a.k.a. runName) like
everywhere else. Also, Evaluation.buildCorpusModel shouldn't be calling
dataWritingPipeline.add - we shouldn't be modifying the list returned by
getTrainingPipeline.
= JarClassifierBuilder.trainAndPackage =
I don't think Evaluation.train should delegate to Train.main. Instead, it
should just use the following code:
JarClassifierBuilder<?> classifierBuilder = JarClassifierBuilder.fromTrainingDirectory(modelDirectory);
classifierBuilder.trainClassifier(modelDirectory, trainingArguments);
classifierBuilder.packageClassifier(modelDirectory);
Perhaps instead of having Evaluation.train, we should add a static method to
JarClassifierBuilder? I mean, if we're going to assume that it's a
JarClassifierBuilder directory (as Train.main does), then JarClassifierBuilder
is probably the right place for that method.
= collectionProcessComplete =
"CleartkPipelineProvider needs methods analogous to writeResults() and
aggregateResults() ... Maybe pipelineComplete(name) and evaluationComplete()?"
Yes, this sounds good to me, but in your use case you don't keep a reference to
the training and evaluation pipelines around, so you won't be able to call
collectionProcessComplete from just a String identifier. So I think the
signatures need to be:
void trainingPipelineComplete(String, List<AnalysisEngine>)
void classifyingPipelineComplete(String, List<AnalysisEngine>)
void evaluationPipelineComplete(String, List<AnalysisEngine>)
These methods would be called every time you go through a pipeline, e.g. once
for a train/test run, and once for each fold for a cross-validation run. We'd
also need the methods:
void trainingComplete()
void classifyingComplete()
void evaluationComplete()
These methods would be called exactly once, at the end of runCrossValidation(),
runHoldoutEvaluation() and buildCorpusModel(). I think that should work for
both of our use cases:
* For training and classifying, I would call collectionProcessComplete() in
trainingPipelineComplete() and classifyingPipelineComplete(). For evaluation, I
would call batchProcessComplete() in evaluationPipelineComplete() and
collectionProcessComplete() in evaluationComplete().
* You would call collectionProcessComplete() in all the xxxPipelineComplete()
methods, and do whatever you used to do in aggregateResults() in
evaluationComplete().
Attached is a patch which makes all the changes proposed above.
Original comment by steven.b...@gmail.com
on 10 Feb 2011 at 12:22
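The completion hooks proposed in this comment could be sketched as below. "Engine" is a stand-in for UIMA's AnalysisEngine so the example is self-contained; the hook names come from the discussion, while the class names are invented for illustration:

```java
import java.util.List;

// Stand-in for UIMA's AnalysisEngine, to keep the sketch self-contained.
interface Engine {
    void collectionProcessComplete();
}

// Hypothetical shape of the proposed lifecycle hooks (evaluation side only).
interface EvaluationHooks {
    // Called once per pipeline run, e.g. once per cross-validation fold.
    void evaluationPipelineComplete(String name, List<Engine> engines);
    // Called exactly once, at the end of the whole experiment.
    void evaluationComplete();
}

// One way the per-fold/final split might be implemented: flush per-fold
// batch results, then aggregate once at the very end. The log field just
// records the call order for demonstration.
class BatchingEvaluationHooks implements EvaluationHooks {
    final StringBuilder log = new StringBuilder();

    public void evaluationPipelineComplete(String name, List<Engine> engines) {
        // Real code would call batchProcessComplete() on each engine here.
        log.append("batch:").append(name).append(';');
    }

    public void evaluationComplete() {
        // Real code would call collectionProcessComplete() and aggregate
        // the per-fold results here.
        log.append("aggregate");
    }
}
```

The key property is that the per-fold hook receives the actual engine list, so implementations don't need to keep their own references keyed by run name.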
= get*Reader call order =
My bad. Looks better now.
= Fold Numbering =
true. I wanted to have the folder name of the first fold to be "fold-1"
instead of "fold-0". This can easily be done in createFoldName (which has a
minor bug related to this issue - either the digit that is used for the file
name needs to be incremented or the total folds needs to be decremented.) I
will fix the method. Btw - I asked you for that string formatting trickery a
few weeks ago and you didn't know what I was talking about.... ;)
= Declared Exceptions =
sounds good.
= Evaluation.buildCorpusModel =
I forgot to revisit this method. looks better now.
I agree with the other points you make above.
Hopefully, we haven't worked too hard to accommodate our two use cases. The
code doesn't look all that complicated - so hopefully the abstraction is good.
I will have a better feeling about it after I port my evaluation code to this.
We definitely need to document this code extensively before closing this issue.
I am refactoring XmiCorpusFactory so that it doesn't assume fixed folds. I
will commit this today.
Original comment by phi...@ogren.info
on 10 Feb 2011 at 6:22
Yep, let me know how your port goes. If everything works out with your port,
we'll go for a documentation spree on these interfaces.
"createFoldName (which has a minor bug related to this issue - either the digit
that is used for the file name needs to be incremented or the total folds needs
to be decremented.)" Are you sure? It looks right to me:
scala> import org.cleartk.eval.Evaluation.createFoldName
import org.cleartk.eval.Evaluation.createFoldName
scala> for (i <- 0 until 9) yield createFoldName(i, 9)
res1: scala.collection.immutable.IndexedSeq[java.lang.String] = Vector(0, 1, 2,
3, 4, 5, 6, 7, 8)
scala> for (i <- 0 until 10) yield createFoldName(i, 10)
res2: scala.collection.immutable.IndexedSeq[java.lang.String] = Vector(0, 1, 2,
3, 4, 5, 6, 7, 8, 9)
scala> for (i <- 0 until 11) yield createFoldName(i, 11)
res3: scala.collection.immutable.IndexedSeq[java.lang.String] = Vector(00, 01,
02, 03, 04, 05, 06, 07, 08, 09, 10)
But yes, I'm fine with you modifying it so that it numbers from 1 instead of 0
when it creates the name.
"I asked you for that string formatting trickery a few weeks ago and you
didn't know what I was talking about". Actually you asked me about "creating a
string of n repeating characters" which is different from asking to fill in
zeros before an integer. There's a special string format operation for
prefixing zeros (e.g. %04d means 4 digits, filling with zeros as necessary)
while there's no special string format operation for the general case.
Original comment by steven.b...@gmail.com
on 10 Feb 2011 at 6:48
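The zero-prefix formatting described in the preceding comment might look like this; it is a hypothetical reconstruction of createFoldName, not the actual ClearTK source, and it folds in the later suggestion of rejecting fewer than two folds:

```java
// Hypothetical reconstruction (not the real ClearTK source): pad the fold
// index with leading zeros so all fold names have the same width.
class FoldNames {
    static String createFoldName(int fold, int totalFolds) {
        // As suggested later in the thread: fewer than 2 folds is a caller error.
        if (totalFolds < 2) {
            throw new IllegalArgumentException(
                "cross-validation requires at least 2 folds, got " + totalFolds);
        }
        // Width = number of digits in the largest fold index (totalFolds - 1).
        int digits = String.valueOf(totalFolds - 1).length();
        // e.g. digits == 2 gives the format "%02d", so fold 3 becomes "03".
        return String.format("%0" + digits + "d", fold);
    }
}
```

With 11 total folds this yields "00" through "10", matching the Scala transcript above. A digit count of zero would produce the malformed format "%00d", whose repeated '0' flag raises java.util.DuplicateFormatFlagsException - which would explain the error in the next comment.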
When I run EvaluationTest I get the following error:
java.util.DuplicateFormatFlagsException: Flags = '0'
at java.util.Formatter$Flags.parse(Formatter.java:4140)
at java.util.Formatter$FormatSpecifier.flags(Formatter.java:2555)
at java.util.Formatter$FormatSpecifier.<init>(Formatter.java:2625)
at java.util.Formatter.parse(Formatter.java:2480)
at java.util.Formatter.format(Formatter.java:2414)
at java.util.Formatter.format(Formatter.java:2367)
at java.lang.String.format(String.java:2769)
at org.cleartk.eval.Evaluation.createFoldName(Evaluation.java:114)
at org.cleartk.eval.EvaluationTest.testCreateFoldName(EvaluationTest.java:43)
Original comment by phi...@ogren.info
on 10 Feb 2011 at 7:34
Did you mean to name org.cleartk.eval.provider.CorpusReaderPipeline
org.cleartk.eval.provider.CorpusReaderProvider?
Original comment by phi...@ogren.info
on 10 Feb 2011 at 7:58
@Comment 27
I assume you were passing in either 0 or 1 total folds? Shouldn't that method
just throw an exception for 0 or 1 total folds? (A different, more informative
exception, of course.)
@Comment 28:
Yeah, I didn't rename anything you hadn't renamed, but I agree that should be
called Provider.
Original comment by steven.b...@gmail.com
on 11 Feb 2011 at 10:47
Fixed in r2445, with renamings etc. up through r2818.
Original comment by steven.b...@gmail.com
on 28 Apr 2011 at 10:02
Original issue reported on code.google.com by
steven.b...@gmail.com
on 7 Feb 2011 at 9:14