consider moving evaluation out of cleartk-ml into cleartk-ml-eval

ClearTK / cleartk

Machine learning components for Apache UIMA

http://cleartk.github.io/cleartk/

Other

129 stars 58 forks source link

consider moving evaluation out of cleartk-ml into cleartk-ml-eval #226

Closed bethard closed 9 years ago

bethard commented 9 years ago

Original issue 228 created by ClearTK on 2011-02-07T21:14:35.000Z:

So I've been playing around with the evaluation package, and it feels pretty clunky to me in a number of ways. I don't like the idea of labeling it as 0.9.9 (which is what it will get if it stays in cleartk-ml) because I think that may give the impression that the API are pretty much the way we want them. Could we move this out into a new module, say cleartk-ml-eval, and label it with 0.5.0 or 0.7.0 or something a little less committal than 0.9.9?

Some things that I think are clunky:

I don't like that EngineFactory forces you to use Descriptions. In many cases, it may be possible to use a single AnalysisEngine for both data writing and classifying by simply flipping the isTraining flag (e.g. via a few setConfigurationParameterValue calls and a reconfigure call). But to support this, the EngineFactory methods would need to return AnalysisEngines instead of AnalysisEngineDescriptions, so that you could have a single instance and just reconfigure it when the interface methods were called.
There aren't any default implementations of anything, so it makes it hard to be confident we have the abstraction right. Some things I think we should implement before we declare the APIs satisfactory:
- A FilesCorpusFactory that takes a list of training files and a list of testing files (to be read in as plain text using a FilesCollectionReader) and fills in all the CorpusFactory methods using this information
- An AnnotationSpanEvaluationFactory that takes an annotation type, and compares the spans of the gold annotations of this type to the model annotations using precision, recall and F-measure. And probably similarly for an AnnotationAttributeEvaluationFactory using accuracy.

Don't get me wrong - I don't think the evaluation package is horribly broken or anything. I just think we should label it as something less than a 0.9.9 release until we've pushed its boundaries a bit.

bethard commented 9 years ago

Comment #1 originally posted by ClearTK on 2011-02-07T21:44:10.000Z:

It has occurred to me a number of times recently that the evaluation code should sit in another module/project. It seems that most evaluation code we create could be re-used for inter-annotator agreement and in that context there probably isn't any ML involved. So, I would vote for cleartk-eval.

Is your first concern related to performance?

I agree that the version of the eval code shouldn't be the same as cleartk-ml. That said, I actually have 2 implementations of the interfaces - 1 for part-of-speech tagging (surprise!) and another for sentence segmentation. I also have code that the needs to be refactored for evaluating coordination structures. So, I have a bit more confidence that the interfaces as they are usable and I do have a fair bit of code lying around that will be added soon. I'm actually working on this now, sort of.

I agree that a FilesCorpusReader makes sense to have - but you might also consider using XmiCorpusFactory which I think is a better approach.

bethard commented 9 years ago

Comment #2 originally posted by ClearTK on 2011-02-08T07:23:50.000Z:

My concern is partly performance and partly ease of implementing the API. I feel like in most cases people have a single pipeline, and they just switch their cleartk annotator to isTraining=true during training and their cleartk annotator to isTraining=false during testing. I think we could accomplish this by working with a single pipeline directly and having the user implement methods like:

configureForDataWriting(AnalysisEngine pipeline);

I also feel like the EvaluationFactory is not an ideal API. For every evaluation I can imagine, I just want to implement a method like:

public void evaluation(JCas goldView, JCas systemView);

and maybe write results to stdout and/or a directory in collectionProcessComplete(). This should work exactly the same way for a cross validation - the evaluation infrastructure should stream the correct CASes from each fold through this one single annotator (not create the same evaluation annotator many times).

bethard commented 9 years ago

Comment #3 originally posted by ClearTK on 2011-02-08T16:30:22.000Z:

For comparison, here's the code that I ended up writing for cross-validation, which never re-creates an AnalysisEngine, and only reconfigures the AnalysisEngines that have to be reconfigured.

public static void crossValidate(int totalFolds, EvaluationPipeline pipeline) throws Exception {

// get the readers and engines
CollectionReader reader = pipeline.getCollectionReader();
AnalysisEngine[] trainEngines = pipeline.getTrainAnalysisEngines();
AnalysisEngine[] testEngines = pipeline.getTestAnalysisEngines();

// collect common meta data
List<ResourceMetaData> metaData = new ArrayList<ResourceMetaData>();
metaData.add(pipeline.getCollectionReader().getMetaData());
for (Resource resource : pipeline.getTrainAnalysisEngines()) {
  metaData.add(resource.getMetaData());
}
for (Resource resource : pipeline.getTestAnalysisEngines()) {
  metaData.add(resource.getMetaData());
}

// repeat for each cross-validation fold...
for (int fold = 0; fold < totalFolds; ++fold) {
  pipeline.configureForTraining(fold, totalFolds);

  // run through the training files
  CAS cas = CasCreationUtils.createCas(metaData);
  while (reader.hasNext()) {
    cas.reset();
    reader.getNext(cas);
    for (AnalysisEngine engine : trainEngines) {
      engine.process(cas);
    }
  }

  // close the training engines
  for (AnalysisEngine engine : trainEngines) {
    engine.collectionProcessComplete();
  }

  // train the model
  pipeline.train(fold, totalFolds);

  // configure for testing
  pipeline.configureForTesting(fold, totalFolds);

  // run through the testing files
  cas = CasCreationUtils.createCas(metaData);
  while (reader.hasNext()) {
    cas.reset();
    reader.getNext(cas);
    for (AnalysisEngine engine : testEngines) {
      engine.process(cas);
    }
  }

  // for testing, only a "batch" is done, not the whole collection
  for (AnalysisEngine engine : testEngines) {
    engine.batchProcessComplete();
  }
}

// close the test engines after all folds are complete
for (AnalysisEngine engine : testEngines) {
  engine.collectionProcessComplete();
}

}

public static interface EvaluationPipeline { public CollectionReader getCollectionReader();

public AnalysisEngine[] getTrainAnalysisEngines();

public AnalysisEngine[] getTestAnalysisEngines();

public void configureForTraining(int fold, int totalFolds)
    throws ResourceConfigurationException;

public void train(int fold, int totalFolds) throws Exception;

public void configureForTesting(int fold, int totalFolds) throws ResourceConfigurationException;

}

As an example, here's what my implementation of configureForTesting looks like:

public void configureForTesting(int fold, int totalFolds) throws ResourceConfigurationException {
  // collect the test names for this fold
  List<String> foldTestNamesList = new ArrayList<String>();
  for (int i = 0; i < this.trainNames.size(); ++i) {
    if (i % totalFolds == fold) {
      foldTestNamesList.add(this.trainNames.get(i));
    }
  }

  // configure for testing
  reader.setConfigParameterValue(
      FilesCollectionReader.PARAM_FILE_NAMES,
      foldTestNamesList.toArray(new String[foldTestNamesList.size()]));
  reader.reconfigure();
  goldAnnotator.setConfigParameterValue(StoryEventGoldAnnotator.PARAM_VIEW_NAME, "GoldView");
  goldAnnotator.reconfigure();
  systemAnnotator.setConfigParameterValue(CleartkAnnotator.PARAM_IS_TRAINING, false);
  systemAnnotator.setConfigParameterValue(
      GenericJarClassifierFactory.PARAM_CLASSIFIER_JAR_PATH,
      new File(this.getFoldDirectory(fold, totalFolds), "model.jar").getPath());
  systemAnnotator.reconfigure();
}

Basically I make some setConfigParameterValue calls followed by reconfigure calls for the annotators that need it. (If you look closely, you'll see I had to add a parameter CleartkAnnotator.PARAM_IS_TRAINING, but I think that's probably a useful parameter that we should add anyway.)

bethard commented 9 years ago

Comment #4 originally posted by ClearTK on 2011-02-08T16:56:11.000Z:

So, here are a few things to consider:

The difference between the pipeline that trains and the one that classifies might be quite small as you mention. It might also be quite large. You might have a lot of gold-standard data that you use for training (like tokens, sentences, pos tags, depedendencies, and semantic roles, for example.) When you classify you may not assume any of these things. For my coordination work I had three different corpora that had gold-standard coordination structures but differing amounts of other kinds of gold-standard data. One corpus had nothing but coordination structures (i.e. no tokens, no sentences, no part-of-speech tags.) And, of course, for the two corpora that had more data I experimented with using an automatic tokenizer and gold-standard tokens. Another difference between the two pipelines is that one operated on the gold view and the other worked on the system view and was begun with ViewTextCopierAnnotator(sp). I also had some pre-computed part-of-speech tags that were derived from "leave-one-file-out" tagger training across 36 files. When I ran my training pipeline and I wanted to use these tags the pre-computed tags were copied over the gold-standard tags for that run. When I ran the classifying pipeline I would copy the pre-computed tags from the gold view to the tokens in system view. So, while the two pipelines have a lot of similarities and they used a lot of the same convenience methods I created for setting them up - there were enough differences between the two that I was glad they were created by separate methods and I would not like to see them put back together.

Larry's group does not use views to separate gold annotations from system annotations - they do this using the type system. They have a notion of an annotation set. So, your proposed interface will not work for them. I like the idea of having a default implementation that has something like this.

I do not like the idea of pushing all folds through the same annotator. With the current interfaces it would be trivial to extend org.cleartk.evaluation.Evaluation (see runCrossValidation) so that you could run one fold at a time. This is nice for parallelization. It's great when you can run an entire cross-validation experiment on a single machine in a single run. But we should make it really easy to run each fold on 10 different machines - and not force sequential processing where it is not required. Also, it's nice to run a single fold - for whatever reason.

I also like having each fold have a complete set of evaluation results. I suppose you can have collectionProcessComlete() called multiple times (?) - but to me it feels right that a single fold is a complete pipeline in and of itself.

bethard commented 9 years ago

Comment #5 originally posted by ClearTK on 2011-02-08T17:44:53.000Z:

So I'm not arguing that the interface I ended up with is the right one for every task. What I'm saying is that the current evaluation interfaces make it impossible to do what I've done. I think the evaluation interfaces should make both my use case and your use cases possible. So the goal here should be to develop some interface that meets both of our needs.

A few specific comments:

"Larry's group does not use views to separate gold annotations from system annotations - they do this using the type system. They have a notion of an annotation set. So, your proposed interface will not work for them" I think that's not true. There's nothing about the EvaluationPipeline interface I proposed in Comment 3 that would keep them from doing that. In fact, with the EvaluationPipeline interface, there's no such thing as a special EvaluationFactory interface - you just create a normal AnalysisEngine and it does whatever you want. In my case, it compared two views, but in Larry's case, it would compare the differently typed annotations.

"I do not like the idea of pushing all folds through the same annotator." I like pushing them all through the same annotator because it simplifies the evaluation code - it can all be in the same plain old AnalysisEngine. Our APIs should support both use cases.

"I also like having each fold have a complete set of evaluation results." This is absolutely possible (and I'm doing it right now) with my proposed API - each fold is considered a "batch" and batchProcessComplete is called after each fold. So for me, fold = batch and cv = collection, while for you fold = collection and cv = multiple collections.

bethard commented 9 years ago

Comment #6 originally posted by ClearTK on 2011-02-08T19:32:57.000Z:

Somehow I didn't see the code in comment 3 when I posted comment 4 earlier.

The main, if not only, advantage I see is that when you run collectionProcessComplete you can write out all of the evaluation results that you've accumulated for all n folds rather than having to "aggregate" them (i.e. read results from individual folds from respective directories) with EvaluationFactory.aggregateEvaluationResults(). I agree that this is nice when everything can be run sequentially on a single box. But if you get five folds in and it blows up - can you recover the evaluation results that you need from the four folds that completed successfully? It doesn't seem very natural. With my solution, the evaluation engine writes out everything it needs to so that the results can be aggregated later. I suppose I could still do this.

I'm not crazy about having the collection reader coupled with the evaluation pipeline. I generally pre-process my corpus so that I have XMI files. However, for one corpus I didn't (for some very specific reason I can't remember now) and it requires special reading and pre-processing. Do I have to have a new implementation of EvaluationPipeline for this other corpus? It seems so.

Also, I like the simplicity of Evaluation.buildCorpusModel(). I think that there should be an obvious way to build the model for a corpus.

Similarly, I would like to keep the datawriting or classifying analysis engine(s) decoupled from the evaluation. Suppose you finish evaluation and you want to build the model and then tag a whole bunch of unlabeled text. With the current interface you just pass in a new collection reader and get an aggregate from EngineFactory.createClassifierAggregate() and go ahead and tag a bunch of text without worrying about an evalation AE blowing up because gold-standard data isn't available. This may seem out-of-scope for the project - but it seems so easy to do here since all the pieces are available. But why rebuild the part of the pipeline generated by your getTestAnalysisEngines() method somewhere else?

My feedback probably all sounds negative and/or defensive. I actually get what you are doing now and somewhat like it (the code helps!) I agree that we could just as easily have e.g. EngineFactory return an AnalysisEngine instead of an AnalysisEngineDescription. I'm not clear why it needs to be an array if we have aggregates.

I think the main differences between our approaches are:

My interfaces returns an AnalysisEngineDescription and yours returns an AnalysisEngine[]. I am fine with the latter but would prefer it not to be an array since aggregates are so easy to build and this simplifies the code of the caller.
My interfaces decouples the collection reader from the the training/classifying aggregates, from the evaluation AEs. Your interface puts them all together. I am not very happy about this.
Your interface makes intuitive use of batchProcessComplete and collectionProcessComplete. I like this a lot - esp. for aggregating the results of the evaluation AEs. I like my approach too - but I think this is a point where both approaches ought to be accomodated.
My approach (see Evaluation) imposes/assumes some structure of the folders that are produced during cross-validation. In particular, it assumes folders at all (see nearly every interface method signature, except, ironically CorpusFactory.) Your approach does not. My solution could be abstracted better to avoid assuming directories for everything. It makes just as much sense to have a scenario in which all evaluation for every fold and the aggregate results are all written to a single file. Your approach facilitates this.

Hopefully, this is a fairly comprehensive listing of the issues.

bethard commented 9 years ago

Comment #7 originally posted by ClearTK on 2011-02-08T23:38:43.000Z:

I'm not committed to AnalysisEngine[] vs. AnalysisEngine. I just used AnalysisEngine[] because that's what made the user's code the simplest. It's not much overhead for the user to wrap that in an aggregate.

I have no problem splitting the CollectionReader stuff from the AnalysisEngine stuff. In my case, I'll probably just have one class implement the two interfaces, but that's easy enough. I was considering factoring that out because it would be natural to have an implementation based on a FilesCollectionReader that does all the file name splitting my code above did.

I also think the core interfaces should not assume a directory. That doesn't mean the various annotators shouldn't use one - e.g. it would be trivial to add a PARAM_OUTPUT_DIRECTORY to my SpanEvaluationAnnotator so that it writes results for each fold (and for the overall evaluation) to the directory. But the basic evaluation interfaces should not force a directory upon you.

Other specific comments:

"But if you get five folds in and it blows up - can you recover the evaluation results" - sure, in the same way you do now - add a configuration parameter that gives you an output directory, and write intermediate results there.

"I like the simplicity of Evaluation.buildCorpusModel()" - In general, I think we should be trying to make user code simpler, and not worry too much how complicated our code looks. That said, the EvaluationPipeline equivalent would look like:

public static void train(EvaluationPipeline pipeline) throws Exception { CollectionReader reader = pipeline.getCollectionReader(); AnalysisEngine[] engines = pipeline.getTrainAnalysisEngines(); pipeline.configureForTraining(); SimplePipeline.runPipeline(reader, engines); pipeline.train() }

So that's certainly no more complicated than Evaluation.buildCorpusModel().

"I would like to keep the datawriting or classifying analysis engine(s) decoupled from the evaluation" - I would say that the datawriting and classifying analysis engines are 100% decoupled from the evaluation with the EvaluationPipeline interface - there is no assumption that you'll actually run an evaluation annotator, and no API required for such an annotator. You can do whatever you want in the evaluation annotator, including not adding one at all.

"Suppose you finish evaluation and you want to build the model and then tag a whole bunch of unlabeled text... This may seem out-of-scope for the project" Yes, I think this is out of scope for the evaluation project. How hard is it to create an separate aggregate for tagging new text when we're already creating a separate aggregate for training and testing?

bethard commented 9 years ago

Comment #8 originally posted by ClearTK on 2011-02-09T00:10:38.000Z:

I spent some time trying to refactor my solution to accommodate your use case. Attached is a diff file for the cleartk-ml project. The javadocs are out of date hear so don't squint too hard at them. There's a lot to say point-by-point about what I did here - but I don't have time to comment extensively just now.

bethard commented 9 years ago

Comment #9 originally posted by ClearTK on 2011-02-09T11:19:36.000Z:

Ok, I think we're converging. ;-) I refactored my solution to look a bit more like yours. Here are the interfaces I ended up with (an Evaluation class and a sample FilesReaderFactory implemented on top of these are attached):

public interface ReaderFactory { public CollectionReader getTrainFoldReader(int fold, int totalFolds) throws UIMAException; public CollectionReader getTestFoldReader(int fold, int totalFolds) throws UIMAException; public CollectionReader getTrainReader() throws UIMAException; public CollectionReader getTestReader() throws UIMAException; public CollectionReader getReader() throws UIMAException; }

public interface EngineFactory { public AnalysisEngine[] getTrainFoldDataWriterEngines(int fold, int totalFolds) throws UIMAException; public AnalysisEngine[] getTestFoldClassifierEngines(int fold, int totalFolds) throws UIMAException; public AnalysisEngine[] getTrainDataWriterEngines() throws UIMAException; public AnalysisEngine[] getTestClassifierEngines() throws UIMAException; public AnalysisEngine[] getDataWriterEngines() throws UIMAException; public void trainFold(int fold, int totalFolds) throws Exception; public void train() throws Exception; }

So what are the remaining differences?

(1) I ended up still returning AnalysisEngine[] instead of a single AnalysisEngine because as far as I can tell, there's no way with UimaFIT to create an aggregate from an AnalysisEngine[], only from an AnalysisEngineDescription[], and for my use case it has to be AnalysisEngines. I don't know enough about aggregate AnalysisEngines to know whether this is just not possible in UIMA or if we're just missing the convenience method in UimaFIT.

(2) I don't have any use for EvaluationFactory, so my Evaluation methods don't take one. This is easily solved - we should just have two versions of each method, one that takes an EvaluationFactory and one that doesn't.

(3) My EngineFactory doesn't assume that you'll be writing to a directory. So the EngineFactory methods parallel the ReaderFactory methods, e.g. getTrainFoldDataWriterEngines(int, int) parallels getTrainFoldReader(int, int). In your solution, EngineFactory.createDataWritingAggregate and EngineFactory.createClassifierAggregate both take a File parameter. (Note also that these two methods are also broken for my use case because they return AnalysisEngineDescriptions, not AnalysisEngines.)

(4) I don't have a getPreprocessor() method. I don't really understand the point of this method - if you have a preprocessor, why aren't you just including that in the aggregates created by createDataWritingAggregate() and createClassifierAggregate()?

(5) My ReaderFactory doesn't assume that there's a single set number of cross-validation folds for each corpus. So instead of making a corpus define CorpusFactory.numberOfFolds(), I just pass the total number of folds to all the get*FoldReader methods. This allows the user to decide how many folds they want at the time that they call Evaluation.crossValidate instead of hard-coding it in the CorpusFactory.

(6) In Evaluation.crossValidate, I still have 1 fold = 1 batch and all folds = 1 collection, while you still have 1 fold = 1 collection and all folds = many collections. This is the one issue I'm not sure how to resolve. Perhaps we should add two versions of the crossValidate method, one for each interpretation.

(7) Our naming schemes are a bit different. These are all minor points compared with the above, but I'll mention them here anyway. I like ReaderFactory better than CorpusFactory because it's returning CollectionReader objects, not Corpus objects, and I like having "Fold" or something like that in the methods like getTrainFoldReader or getTestFoldClassifierEngines because it makes it clearer to me that these methods are used for cross validations.

bethard commented 9 years ago

Comment #10 originally posted by ClearTK on 2011-02-09T16:12:03.000Z:

(1) I am fine with returning an array

(2) fine by me. I like the evaluation factory. I think the evaluation pipeline is decidedly different and separate from the training/classifying pipeline. My evaluation pipeline for my coordination task has 3 AEs in it. I also have two approaches for coordination that have a completely different set of AEs. So, I do not want to deal with cross-product issues either.

(3) No. The EngineFactory was refactored in the patch I attached to comment 8. The methods do not return AEDs and they do not take a directory.

(4) Preprocessing is specific to the corpus you are dealing with. If you are using XMI files, then you typically don't need this method because you have done all your preprocessing ahead of time. However, if you are dealing with some native file format for some annotated corpus, you will likely want to run some preprocessing annotators that, for example, parse XML (or PTB, etc.) into the annotations that are needed downstream. Corpora that ultimately provide the exact same kinds of annotations can come in many formats....

(5) I knew we would disagree about this one. I think that the number of folds should be determined by the corpus factory regardless of whether this is decided at runtime (this could be accomplished by an initialization parameter) or hard-wired in (for corpora where the folds are fixed.) Not every corpus is easily split into an arbitrary n number of folds. I don't understand why EngineFactory needs to know how many folds there are.

(6) In the patch I added to comment 8 I created a "complete" method for a given fold which is implemented by the EngineFactory. In your solution this method would call batchProcessComplete there. There is also an aggregateResults method where you would call collectionProcessComplete. I haven't decided exactly how I will use these two methods for the case where you write out evaluation data to be read in later for aggregating. I will say that after scrutinizing your crossValidate method in comment three - I do not think your use of batchProcessComplete and collectionProcessComplete is very consistent or intuitive. Why should a user call collectionProcessComplete for the training engines for each fold but call batchProcessComplete for the testing engines for the same folds? This is very confusing I think. This is another reason to abstract away these calls and put them in the factory as I have done here.

(7) I like "corpus" because it communicates a specific set of data that generally has a name and specific properties. How about CorpusReaderFactory? I wasn't consistent with my naming - I used runName and foldName where I meant to use runName consistently. I'm ambivalent about this. I am confused about what the name should be because there is no "fold" per se when you run holdout evaluation.

I am not crazy about "EngineFactory" - I would be happy to hear other suggestions.

bethard commented 9 years ago

Comment #11 originally posted by ClearTK on 2011-02-09T17:33:09.000Z:

(3) Yep, I misread the diff. They don't take a directory any more - they take a name string. The assumption is that the AnalysisEngine pipeline should never need to know whether it's being used for regular training or for cross validation training. I was trying not to make any assumption like that, but I guess it's not a crazy assumption to make. So I guess I'm okay with this. (Note that this answers part of your (5) question, why EngineFactory took the fold number.)

(4) I'm not really sold on the 1 corpus = 1 preprocessor. How I use the corpus depends on the preprocessing - if I want to use Penn Treebank to train a part of speech tagger, I may not want to load a bunch of trees. I think the preprocessing is pipeline dependent, not corpus dependent. But I guess it doesn't really hurt as long as it gets some better documentation explaining when you would put something with the corpus reader factory and when you would put something with the EngineFactory.

(5) But why not just throw an exception if you get a number of folds that is invalid for your corpus? You have to do that somewhere anyway. I guess I've just never worked with a corpus before that had set cross-validation folds, so conceptually it seems like the wrong place to put that.

(6a) "I created a "complete" method for a given fold which is implemented by the EngineFactory". You mean EvaluationFactory, right? So for my use case, how does the EvaluationFactory get access to the test engine to call batchProcessComplete in "complete" and collectionProcessComplete in "aggregateResults"?

(6b) "Why should a user call collectionProcessComplete for the training engines for each fold but call batchProcessComplete for the testing engines for the same folds?" Because you want to train N classifiers but you only want 1 final evaluation number. So there should be N collectionProcessComplete calls for training, and 1 collectionProcessComplete call for testing. But we don't have to agree on this as long as we support both approaches.

(7) "CorpusReaderFactory" - better for sure, though not amazing. ;-) "runName" - I think you should just call it "name" or "id" - it's just an arbitrary identifier made up by Evaluation, so reading anything more into it would be a mistake. "EngineFactory" - yeah, I didn't like this either because it has a "train" method.

bethard commented 9 years ago

Comment #12 originally posted by ClearTK on 2011-02-09T18:17:18.000Z:

(4) You can imagine that you might have a PTBCorpusReader_ImplBase from which two implementations that have different preprocess() implementations.

(5) My preference is to split my corpus up into fixed folds for maximum repeatability. So, that's what I've always done. It seems that some FileCorpusReader_ImplBase could be initialized with the number of folds at runtime and you could push the logic you had in configureForTesting in comment 3 into that class.

(6a) my mistake. Yes - the evaluation factory

(6b) this is confusing.

(7) yes - 'name' is better. One candidate might be CleartkAnnotatorFactory - it seems a bit narrow for what it does (considering it has a train and could have any number of preprocessing AEs for the target CleartkAnnotator) - but it communicates better than EngineFactory. Open for suggestions.

Are we still disagreeing on anything? I think I will commit my diff since it seems to be closer to a solution than the current code and we can go from there.

bethard commented 9 years ago

Comment #13 originally posted by ClearTK on 2011-02-09T21:18:54.000Z:

Yes, please commit your current diff. And if it's not too much work, move things to cleartk-eval with maybe version 0.7.0 or something like that? I think we're converging to a good solution here, but we should definitely give it a bit of time and experience before we mark it 0.9.9.

For (5), how about we have a getNumberOfFolds and setNumberOfFolds? That would make it feel less like there was a single "correct" number of folds for a given corpus, and you could put in setNumberOfFolds any code you need for checking that you've received a valid number of folds.

For (6), I'm still not sure how my EvaluationFactory is supposed to get to the AnalysisEngines to call collectionProcessComplete and/or batchProcessComplete. This is the one serious problem remaining for me.

For (7), I'm not convinced yet that CleartkAnnotatorFactory is better than EngineFactory, so I guess just leave it for the moment.. I'll try to brainstorm for better names.

bethard commented 9 years ago

Comment #14 originally posted by ClearTK on 2011-02-09T22:28:48.000Z:

(repeated from email I meant to post as a comment)

I committed the code earlier today.

I will also create the new project and move it over.

(5) sounds good.

(6) look at how Evaluation.java calls EvaluationFactory.writeResults() and EvaluationFactory.aggregateResults(). (I just renamed writeResults in a commit from a couple minutes ago.)

(7) I renamed EngineFactory to CleartkAnnotatorFactory before I committed the code. I will leave it that way until we come up with a better name. Given that everything called "Factory" has no "create" methods anymore we should stop calling them factories. I described it to Chris and he suggested "Cache". Maybe "Provider" would work. I sort of like that better. Here's a straw man:

CorpusReaderProvider CleartkPipelineProvider EvaluationProvider

bethard commented 9 years ago

Comment #15 originally posted by ClearTK on 2011-02-09T22:33:40.000Z:

(5) Since I have a corpus of tutorial dialogue transcripts, I was thinking that instead of n-fold cross-validation, I would be doing leave-one-student-out cross-validation. How does that play with this situation? Is it the same as when I have pre-specified folds?

bethard commented 9 years ago

Comment #16 originally posted by ClearTK on 2011-02-09T23:05:53.000Z:

@Philip:

(6) I still don't understand how I can implement EvaluationFactory.writeResults() so that it calls batchProcessComplete() - the EvaluationFactory doesn't have any access to the pipeline, only the EngineFactory has access to the pipeline. Or are you suggesting that I write my own Evaluation.java that calls batchProcessComplete() instead of EvaluationFactory.writeResults() and collectionProcessComplete() instead of EvaluationFactory.aggregateResults()? (Hopefully not - I think we should arrange the API so that Evaluation.java works for all our use cases.)

(7) +0.5 on all the Provider names - definitely the best I've seen from what we had so far.

@Lee

(5) If you know you'll only ever do leave-one-out cross-validation on that corpus, then define getNumberOfFolds() to return the number of transcripts and setNumberOfFolds() to throw UnsupportedOperationException. If you think you might do other types of cross-validation, but you want to make it easy to do leave-one-out cross-validation when you want to, then you might add to that specific CorpusReaderProvider a method like getNumberOfTranscripts() which you can then pass to setNumberOfFolds().

bethard commented 9 years ago

Comment #17 originally posted by ClearTK on 2011-02-09T23:07:14.000Z:

as long as you know how many "instances" you have then it should be easy to specify the number of folds for leave-one-out. Still, it seems that there should be an ImplBase that does this for you. This should be possible with the current interface.

bethard commented 9 years ago

Comment #18 originally posted by ClearTK on 2011-02-09T23:14:08.000Z:

(5) I should clarify. I am doing dialog move classification on the turns within a transcript. Also a given student can have multiple transcripts, consequently my folds will be designed to ensure I'm not training and testing on the utterances of a student. In my case leave-one-out is not leave-one-instance out but leave all of a student's utterance instances out.

bethard commented 9 years ago

Comment #19 originally posted by ClearTK on 2011-02-09T23:16:19.000Z:

@Steve: (6) right. I forgot that you are going to put your evaluation AEs in the list of analysis engines returned by (current working name) CleartkPipelineProvider. This would work if you would be willing to implement an (current working name) EvaluationPipelineProvider. The idea is that your implementation of EvaluationPipelineProvider would call batchProcessComplete in writeResults on those analysis engines that that need to write evaluation data.

Let me fix up the code that I'm working on assuming you would be willing to do this so that we have a version we can look at with the current names and some of the additional changes above. If you can't warm up to the notion of using the EvaluationPipelineProvider, then we will have to figure out how to give you the hooks that you need.

bethard commented 9 years ago

Comment #20 originally posted by ClearTK on 2011-02-09T23:29:46.000Z:

Yep, fix up the code and I'll take a look at it tomorrow.

I was already assuming i would implement a EvaluationPipelineProvider to call batchProcessComplete and collectionProcessComplete (which I'm fine with implementing) but I'm still not clear on how that EvaluationPipelineProvider would get access to the pipeline that was being finished because writeResults takes only a String parameter, not an AnalysisEngine or AnalysisEngine[] parameter.

I'm pretty sure that even for the other use cases, your code isn't even calling collectionProcessComplete() for the "classifierAggregate". (It is for the "dataWritingAggregate" implicitly through SimplePipeline.runPipeline though.) At first I though that was because you intended to call collectionProcessComplete() from evaluationFactory.writeResults() but that's not possible with the current API.

(Also, we probably shouldn't use JCasIterable or SimplePipeline.runPipeline in Evaluation.java - these are convenience methods that swallow some UIMA exceptions and turn them into RuntimeExceptions, which is fine for a simple main method, but probably shouldn't be used in other code. We could fix SimplePipeline.runPipeline to not use JCasIterable, and perhaps we should do that, but there's no real way to fix JCasIterable, so we should just avoid using it for anything but simple main methods.)

bethard commented 9 years ago

Comment #21 originally posted by ClearTK on 2011-02-10T01:06:28.000Z:

I have committed all the changes that I have for now. Please scrutinize.

I think I am both confused and confusing. Because EvaluationPipelineProvider produced a list of analysis engines for a given name it can call batchProcessComplete (as it sees fit) on the same analysis engines (since it is given the name.) However, callers of aggregateResults will have to be aware that the EvaluationPipelineProvider will aggregate results for whatever pipelines it has served up - i.e. you wouldn't want to run the holdout evaluation and then cross validation because then the aggregation will include evaluation data from both. Maybe we should have a reset() method of some sort?

I cleaned up the code a bit so that it isn't calling SimplePipeline any more and it is clear where I am calling collectionProcessComplete and where I'm not. That said, the code is wrong and I am not sure what to do. I rather think that we shouldn't be calling collectionProcessComplete on any of the analysis engines and that we should send this back to the providers to do as they see fit. So, this makes me think that CleartkPipelineProvider needs methods analogous to writeResults() and aggregateResults(). It would be nice if the methods were the same for both. Maybe pipelineComplete(name) and evaluationComplete()?

bethard commented 9 years ago

Comment #22 originally posted by ClearTK on 2011-02-10T01:08:31.000Z:

I just noticed a potential hole in our CleartkPipelineProvider. The train method accepts a name and a single set of training arguments. This assumes that there is a single model being built, or that all the models being built will take the same training arguments. This is fine for many scenarios (both of which I've found myself in.) But I can imagine that for Lee's use - he is going to want to train different models using different learners each with different training arguments to evaluate his system. This is true not just for CleartkMultiAnnotator but also for cases where a pipeline has multiple CleartkAnnotators. I suppose since he will be implementing the train method for his own evaluation infrastructure he can specify the strings to be whatever he wants them to be - perhaps using Args4j to parse argument variables.

bethard commented 9 years ago

Comment #23 originally posted by ClearTK on 2011-02-10T01:19:52.000Z:

While I would love the flexibility to tweak each classifier, it doesn't seem practical to be tweaking the learner parameters for 30+ different classifiers. As it stands now, CleartkMultiAnnotator only takes one learner, and we would need a more sophisticated mechanism to pair each classifier with a different learning algorithm.

FWIW, I will probably focus more on experiments (in the near term at least) where I turn sets of features off globally for all of my classifiers.

bethard commented 9 years ago

Comment #24 originally posted by ClearTK on 2011-02-10T12:22:45.000Z:

Ok, I see how it works now. EvaluationPipelineProvider.getEvaluationPipeline doesn't provide the whole testing + results pipeline, just the evaluation bit at the end.

I tried to update my code to use the new interfaces. Some problems I ran into:

= get_Reader call order = Evaluation is currently broken for my use case because it calls getTrainReader, then getTestReader, and then runs pipelines. But since my train and test readers are the same reader, just reconfigured, the train pipeline actually gets the test reader files, and the test pipeline gets no CASes at all because they were used up during training. The rule should be "never call get_Reader until the moment you're actually going to use it". Basically, we need to get rid of Evaluation.run, which violates this assumption.

= Fold Numbering = runCrossValidation assumes that folds are numbered starting with 1. This seems inconsistent with standard Java conventions. Can we have 0 based indexing here?

= Declared Exceptions = We need to think about what exceptions all the methods should throw. For example, for my use case, I'm typically calling ConfigurableResource.reconfigure() in all the CorpusReaderPipeline methods and the the CleartkPipelineProvider.get*Pipeline methods, and that throws a ResourceConfigurationException, not a ResourceInitializationException. As another example, in EvaluationPipelineProvider.writeResults, I'm calling batchProcessComplete, and that throws a AnalysisEngineProcessException. Perhaps all the methods should be throwing UIMAException? The one exception would be org.cleartk.eval.provider.CleartkPipelineProvider.train which should throw Exception to match JarClassifierBuilder.trainClassifier.

= Evaluation.buildCorpusModel = I think Evaluation.buildCorpusModel still has the wrong signature. Why is there a File there? I think it should just be a String id (a.k.a. runName) like everywhere else. Also, Evaluation.buildCorpusModel shouldn't be calling dataWritingPipeline.add - we shouldn't be modifying the list returned by getTrainingPipeline.

= JarClassifierBuilder.trainAndPackage = I don't think Evaluation.train should delegate to Train.main. Instead, it should just use the following code:

JarClassifierBuilder<?> classifierBuilder = JarClassifierBuilder.fromTrainingDirectory(modelDirectory);
classifierBuilder.trainClassifier(modelDirectory, trainingArguments);
classifierBuilder.packageClassifier(modelDirectory);

Perhaps instead of having Evaluation.train, we should add a static method to JarClassifierBuilder? I mean, if we're going to assume that it's a JarClassifierBuilder directory (as Train.main does), then JarClassifierBuilder is probably the right place for that method.

= collectionProcessComplete = "CleartkPipelineProvider needs methods analogous to writeResults() and aggregateResults() ... Maybe pipelineComplete(name) and evaluationComplete()?" Yes, this sounds good to me, but in your use case you don't keep a reference to the training and evaluation pipelines around, so you won't be able to call collectionProcessComplete from just a String identifier. So I think the signatures need to be:

void trainingPipelineComplete(String, List<AnalysisEngine>)
void classifyingPipelineComplete(String, List<AnalysisEngine>)
void evaluationPipelineComplete(String, List<AnalysisEngine>)

These methods would be called every time you go through a pipeline, e.g. once for a train/test run, and once for each fold for a cross-validation run. We'd also need the methods:

void trainingComplete()
void classifyingComplete()
void evaluationComplete()

These methods would be called exactly once, at the end of runCrossValidation(), runHoldoutEvaluation() and buildCorpusModel(). I think that should work for both of our use cases:

For training and classifying, I would call collectionProcessComplete() in trainingPipelineComplete() and classifyingPipelineComplete(). For evaluation, I would call batchProcessComplete() in evaluationPipelineComplete() and collectionProcessComplete() in evaluationComplete().
You would call collectionProcessComplete() in all the xxxPipelineComplete() methods, and do whatever you used to do in aggregateResults() in evaluationComplete().

Attached is a patch which makes all the changes proposed above.

bethard commented 9 years ago

Comment #25 originally posted by ClearTK on 2011-02-10T18:22:56.000Z:

= get*Reader call order = My bad. Looks better now.

= Fold Numbering = true. I wanted to have the folder name of the first fold to be "fold-1" instead of "fold-0". This can easily be done in createFoldName (which has a minor bug related to this issue - either the digit that is used for the file name needs to be incremented or the total folds needs to be decremented.) I will fix the method. Btw - I asked you for that string formatting trickiery a few weeks ago and you didn't know what I was talking about.... ;)

= Declared Exceptions = sounds good.

= Evaluation.buildCorpusModel = I forgot to revisit this method. looks better now.

I agree with the other points you make above.

Hopefully, we haven't worked too hard to accommodate our two use cases. The code doesn't look all that complicated - so hopefully the abstraction is good. I will have a better feeling about it after I port my evaluation code to this. We definitely need to document this code extensively before closing this issue.

I am refactoring XmiCorpusFactory so that it doesn't assume fixed folds. I will commit this today.

bethard commented 9 years ago

Comment #26 originally posted by ClearTK on 2011-02-10T18:48:28.000Z:

Yep, let me know how your port goes. If everything works out with your port, we'll go for a documentation spree on these interfaces.

"createFoldName (which has a minor bug related to this issue - either the digit that is used for the file name needs to be incremented or the total folds needs to be decremented.)" Are you sure? It looks right to me:

scala> import org.cleartk.eval.Evaluation.createFoldName import org.cleartk.eval.Evaluation.createFoldName

scala> for (i <- 0 until 9) yield createFoldName(i, 9)
res1: scala.collection.immutable.IndexedSeq[java.lang.String] = Vector(0, 1, 2, 3, 4, 5, 6, 7, 8)

scala> for (i <- 0 until 10) yield createFoldName(i, 10) res2: scala.collection.immutable.IndexedSeq[java.lang.String] = Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)

scala> for (i <- 0 until 11) yield createFoldName(i, 11) res3: scala.collection.immutable.IndexedSeq[java.lang.String] = Vector(00, 01, 02, 03, 04, 05, 06, 07, 08, 09, 10)

But yes, I'm fine with you modifying it so that it numbers from 1 instead of 0 when it creates the name.

"I asked you for that string formatting trickiery a few weeks ago and you didn't know what I was talking about". Actually you asked me about "creating a string of n repeating characters" which is different from asking to fill in zeros before an integer. There's a special string format operation for prefixing zeros (e.g. %04d means 4 digits, filling with zeros as necessary) while there's no special string format operation for the general case.

bethard commented 9 years ago

Comment #27 originally posted by ClearTK on 2011-02-10T19:34:03.000Z:

When I run EvalationTest I get the following error:

java.util.DuplicateFormatFlagsException: Flags = '0' at java.util.Formatter$Flags.parse(Formatter.java:4140) at java.util.Formatter$FormatSpecifier.flags(Formatter.java:2555) at java.util.Formatter$FormatSpecifier.(Formatter.java:2625) at java.util.Formatter.parse(Formatter.java:2480) at java.util.Formatter.format(Formatter.java:2414) at java.util.Formatter.format(Formatter.java:2367) at java.lang.String.format(String.java:2769) at org.cleartk.eval.Evaluation.createFoldName(Evaluation.java:114) at org.cleartk.eval.EvaluationTest.testCreateFoldName(EvaluationTest.java:43)

bethard commented 9 years ago

Comment #28 originally posted by ClearTK on 2011-02-10T19:58:41.000Z:

Did you mean to name org.cleartk.eval.provider.CorpusReaderPipeline org.cleartk.eval.provider.CorpusReaderProvider?

bethard commented 9 years ago

Comment #29 originally posted by ClearTK on 2011-02-11T10:47:18.000Z:

@Comment 27 I assume you were passing in either 0 or 1 total folds? Shouldn't that method just throw an exception for 0 or 1 total folds? (A different, more informative exception, of course.)

@Comment 28: Yeah, I didn't rename anything you hadn't renamed, but I agree that should be called Provider.

bethard commented 9 years ago

Comment #30 originally posted by ClearTK on 2011-04-28T10:02:08.000Z:

Fixed in r2445, with renamings etc. up through r2818.