ClearTK / cleartk

Machine learning components for Apache UIMA
http://cleartk.github.io/cleartk/
Other
129 stars 58 forks source link

Multiple Classifier CleartkAnnotator #224

Closed bethard closed 9 years ago

bethard commented 9 years ago

Original issue 226 created by ClearTK on 2011-02-03T18:31:48.000Z:

[Lee]

I was thinking since I have to do it anyway for my work, I should go ahead and write a CleartkMultiAnnotator class. The question is what can one reasonably expect from such a class? It looks like CleartkAnnotator doesn't provide much save for initialization of the classifier and data writer, and I was thinking CleartkMultiAnnotator would just provide help in dealing with collections of classifiers/ data writers. Based on conversations with Philip and my own needs, I was thinking that the pieces that differentiate CleartkMultiAnnotator from CleartkAnnotator would look something like this:

public abstract class CleartkAnnotator<KEY_TYPE, OUTCOME_TYPE> extends JCasAnnotator_ImplBase implements Initializable {

 protected Map&lt;KEY_TYPE, Classifier&lt;OUTCOME_TYPE&gt; classifiers;
 protected Map&lt;KEY_TYPE, DataWriter&lt;OUTCOME_TYPE&gt; dataWriters;

 // Retrieves or creates classifier as needed
 getClassifier(KEY_TYPE key);

// Retrieves or creates dataWriter as needed
getDataWriter(KEY_TYPE key);

}

Can you think of anything else that would be useful? I know when Philip implemented something similar for his own purposes he had an output directory configuration parameter so that he could squirrel away the classifiers and data writers in jar files.

[Steve] Thanks, this sounds like a great contribution!

This looks like basically what I had in mind, except I would just drop the generics for KEY_TYPE and use String everywhere. Then you can have a String[] PARAM_CLASSIFIER_NAMES @ConfigurationParameter, and the DataWriters can all delegate to PARAM_OUTPUT_DIRECTORY/name. Within those subdirectories, you should be able to package and unpackage jars using the usual ClearTK machinery.

By the way, I think you probably don't need the getter methods if the Maps are protected. No one but subclasses should really be asking for the DataWriters and Classifiers here, right?

[Philip] The getters can be protected. In my solution, I used getClassifier(keyType) to get a classifier if it was already available. If not, then it would initialize the classifier. I like doing this because you may not need every classifier that's been created for/from the training data when you are running it on your evaluation data because you may (perhaps inadvertently) have some key types that are very sparse and never appear in your evaluation data.

I agree that KEY_TYPE may be unnecessary and String may suffice. Although, I can see that a boolean would be quite reasonable for some scenarios.

I would rather not use the word "key" or "type" for the naming the key of the map as they are both overloaded terms. Maybe "name"?

Finally, it would be nice if we were able to abstract out the output directory as a parameter to this analysis engine as we do with CleartkAnnotator. I don't want to wire in our jar-based solution into this new class.

[Lee] I'm leaning toward just using String like Steve suggested. If someone really needed the boolean condition they could just create appropriate strings to describe their classifiers.

As per the getters, I go back and forth on whether or not they are needed. They are handy for the on-the-fly creation like Philip described, but they also muddle the semantics when using these classifier/datawriter maps. It's not clear when writing an extension of this proposed class whether one should use the maps directly or should only access them through the getters.

Lastly, does anyone have a good name for this class?

bethard commented 9 years ago

Comment #1 originally posted by ClearTK on 2011-02-03T18:41:07.000Z:

I like the getters/setters and am not worried about whether or not an extension is going to directly access the map or not (we could make the map private, too.) I don't like the idea that we have to initialize all of the data writers and classifiers in the initialize method. If you train on 10M instances (maybe you are doing something semi-supervised) and test on 1K, then you might find yourself loading 100 classifiers that are never used for your 1000 test instances.

One possible name is org.cleartk.classifier.multi.CleartkAnnotator (note package name). Or org.cleartk.classifier.multi.MultiCleartkAnnotator. Either way, I think it makes sense to put this code in its own package.

bethard commented 9 years ago

Comment #2 originally posted by ClearTK on 2011-02-03T20:08:33.000Z:

I like this package/name combination best org.cleartk.classifier.multi.MultiCleartkAnnotator so that people don't look at examples and confuse CleartkAnnotator with this new code.

To do this correctly, it looks like I will probably need to write a new JarClassifierFactory that accepts a output directory and a name instead of just a single path variable. Would these companion factories go in org.cleartk.classifier.multi, org.cleartk.classifer.jar or some new place? Also, I will need to add a createClassifier(String name) method to some ClassifierFactory interface. Do I add to the existing ClassifierFactory interface or spin my own MultiClassifierFactory?

bethard commented 9 years ago

Comment #3 originally posted by ClearTK on 2011-02-03T20:58:43.000Z:

I agree that it is nice to have different names.

I think a new MultiClassifierFactory should go in the 'multi' package.

Your new JarClassifierFactory should probably go in a multi.jar package - but should still reuse as much as possible from the jar package.

bethard commented 9 years ago

Comment #4 originally posted by ClearTK on 2011-02-03T20:59:48.000Z:

I agree that it is nice to have different names.

I think a new MultiClassifierFactory should go in the 'multi' package.

Your new JarClassifierFactory should probably go in a multi.jar package - but should still reuse as much as possible from the jar package.

bethard commented 9 years ago

Comment #5 originally posted by ClearTK on 2011-02-04T04:33:26.000Z:

I am in the process of converting CleartkAnnotatorTest into unit tests for the CleartkMultiAnnotator. However, I'm unsure about what to do with the test known as testDescriptor().

In CleartkAnnotatorTest, it receives ResourceInitializationExceptions that are thrown by the initialize() method because of missing a missing output directory or classifier jar file. However with CleartkAnnotatorTest, the dataWriters and classifiers don't get created nor initialized until something calls getClassifier(name) or getDataWriter(name).

Should I be trying to manually check these conditions in the initialize() method? If so, how would I do that?

Alternatively, I can alter the unit test to check that this exception is thrown after a call to getClassifier or getDataWriter.

bethard commented 9 years ago

Comment #6 originally posted by ClearTK on 2011-02-05T05:10:08.000Z:

As I continue to plow down this path, I'm hitting some road blocks with when it comes to passing in a dataWriterFactory. I wrote my own class derived from CleartkMultiAnnotator and specified the DefaultBinaryMalletDataWriterFactory for my DataWriterFactory, only to find that all of my instances were getting put into a single training-data.mallet file.

While I can specify the PARAM_OUTPUT_DIRECTORY as a configuration parameter, I have no way to update this within the getDataWriter() method as there is no method within DataWriterFactory to change the pre-specified output directory.

I believe this is less of an issue with classifiers, because I wrote a new class called JarMultiClassifierFactory with a method createClassifier(name), which can specify the path as needed. I could probably do something similar and write a MultiDataWriterFactory, but then I would need to write a Multi version of every DataWriterFactory. Alternatively, I could make it use a DirectoryDataWriterFactory, but that would prevent CleartkMultiAnnotator from working with several of the other types of DataWriters.

Hopefully one of you true ClearTK sages can point me in the right direction.

bethard commented 9 years ago

Comment #7 originally posted by ClearTK on 2011-02-05T09:16:40.000Z:

I see two solutions, neither of them perfect.

You could do something like what ViterbiDataWriterFactory does - cast its UimaContext to a UimaContextAdmin, set the output directory for the data writer you're delegating to, and then restore the original output directory when you're done. This is a little hacky because you aren't really supposed to use the UimaContextAdmin interface.

Alternatively, you could add getOutputDirectory and setOutputDirectory methods to DirectoryDataWriter. Then in CleartkMultiAnnotator, you'd just cast your DataWriter to DirectoryDataWriter and set the output directory that way. This is a little hacky because the cast means it won't work with general DataWriters (though it will probably work with all the ones you care about...)

bethard commented 9 years ago

Comment #8 originally posted by ClearTK on 2011-02-05T19:53:31.000Z:

Not really able to discern which is the lesser of two evils, I am inclined to use the second solution, but I think it's even hackier than you described. For many dataWriters, much of the configuration takes place in the constructor. Even if I was to add a setOutputDirectory() method, the call would occur too late to enact any change. Playing with the code a bit, I've found that I can do an unchecked cast on my DataWriterFactory to a DirectoryDataWriter and then call its setOutputDirectory prior to making calls to createDataWriter().

Again this approach is hacky because it means the class won't work with general DataWriterFactories and consequently general DataWriters.

bethard commented 9 years ago

Comment #9 originally posted by ClearTK on 2011-02-06T09:37:40.000Z:

I thought about this a bit more, and I'm a little less worried about it not working with general DataWriterFactories because the moment you refer to PARAM_OUTPUT_DIRECTORY (which you have to do for your use case), you're already assuming a DirectoryDataWriterFactory. So I think casting to one is fine. Just document somewhere that this code assumes a DirectoryDataWriterFactory or subclass.

bethard commented 9 years ago

Comment #10 originally posted by ClearTK on 2011-02-07T17:07:45.000Z:

I do not see any reason that CleartkMultiAnnotator has to know any thing about the data writer factory except that it returns a data writer for a given type/label/name for the map of data writers. The data writer factory for CleartkMultiAnnotator is going to be a new interface with one method:

public DataWriter createDataWriter(String name)

The implementation should be able to delegate to existing data writer factories by updating the output directory. That is, go ahead and add setOutputDirectory to the DirectoryDataWriter - but don't have CleartkMultiAnnotator call this method - have the MultiAnnotatorDataWriterFactory (or whatever it is called) do this. This way MultiCleartkAnnotator is not coupled to DirectoryDataWriter.

bethard commented 9 years ago

Comment #11 originally posted by ClearTK on 2011-02-07T17:23:08.000Z:

I'm not sure writing a MultiDataWriterFactory is necessarily the right approach. Currently the dataWriterFactory is passed in as a configuration parameter to CleartkMultiAnnotator. If I was to create a separate MultDataWriterFactory, I would have to create a Multi version for every DataWriterFactory, thus to use the DefaultBinaryMalletDataWriterFactory, I would have to create a DefaultBinaryMalletMultiDataWriterFactory that inherits from MultiDataWriterFactory.

bethard commented 9 years ago

Comment #12 originally posted by ClearTK on 2011-02-07T17:23:53.000Z:

Hmm.... it just occurred to me that I could create a MultiDataWriterFactory that took as a regular DataWriterFactory class as a configuration parameter. Does that make sense? Is it possible to chain factories like that?

bethard commented 9 years ago

Comment #13 originally posted by ClearTK on 2011-02-07T17:30:39.000Z:

"The implementation should be able to delegate to existing data writer factories by updating the output directory" - use of PARAM_OUTPUT_DIRECTORY means that the MultiAnnotatorDataWriterFactory implementation will have to assume that the existing data writer factories have a PARAM_OUTPUT_DIRECTORY (i.e. that they're a subclass of DirectoryDataWriter). So you've just moved the assumption from MultiCleartkAnnotator to MultiAnnotatorDataWriterFactory. That's okay with me - but we should clearly document this assumption somewhere, regardless of whether it ends up in MultiCleartkAnnotator or MultiAnnotatorDataWriterFactory.

bethard commented 9 years ago

Comment #14 originally posted by ClearTK on 2011-02-07T17:57:58.000Z:

re 12 - yes, that makes sense - exactly what I was thinking. Although I don't think of it as "chaining" - I think of it as delegation.

I am fine with MultiAnnotatorDataWriterFactory being coupled with DirectoryDataWriter - much preferred over having MultiCleartkAnnotator.

This really highlights the fact that what we should be careful naming these classes. There's a few different names used above. Here is my suggestion - name the interface MultiDataWriterFactory and name its one implementation MultiDirectoryDataWriterFactory.

bethard commented 9 years ago

Comment #15 originally posted by ClearTK on 2011-02-07T23:15:26.000Z:

The code is checked into revision 2430. Thanks for all the help and feedback.

bethard commented 9 years ago

Comment #16 originally posted by ClearTK on 2011-02-08T00:14:11.000Z:

Lee,

This is really great. I have immediate use for this as I port my coordination code to cleartk-syntax-coordination.

I have reopened this issue to address a few things:

1) You have touched on some of the real ugliness of Java generics in this (do you feel diseased!?) I get this stuff messed up all the time. I don't think you actually made a mistake, per se - but I would suggest making the member variables multiDataWriterFactory and multiClassifierFactory typed with OUTCOME_TYPE instead of '?'. See the attached. This causes one test to fail - but that's because the test is wrong with this version.

2) We should revisit what exception getClassifier and getDataWriter should throw. ResourceInitializationException seems conceptually correct. However, it will always be called inside a subclass preprocess method and so it will have to get caught and an AnalysisEngineProcessException thrown. So, I would vote that these methods throw AnalysisEngineProcessException.

3) What happens if the data writer factory is instantiated but getClassifier() is called? I suppose it blows up because a NullPointerException is thrown. I guess that is fine by me. Still, we might consider throwing an exception instead to ease debugging.

That's all I noticed. I only looked at CleartkMultiAnnotator - so I have no critique of the other classes yet.

bethard commented 9 years ago

Comment #17 originally posted by ClearTK on 2011-02-08T00:17:07.000Z:

I added license info to two files in r2431.

bethard commented 9 years ago

Comment #18 originally posted by ClearTK on 2012-02-12T17:05:50.000Z:

<empty>

bethard commented 9 years ago

Comment #19 originally posted by ClearTK on 2012-02-12T22:01:42.000Z:

I think we may need to address comment # 16 before closing this issue. I'm reopening so that it won't get lost....

bethard commented 9 years ago

Comment #20 originally posted by ClearTK on 2012-07-24T17:58:12.000Z:

<empty>

bethard commented 9 years ago

Comment #21 originally posted by ClearTK on 2012-07-31T22:07:35.000Z:

We should decide whether to keep this. If it's not useful, we may want to deprecate it.

bethard commented 9 years ago

Comment #22 originally posted by ClearTK on 2012-08-06T19:38:00.000Z:

Fixed in r3968.