fangfangli / cleartk

Automatically exported from code.google.com/p/cleartk
0 stars 0 forks source link

Multiple Classifier CleartkAnnotator #226

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
[Lee]

I was thinking since I have to do it anyway for my work, I should go
ahead and write a CleartkMultiAnnotator class.  The question is what
can one reasonably expect from such a class?  It looks like
CleartkAnnotator doesn't provide much save for initialization of the
classifier and data writer, and I was thinking CleartkMultiAnnotator
would just provide help in dealing with collections of classifiers/
data writers.   Based on conversations with Philip and my own needs, I
was thinking that the pieces that differentiate CleartkMultiAnnotator
from CleartkAnnotator would look something like this:

public abstract class CleartkAnnotator<KEY_TYPE, OUTCOME_TYPE> extends
JCasAnnotator_ImplBase implements
     Initializable {

     protected Map<KEY_TYPE, Classifier<OUTCOME_TYPE> classifiers;
     protected Map<KEY_TYPE, DataWriter<OUTCOME_TYPE> dataWriters;

     // Retrieves or creates classifier as needed
     getClassifier(KEY_TYPE key);

    // Retrieves or creates dataWriter as needed
    getDataWriter(KEY_TYPE key);

}

Can you think of anything else that would be useful?  I know when
Philip implemented something similar for his own purposes he had an
output directory configuration parameter so that he could squirrel
away the classifiers and data writers in jar files.

[Steve]
Thanks, this sounds like a great contribution!

This looks like basically what I had in mind, except I would just drop
the generics for KEY_TYPE and use String everywhere. Then you can have
a String[] PARAM_CLASSIFIER_NAMES @ConfigurationParameter, and the
DataWriters can all delegate to PARAM_OUTPUT_DIRECTORY/name. Within
those subdirectories, you should be able to package and unpackage jars
using the usual ClearTK machinery.

By the way, I think you probably don't need the getter methods if the
Maps are protected. No one but subclasses should really be asking for
the DataWriters and Classifiers here, right?

[Philip]
The getters can be protected.  In my solution, I used getClassifier(keyType) to 
get a classifier if it was already available. If not, then it would initialize 
the classifier.  I like doing this because you may not need every classifier 
that's been created for/from the training data when you are running it on your 
evaluation data because you may (perhaps inadvertently) have some key types 
that are very sparse and never appear in your evaluation data.  

I agree that KEY_TYPE may be unnecessary and String may suffice.  Although, I 
can see that a boolean would be quite reasonable for some scenarios.  

I would rather not use the word "key" or "type" for the naming the key of the 
map as they are both overloaded terms.  Maybe "name"?  

Finally, it would be nice if we were able to abstract out the output directory 
as a parameter to this analysis engine as we do with CleartkAnnotator.  I don't 
want to wire in our jar-based solution into this new class.  

[Lee]
I'm leaning toward just using String like Steve suggested.  If someone
really needed the boolean condition they could just create appropriate
strings to describe their classifiers.

As per the getters, I go back and forth on whether or not they are
needed.   They are handy for the on-the-fly creation like Philip
described, but they also muddle the semantics when using these
classifier/datawriter maps.  It's not clear when writing an extension
of this proposed class whether one should use the maps directly or
should only access them through the getters.

Lastly, does anyone have a good name for this class?

Original issue reported on code.google.com by phi...@ogren.info on 3 Feb 2011 at 6:31

GoogleCodeExporter commented 9 years ago
I like the getters/setters and am not worried about whether or not an extension 
is going to directly access the map or not (we could make the map private, 
too.) I don't like the idea that we have to initialize all of the data writers 
and classifiers in the initialize method.  If you train on 10M instances (maybe 
you are doing something semi-supervised) and test on 1K, then you might find 
yourself loading 100 classifiers that are never used for your 1000 test 
instances.  

One possible name is org.cleartk.classifier.multi.CleartkAnnotator (note 
package name).  Or org.cleartk.classifier.multi.MultiCleartkAnnotator.  Either 
way, I think it makes sense to put this code in its own package.  

Original comment by phi...@ogren.info on 3 Feb 2011 at 6:41

GoogleCodeExporter commented 9 years ago
I like this package/name combination best 
org.cleartk.classifier.multi.MultiCleartkAnnotator so that people don't look at 
examples and confuse CleartkAnnotator with this new code.

To do this correctly, it looks like I will probably need to write a new 
JarClassifierFactory that accepts a output directory and a name instead of just 
a single path variable.   Would these companion factories go in 
org.cleartk.classifier.multi, org.cleartk.classifer.jar or some new place?  
Also, I will need to add a createClassifier(String name) method to some 
ClassifierFactory interface.  Do I add to the existing ClassifierFactory 
interface or spin my own MultiClassifierFactory?

Original comment by lee.becker on 3 Feb 2011 at 8:08

GoogleCodeExporter commented 9 years ago
I agree that it is nice to have different names.  

I think a new MultiClassifierFactory should go in the 'multi' package.  

Your new JarClassifierFactory should probably go in a multi.jar package - but 
should still reuse as much as possible from the jar package.  

Original comment by phi...@ogren.info on 3 Feb 2011 at 8:58

GoogleCodeExporter commented 9 years ago
I agree that it is nice to have different names.  

I think a new MultiClassifierFactory should go in the 'multi' package.  

Your new JarClassifierFactory should probably go in a multi.jar package - but 
should still reuse as much as possible from the jar package.  

Original comment by phi...@ogren.info on 3 Feb 2011 at 8:59

GoogleCodeExporter commented 9 years ago
I am in the process of converting CleartkAnnotatorTest into unit tests for the 
CleartkMultiAnnotator.  However, I'm unsure about what to do with the test 
known as testDescriptor().

In CleartkAnnotatorTest, it receives ResourceInitializationExceptions that are 
thrown by the initialize() method because of missing a missing output directory 
or classifier jar file.  However with CleartkAnnotatorTest, the dataWriters and 
classifiers don't get created nor initialized until something calls 
getClassifier(name) or getDataWriter(name).  

Should I be trying to manually check these conditions in the initialize() 
method?  If so, how would I do that?

Alternatively, I can alter the unit test to check that this exception is thrown 
after a call to getClassifier or getDataWriter.

Original comment by lee.becker on 4 Feb 2011 at 4:33

GoogleCodeExporter commented 9 years ago
As I continue to plow down this path, I'm hitting some road blocks with when it 
comes to passing in a dataWriterFactory.  I wrote my own class derived from 
CleartkMultiAnnotator and specified the DefaultBinaryMalletDataWriterFactory 
for my DataWriterFactory, only to find that all of my instances were getting 
put into a single training-data.mallet file.

While I can specify the PARAM_OUTPUT_DIRECTORY as a configuration parameter, I 
have no way to update this within the getDataWriter() method as there is no 
method within DataWriterFactory to change the pre-specified output directory.

I believe this is less of an issue with classifiers, because I wrote a new 
class called JarMultiClassifierFactory with a method createClassifier(name), 
which can specify the path as needed.  I could probably do something similar 
and write a MultiDataWriterFactory, but then I would need to write a Multi 
version of every DataWriterFactory.  Alternatively, I could make it use a 
DirectoryDataWriterFactory, but that would prevent CleartkMultiAnnotator from 
working with several of the other types of DataWriters.

Hopefully one of you true ClearTK sages can point me in the right direction.

Original comment by lee.becker on 5 Feb 2011 at 5:10

GoogleCodeExporter commented 9 years ago
I see two solutions, neither of them perfect.

You could do something like what ViterbiDataWriterFactory does - cast its 
UimaContext to a UimaContextAdmin, set the output directory for the data writer 
you're delegating to, and then restore the original output directory when 
you're done. This is a little hacky because you aren't really supposed to use 
the UimaContextAdmin interface.

Alternatively, you could add getOutputDirectory and setOutputDirectory methods 
to DirectoryDataWriter. Then in CleartkMultiAnnotator, you'd just cast your 
DataWriter to DirectoryDataWriter and set the output directory that way. This 
is a little hacky because the cast means it won't work with general DataWriters 
(though it will probably work with all the ones you care about...)

Original comment by steven.b...@gmail.com on 5 Feb 2011 at 9:16

GoogleCodeExporter commented 9 years ago
Not really able to discern which is the lesser of two evils, I am inclined to 
use the second solution, but I think it's even hackier than you described.  For 
many dataWriters, much  of the configuration takes place in the constructor.  
Even if I was to add a setOutputDirectory() method, the call would occur too 
late to enact any change.  Playing with the code a bit, I've found that I can 
do an unchecked cast on my DataWriterFactory to a DirectoryDataWriter and then 
call its setOutputDirectory prior to making calls to createDataWriter().

Again this approach is hacky because it means the class won't work with general 
DataWriterFactories and consequently general DataWriters.

Original comment by lee.becker on 5 Feb 2011 at 7:53

GoogleCodeExporter commented 9 years ago
I thought about this a bit more, and I'm a little less worried about it not 
working with general DataWriterFactories because the moment you refer to 
PARAM_OUTPUT_DIRECTORY (which you *have* to do for your use case), you're 
already assuming a DirectoryDataWriterFactory. So I think casting to one is 
fine. Just document somewhere that this code assumes a 
DirectoryDataWriterFactory or subclass.

Original comment by steven.b...@gmail.com on 6 Feb 2011 at 9:37

GoogleCodeExporter commented 9 years ago
I do not see any reason that CleartkMultiAnnotator has to know any thing about 
the data writer factory except that it returns a data writer for a given 
type/label/name for the map of data writers.  The data writer factory for 
CleartkMultiAnnotator is going to be a new interface with one method:

public DataWriter<OUTCOME_TYPE> createDataWriter(String name)

The implementation should be able to delegate to existing data writer factories 
by updating the output directory.  That is, go ahead and add setOutputDirectory 
to the DirectoryDataWriter - but don't have CleartkMultiAnnotator call this 
method - have the MultiAnnotatorDataWriterFactory (or whatever it is called) do 
this.  This way MultiCleartkAnnotator is not coupled to DirectoryDataWriter.  

Original comment by phi...@ogren.info on 7 Feb 2011 at 5:07

GoogleCodeExporter commented 9 years ago
I'm not sure writing a MultiDataWriterFactory is necessarily the right 
approach.  Currently the dataWriterFactory is passed in as a configuration 
parameter to CleartkMultiAnnotator.  If I was to create a separate 
MultDataWriterFactory, I would have to create a Multi version for every 
DataWriterFactory, thus to use the DefaultBinaryMalletDataWriterFactory, I 
would have to create a DefaultBinaryMalletMultiDataWriterFactory that inherits 
from MultiDataWriterFactory.

Original comment by lee.becker on 7 Feb 2011 at 5:23

GoogleCodeExporter commented 9 years ago
Hmm.... it just occurred to me that I could create a MultiDataWriterFactory
that took as a regular DataWriterFactory class as a configuration parameter.
Does that make sense?  Is it possible to chain factories like that?

Original comment by lee.becker on 7 Feb 2011 at 5:23

GoogleCodeExporter commented 9 years ago
"The implementation should be able to delegate to existing data writer 
factories by updating the output directory" - use of PARAM_OUTPUT_DIRECTORY 
means that the MultiAnnotatorDataWriterFactory implementation will have to 
assume that the existing data writer factories have a PARAM_OUTPUT_DIRECTORY 
(i.e. that they're a subclass of DirectoryDataWriter). So you've just moved the 
assumption from MultiCleartkAnnotator to MultiAnnotatorDataWriterFactory. 
That's okay with me - but we should clearly document this assumption somewhere, 
regardless of whether it ends up in MultiCleartkAnnotator or 
MultiAnnotatorDataWriterFactory.

Original comment by steven.b...@gmail.com on 7 Feb 2011 at 5:30

GoogleCodeExporter commented 9 years ago
re 12 - yes, that makes sense - exactly what I was thinking.  Although I don't 
think of it as "chaining" - I think of it as delegation.  

I am fine with MultiAnnotatorDataWriterFactory being coupled with 
DirectoryDataWriter - much preferred over having MultiCleartkAnnotator.  

This really highlights the fact that what we should be careful naming these 
classes.  There's a few different names used above.  Here is my suggestion - 
name the interface MultiDataWriterFactory and name its one implementation 
MultiDirectoryDataWriterFactory.  

Original comment by phi...@ogren.info on 7 Feb 2011 at 5:57

GoogleCodeExporter commented 9 years ago
The code is checked into revision 2430.  Thanks for all the help and feedback.

Original comment by lee.becker on 7 Feb 2011 at 11:15

GoogleCodeExporter commented 9 years ago
Lee,

This is really great.  I have immediate use for this as I port my coordination 
code to cleartk-syntax-coordination.  

I have reopened this issue to address a few things:

1) You have touched on some of the real ugliness of Java generics in this (do 
you feel diseased!?)  I get this stuff messed up all the time.  I don't think 
you actually made a mistake, per se - but I would suggest making the member 
variables multiDataWriterFactory and multiClassifierFactory typed with 
OUTCOME_TYPE instead of '?'.  See the attached.  This causes one test to fail - 
but that's because the test is wrong with this version.  

2) We should revisit what exception getClassifier and getDataWriter should 
throw.  ResourceInitializationException seems conceptually correct.  However, 
it will always be called inside a subclass preprocess method and so it will 
have to get caught and an AnalysisEngineProcessException thrown.  So, I would 
vote that these methods throw AnalysisEngineProcessException.

3) What happens if the data writer factory is instantiated but getClassifier() 
is called?  I suppose it blows up because a NullPointerException is thrown.  I 
guess that is fine by me.  Still, we might consider throwing an exception 
instead to ease debugging.  

That's all I noticed.  I only looked at CleartkMultiAnnotator - so I have no 
critique of the other classes yet.  

Original comment by phi...@ogren.info on 8 Feb 2011 at 12:14

Attachments:

GoogleCodeExporter commented 9 years ago
I added license info to two files in r2431.

Original comment by phi...@ogren.info on 8 Feb 2011 at 12:17

GoogleCodeExporter commented 9 years ago

Original comment by phi...@ogren.info on 12 Feb 2012 at 5:05

GoogleCodeExporter commented 9 years ago
I think we may need to address comment #16 before closing this issue.  I'm 
reopening so that it won't get lost....

Original comment by phi...@ogren.info on 12 Feb 2012 at 10:01

GoogleCodeExporter commented 9 years ago

Original comment by steven.b...@gmail.com on 24 Jul 2012 at 5:58

GoogleCodeExporter commented 9 years ago
We should decide whether to keep this. If it's not useful, we may want to 
deprecate it.

Original comment by steven.b...@gmail.com on 31 Jul 2012 at 10:07

GoogleCodeExporter commented 9 years ago
Fixed in r3968.

Original comment by lee.becker on 6 Aug 2012 at 7:38