fangfangli / cleartk

Automatically exported from code.google.com/p/cleartk

Provide DefaultDataWriterFactory #281

Closed by GoogleCodeExporter 9 years ago

GoogleCodeExporter commented 9 years ago
Richard asked why we always need a DefaultXXXDataWriterFactory for each 
machine learning model [1]. Lee and I talked this over, and I think we can 
simplify this by:

* Moving initialization of the default feature encoders into the XXXDataWriter 
classes.

* Creating a new DefaultDataWriterFactory that takes 
PARAM_DATA_WRITER_CLASS_NAME and PARAM_OUTPUT_DIRECTORY parameters, and creates 
the DataWriter directly.

* Making CleartkAnnotator and CleartkSequenceAnnotator use the new 
DefaultDataWriterFactory as the default value of 
PARAM_DATA_WRITER_FACTORY_CLASS_NAME
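
The proposed factory can be sketched roughly as follows. This is a hypothetical, simplified illustration: the `DataWriter` interface, the `PrintingDataWriter` class, and the `(File)` constructor convention are stand-ins invented here to keep the example self-contained, not the real ClearTK/UIMA API. Only the parameter names (`PARAM_DATA_WRITER_CLASS_NAME`, `PARAM_OUTPUT_DIRECTORY`) come from the proposal.

```java
import java.io.File;
import java.lang.reflect.Constructor;

// Minimal stand-in for ClearTK's DataWriter interface.
interface DataWriter {
  void write(String instance);
}

// A toy DataWriter used only to demonstrate the factory below.
class PrintingDataWriter implements DataWriter {
  private final File outputDirectory;
  public PrintingDataWriter(File outputDirectory) {
    this.outputDirectory = outputDirectory;
  }
  public void write(String instance) {
    System.out.println("wrote " + instance + " under " + outputDirectory.getPath());
  }
}

public class DefaultDataWriterFactory {
  // Parameter names as proposed in the issue.
  public static final String PARAM_DATA_WRITER_CLASS_NAME = "dataWriterClassName";
  public static final String PARAM_OUTPUT_DIRECTORY = "outputDirectory";

  // Create the DataWriter directly: load the class named by
  // PARAM_DATA_WRITER_CLASS_NAME and invoke its (File) constructor with
  // the PARAM_OUTPUT_DIRECTORY value.
  public static DataWriter createDataWriter(String className, File outputDirectory)
      throws Exception {
    Class<?> cls = Class.forName(className);
    Constructor<?> ctor = cls.getConstructor(File.class);
    return (DataWriter) ctor.newInstance(outputDirectory);
  }

  public static void main(String[] args) throws Exception {
    DataWriter writer = createDataWriter("PrintingDataWriter", new File("target/out"));
    writer.write("instance-1");
  }
}
```

The point of the design is that one generic, reflection-based factory replaces a hand-written DefaultXXXDataWriterFactory per classifier, since the feature-encoder defaults would now live in the XXXDataWriter classes themselves.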

Some benefits of this approach:

* It should be 100% backwards compatible

* Feature encoders will be set in the same class where they're actually used

* Implementors of new classifiers have to implement one fewer class

* We could easily support DataWriters that were Initializable

[1] https://groups.google.com/d/topic/cleartk-developers/n4P2hISQXdo/discussion

Original issue reported on code.google.com by steven.b...@gmail.com on 7 Feb 2012 at 7:19

GoogleCodeExporter commented 9 years ago
We agreed to go with this approach at the ClearTK day today.

Original comment by steven.b...@gmail.com on 12 Feb 2012 at 8:20

GoogleCodeExporter commented 9 years ago
Ok, one issue I'm running into with this is that it's no longer so easy for 
CleartkAnnotator to decide if it's training or predicting.

In the past, there was always a default classifier factory 
(JarClassifierFactory), but there was no default data writer factory. So we 
could tell if we were training by looking to see if a data writer factory had 
been specified.

With the approach in this issue, there will now always be both a default 
classifier factory (still JarClassifierFactory) and a default data writer 
factory (DefaultDataWriterFactory, or whatever we call it). So the old 
heuristic for guessing whether we were training or not will now fail.

I see a few solutions:

(1) Always force people to specify PARAM_IS_TRAINING. This would mean every 
creation of a CleartkAnnotator would require specifying an additional 
configuration parameter. This parameter would be conceptually redundant with 
the fact that you're specifying either 
JarClassifierFactory.PARAM_CLASSIFIER_JAR_PATH or 
DefaultDataWriterFactory.PARAM_DATA_WRITER_CLASS_NAME. But it wouldn't be 
technically redundant, because these are "implementation details" of 
JarClassifierFactory and DefaultDataWriterFactory that CleartkAnnotator doesn't 
necessarily know about.

(2) Have CleartkAnnotator check for the presence of 
JarClassifierFactory.PARAM_CLASSIFIER_JAR_PATH or 
DefaultDataWriterFactory.PARAM_DATA_WRITER_CLASS_NAME. This would keep user 
code simple, but would add special cases to CleartkAnnotator specifically for 
JarClassifierFactory and DefaultDataWriterFactory. (Of course, these two are 
what 99.9% of people are going to be using, so maybe it makes sense to special 
case them.)

(3) Don't create a DefaultDataWriterFactory, and instead have CleartkAnnotator 
itself take the PARAM_DATA_WRITER_CLASS_NAME. Then if either a 
DataWriterFactory or a DataWriter was specified, we'd know that we were 
training. But if we merge the DefaultDataWriterFactory functionality into 
CleartkAnnotator, for symmetry, it seems like we'd also want to merge the 
JarClassifierFactory functionality in there too.

Right now, I'm leaning towards (2) because, though (1) is probably the purest 
approach, (2) seems to be much more practical, and doesn't couple the factories 
with CleartkAnnotator like (3) would.
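
The option (2) heuristic could look something like the sketch below. This is an illustration only: a plain `Map` stands in for the UIMA context, and the class and method names are invented here; only the parameter names (`PARAM_IS_TRAINING`, `PARAM_CLASSIFIER_JAR_PATH`, `PARAM_DATA_WRITER_CLASS_NAME`) come from the discussion.

```java
import java.util.Map;

public class TrainingModeHeuristic {
  // Parameter names as discussed in the thread.
  static final String PARAM_IS_TRAINING = "isTraining";
  static final String PARAM_CLASSIFIER_JAR_PATH = "classifierJarPath";
  static final String PARAM_DATA_WRITER_CLASS_NAME = "dataWriterClassName";

  // Option (2): honor an explicit PARAM_IS_TRAINING if one was given;
  // otherwise special-case the two common factories and infer the mode
  // from which of their parameters is present.
  static boolean isTraining(Map<String, Object> params) {
    Object explicit = params.get(PARAM_IS_TRAINING);
    if (explicit != null) {
      return (Boolean) explicit;
    }
    boolean hasDataWriter = params.containsKey(PARAM_DATA_WRITER_CLASS_NAME);
    boolean hasClassifierJar = params.containsKey(PARAM_CLASSIFIER_JAR_PATH);
    if (hasDataWriter == hasClassifierJar) {
      // Neither or both present: the heuristic cannot decide.
      throw new IllegalStateException(
          "cannot infer training mode; please specify " + PARAM_IS_TRAINING);
    }
    return hasDataWriter;
  }

  public static void main(String[] args) {
    // A data-writer class name implies training; a classifier jar implies prediction.
    System.out.println(isTraining(Map.of(PARAM_DATA_WRITER_CLASS_NAME, "SomeDataWriter")));
    System.out.println(isTraining(Map.of(PARAM_CLASSIFIER_JAR_PATH, "model.jar")));
  }
}
```

Note that letting an explicit `PARAM_IS_TRAINING` take precedence is what keeps this compatible with option (1): users who dislike the inference can always state the mode directly.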

Original comment by steven.b...@gmail.com on 24 Apr 2012 at 3:29

GoogleCodeExporter commented 9 years ago
My first reaction is to recommend (1).  While your argument for (2) is true now, 
it seems quite possible that it will not remain true in the future.  I can 
imagine classifiers and data writers implemented in completely different ways 
outside of our current data-writer-to-file / classifier-from-jar paradigm.  For 
example, a data writer might be implemented as a client that sends messages 
(i.e., instances) to a server that is continuously training a model.  Something 
like that would probably not be handled by these params.  Also, is it really 
that onerous to set a single boolean parameter?  It might make the code 
clearer.

That said, as things are now, (2) probably makes the most sense.  We can 
circle back to this issue when we need to and make a change then.  
That's generally been our approach in the past.

Original comment by phi...@ogren.info on 25 Apr 2012 at 3:41

GoogleCodeExporter commented 9 years ago
Note that (2) doesn't prevent you from specifying PARAM_IS_TRAINING as in (1). 
So if you want to be explicit, you can still do so under (2).

> We could circle back to this issue when we need to and make a change then as 
necessary.

Yep. If we really feel like everyone should be specifying PARAM_IS_TRAINING 
(and we want to stop inferring it automatically), we can issue deprecation 
warnings for any path but the explicit PARAM_IS_TRAINING path.

Ok, I'll go ahead with (2).

Original comment by steven.b...@gmail.com on 25 Apr 2012 at 8:51

GoogleCodeExporter commented 9 years ago
This issue was closed by revision r3895.

Original comment by steven.b...@gmail.com on 25 Apr 2012 at 2:05
