fangfangli / cleartk

Automatically exported from code.google.com/p/cleartk

chunker is WAY too complicated #302

GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
The current approach to Chunker is way too complicated:

(1) It forces you to learn an entirely new paradigm instead of leveraging 
existing understanding of the process(JCas) method like CleartkAnnotator and 
CleartkSequenceAnnotator do.

(2) It forces you to specify a ton of complicated parameters like 
PARAM_SEQUENCE_CLASS_NAME, PARAM_CHUNKER_FEATURE_EXTRACTOR_CLASS_NAME, etc.

(3) It forces you to use a CleartkSequenceAnnotator even if all you need is a 
CleartkAnnotator (you could use Viterbi to get around this, but then you have 
to learn all the Viterbi parameters).

I think we want a much simpler API that lets anyone throw chunk-labeling on top 
of whatever CleartkAnnotator or CleartkSequenceAnnotator they already have. 
Something like:

    Chunking chunking = new BIOChunking(Token.class, NamedEntityMention.class);
    ...
    // during training
    List<String> outcomes = chunking.toOutcomes(jCas, tokens, namedEntityMentions);
    ...
    // during prediction
    List<NamedEntityMention> namedEntityMentions = chunking.toChunks(jCas, tokens, outcomes);

This way, you'd write a classifier annotator just like normal, but use the 
Chunking object to help you convert to and from outcomes. If you wanted to 
switch to IO chunking, you'd just create an instance of IOChunking instead of 
BIOChunking.
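
To make the conversion concrete, here is a minimal standalone sketch of the BIO round trip. It is not the ClearTK implementation: the Span and BioChunkingSketch classes are hypothetical stand-ins for UIMA annotations and the proposed Chunking object, and the jCas argument is dropped so the code runs without UIMA. It also assumes chunks do not overlap and that each token lies inside at most one chunk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BioChunkingSketch {

  /** Hypothetical stand-in for a UIMA annotation: character offsets plus a type label. */
  static final class Span {
    final int begin;
    final int end;
    final String label;

    Span(int begin, int end, String label) {
      this.begin = begin;
      this.end = end;
      this.label = label;
    }

    @Override
    public String toString() {
      return label + "[" + begin + "," + end + ")";
    }
  }

  /** Training direction: produce one BIO outcome per token. */
  static List<String> toOutcomes(List<Span> tokens, List<Span> chunks) {
    List<String> outcomes = new ArrayList<>();
    Span previous = null; // chunk that covered the previous token, if any
    for (Span token : tokens) {
      Span covering = null;
      for (Span chunk : chunks) {
        if (chunk.begin <= token.begin && token.end <= chunk.end) {
          covering = chunk;
          break;
        }
      }
      if (covering == null) {
        outcomes.add("O");
      } else {
        // Same chunk object as the previous token means we are inside it.
        outcomes.add((covering == previous ? "I-" : "B-") + covering.label);
      }
      previous = covering;
    }
    return outcomes;
  }

  /** Prediction direction: merge runs of B-X, I-X, ... back into chunks. */
  static List<Span> toChunks(List<Span> tokens, List<String> outcomes) {
    List<Span> chunks = new ArrayList<>();
    int begin = -1;
    int end = -1;
    String label = null; // label of the chunk currently being built
    for (int i = 0; i < tokens.size(); i++) {
      String outcome = outcomes.get(i);
      boolean continues = label != null && outcome.equals("I-" + label);
      if (!continues) {
        if (label != null) {
          chunks.add(new Span(begin, end, label)); // close the open chunk
        }
        label = null;
        if (!outcome.equals("O")) {
          // Lenient choice: a stray I-X with no open chunk also starts a chunk.
          label = outcome.substring(2);
          begin = tokens.get(i).begin;
        }
      }
      if (label != null) {
        end = tokens.get(i).end;
      }
    }
    if (label != null) {
      chunks.add(new Span(begin, end, label));
    }
    return chunks;
  }

  public static void main(String[] args) {
    // Offsets for the text "John Smith visited Paris"
    List<Span> tokens = Arrays.asList(
        new Span(0, 4, "Token"), new Span(5, 10, "Token"),
        new Span(11, 18, "Token"), new Span(19, 24, "Token"));
    List<Span> mentions = Arrays.asList(
        new Span(0, 10, "PERSON"), new Span(19, 24, "LOCATION"));

    List<String> outcomes = toOutcomes(tokens, mentions);
    System.out.println(outcomes); // prints [B-PERSON, I-PERSON, O, B-LOCATION]
    System.out.println(toChunks(tokens, outcomes)); // prints [PERSON[0,10), LOCATION[19,24)]
  }
}
```

An IO variant would emit the same labels without the B- prefix. The trade-off is that two adjacent chunks of the same type can no longer be told apart on the way back.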

Original issue reported on code.google.com by steven.b...@gmail.com on 19 Apr 2012 at 3:26

GoogleCodeExporter commented 9 years ago
I've committed this API under org.cleartk.classifier.chunking. I'd like to 
deprecate the old chunker. What do you guys think?

For comparison, look at the changes in TimeAnnotator:

http://code.google.com/p/cleartk/source/browse/trunk/cleartk-timeml/src/main/java/org/cleartk/timeml/time/TimeAnnotator.java?r=3887

http://code.google.com/p/cleartk/source/browse/trunk/cleartk-timeml/src/main/java/org/cleartk/timeml/time/TimeAnnotator.java?r=3888

I think the biggest improvement is in just being able to write code like you 
would for any other CleartkSequenceAnnotator (that and getting rid of a crazy 
number of hard-to-understand UIMA parameters).

Original comment by steven.b...@gmail.com on 21 Apr 2012 at 12:35

GoogleCodeExporter commented 9 years ago
Steve,

This looks really great.  I was actually looking at the chunker recently and 
got discouraged because it was so complicated - and I wrote it!  This looks way 
easier and I am happy for you to deprecate the old approach.  

This might be a good time to rip out the chunk tokenizer, which I doubt anyone 
is using.  For something like this it might suffice to send out an email to the 
users list to see if anyone cares about it, and if no one responds, we simply 
remove it.

Original comment by phi...@ogren.info on 21 Apr 2012 at 6:01

GoogleCodeExporter commented 9 years ago
This issue was closed by revision r3889.

Original comment by steven.b...@gmail.com on 21 Apr 2012 at 6:53

GoogleCodeExporter commented 9 years ago
I deprecated the old chunker classes, as well as the chunk tokenizer. I've also 
opened Issue 303 to make sure that we eventually delete the chunk tokenizer. 
It's probably good practice to leave it in, deprecated, for one release before 
we rip it out.

Original comment by steven.b...@gmail.com on 21 Apr 2012 at 6:57

GoogleCodeExporter commented 9 years ago

Original comment by steven.b...@gmail.com on 5 Aug 2012 at 8:50