fangfangli / cleartk

Automatically exported from code.google.com/p/cleartk

tokenizer rewrite #275

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Our current default tokenizer 
(org.cleartk.token.tokenizer.PennTreebankTokenizer) is likely quite slow 
because it does its work with a long series of regular expressions.  I 
have a partially implemented idea for a performant tokenizer that implements 
the same tokenization rules encoded in those regular expressions.  The basic 
idea is this:

1) Tokenize using the icu4j break iterator.
2) Split any tokens that need splitting with a token splitter annotator.
3) Merge any tokens that need merging with a token merger annotator.

I have already implemented the first two of these in the biomedicus project; 
they are detailed here:

http://code.google.com/p/biomedicus/wiki/Tokenization

So, the task here would be to copy the existing annotators from there to 
here (or add a dependency), observing the copyright/license; add the token 
merger annotator; and provide a default configuration that works for PTB-style 
tokenization.
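To make the three-stage idea concrete, here is a minimal, self-contained sketch of the break-then-split-then-merge pipeline. It is not the cleartk or biomedicus implementation: it uses the JDK's java.text.BreakIterator as a stand-in for icu4j's com.ibm.icu.text.BreakIterator (the APIs are very similar), runs the stages as plain methods rather than UIMA annotators, and the split/merge rules shown (splitting the "n't" clitic, merging "." back onto a known abbreviation) are just illustrative examples of the kind of PTB-style rules the splitter and merger annotators would encode.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Sketch of the three-stage tokenizer: break, split, merge.
public class TokenizerSketch {

    // Stage 1: produce candidate tokens from the break iterator,
    // dropping whitespace-only segments.
    static List<String> breakTokens(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            String token = text.substring(start, end).trim();
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Stage 2: split tokens the break iterator kept whole but PTB
    // separates. Illustrative rule: detach a trailing "n't" clitic.
    static List<String> split(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (t.length() > 3 && t.endsWith("n't")) {
                out.add(t.substring(0, t.length() - 3));
                out.add("n't");
            } else {
                out.add(t);
            }
        }
        return out;
    }

    // Stage 3: merge tokens the break iterator separated but PTB
    // keeps together. Illustrative rule: reattach "." to a known
    // abbreviation.
    static List<String> merge(List<String> tokens) {
        List<String> abbrevs = Arrays.asList("Mr", "Mrs", "Dr");
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            int last = out.size() - 1;
            if (t.equals(".") && last >= 0 && abbrevs.contains(out.get(last))) {
                out.set(last, out.get(last) + ".");
            } else {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens =
                merge(split(breakTokens("Mr. Smith can't stop.")));
        System.out.println(tokens);
    }
}
```

The point of the ordering is that the break iterator does one fast, rule-free pass over the text, and the comparatively rare split/merge corrections touch only individual tokens, which is what should make this approach much faster than running a bank of regular expressions over the whole document.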

Original issue reported on code.google.com by phi...@ogren.info on 29 Jan 2012 at 9:40

GoogleCodeExporter commented 9 years ago
... and test that the new tokenization approach is much faster.

Original comment by phi...@ogren.info on 29 Jan 2012 at 9:40

GoogleCodeExporter commented 9 years ago

Original comment by steven.b...@gmail.com on 24 Jul 2012 at 5:45

GoogleCodeExporter commented 9 years ago

Original comment by phi...@ogren.info on 4 Aug 2012 at 6:08

GoogleCodeExporter commented 9 years ago

Original comment by lee.becker on 17 Feb 2013 at 5:16

GoogleCodeExporter commented 9 years ago

Original comment by steven.b...@gmail.com on 3 May 2013 at 8:44

GoogleCodeExporter commented 9 years ago

Original comment by phi...@ogren.info on 15 Mar 2014 at 5:41