fangfangli / cleartk

Automatically exported from code.google.com/p/cleartk

tokenizer rewrite #275

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Our current default tokenizer 
(org.cleartk.token.tokenizer.PennTreebankTokenizer) is likely quite slow 
because it does its work with a long series of regular expressions.  I 
have a partially implemented idea for a performant tokenizer that implements 
the same tokenization rules encoded in those regular expressions.  The basic 
idea is this:

1) Tokenize using the icu4j break iterator.
2) Split any tokens that need splitting with a token splitter annotator.
3) Merge any tokens that need merging with a token merger annotator.

I have already implemented the first two of these in the biomedicus project; 
they are detailed here:

http://code.google.com/p/biomedicus/wiki/Tokenization

So, the task here would be to copy the existing annotators from there to 
here (or add a dependency), observing the copyright/license; add the token 
merger annotator; and provide a default configuration that works for PTB-style 
tokenization.
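To make the three-stage idea concrete, here is a minimal, self-contained sketch of the break-then-split-then-merge pipeline. It is not the cleartk or biomedicus implementation: it uses the JDK's java.text.BreakIterator as a stand-in for icu4j's com.ibm.icu.text.BreakIterator (the APIs are very similar), runs the stages as plain methods rather than UIMA annotators, and the split/merge rules shown (splitting the "n't" clitic, merging "." back onto a known abbreviation) are just illustrative examples of the kind of PTB-style rules the splitter and merger annotators would encode.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Sketch of the three-stage tokenizer: break, split, merge.
public class TokenizerSketch {

    // Stage 1: produce candidate tokens from the break iterator,
    // dropping whitespace-only segments.
    static List<String> breakTokens(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            String token = text.substring(start, end).trim();
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Stage 2: split tokens the break iterator kept whole but PTB
    // separates. Illustrative rule: detach a trailing "n't" clitic.
    static List<String> split(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (t.length() > 3 && t.endsWith("n't")) {
                out.add(t.substring(0, t.length() - 3));
                out.add("n't");
            } else {
                out.add(t);
            }
        }
        return out;
    }

    // Stage 3: merge tokens the break iterator separated but PTB
    // keeps together. Illustrative rule: reattach "." to a known
    // abbreviation.
    static List<String> merge(List<String> tokens) {
        List<String> abbrevs = Arrays.asList("Mr", "Mrs", "Dr");
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            int last = out.size() - 1;
            if (t.equals(".") && last >= 0 && abbrevs.contains(out.get(last))) {
                out.set(last, out.get(last) + ".");
            } else {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens =
                merge(split(breakTokens("Mr. Smith can't stop.")));
        System.out.println(tokens);
    }
}
```

The point of the ordering is that the break iterator does one fast, rule-free pass over the text, and the comparatively rare split/merge corrections touch only individual tokens, which is what should make this approach much faster than running a bank of regular expressions over the whole document.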

Original issue reported on code.google.com by phi...@ogren.info on 29 Jan 2012 at 9:40

GoogleCodeExporter commented 9 years ago
... and test that the new tokenization approach is much faster.

Original comment by phi...@ogren.info on 29 Jan 2012 at 9:40

GoogleCodeExporter commented 9 years ago

Original comment by steven.b...@gmail.com on 24 Jul 2012 at 5:45

GoogleCodeExporter commented 9 years ago

Original comment by phi...@ogren.info on 4 Aug 2012 at 6:08

GoogleCodeExporter commented 9 years ago

Original comment by lee.becker on 17 Feb 2013 at 5:16

GoogleCodeExporter commented 9 years ago

Original comment by steven.b...@gmail.com on 3 May 2013 at 8:44

GoogleCodeExporter commented 9 years ago

Original comment by phi...@ogren.info on 15 Mar 2014 at 5:41