Open GoogleCodeExporter opened 9 years ago
Original comment by pvogren@gmail.com
on 10 Jun 2009 at 10:13
Folks, it seems that this really needs to be done as part of the ClearTK project
reorg that is underway, so that example code and anything else that makes use of
sentence segmentation does not have to depend on the cleartk-syntax-opennlp project.
I recently implemented a sentence boundary detector using ClearTK for another
project. See
[http://code.google.com/p/biomedicus/source/browse/#svn/trunk/Biomedicus/src/main/java/edu/umn/biomedicus/sentence here]
for the code. I am considering either copying it over or implementing something
very similar. The way it works is
that a pattern based "sentence boundary" annotator goes through and creates
positive and negative sentence boundaries. Positive sentence boundaries will
later be used to create sentence annotations. A classifier-based "sentence
boundary" annotator is then used to classify periods. Negative sentence
boundaries produced by the pattern-based approach are ignored by the
classifier-based approach. Actually, it will ignore all sentence boundaries
created by the pattern-based approach. Finally, a sentence annotator creates
"sentence" annotations from the "sentence boundary" annotations.
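To make the three stages concrete, here is a minimal sketch in plain Java. It is not the Biomedicus or ClearTK code and uses no UIMA types. The class name, the two regex patterns, and the whitespace rule standing in for a trained classifier are all assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiPredicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the three-stage pipeline described above.
final class SentencePipelineSketch {
    // Assumed pattern: '?' or '!' followed by whitespace or end of text is a positive boundary.
    private static final Pattern POSITIVE = Pattern.compile("[?!](?=\\s|$)");
    // Assumed pattern: the period ending a few known abbreviations is a negative boundary.
    private static final Pattern NEGATIVE = Pattern.compile("\\b(?:Dr|Mr|Mrs|vs)\\.");

    /** Stage 1: pattern-based boundaries, keyed by the offset just after the punctuation. */
    static TreeMap<Integer, Boolean> patternBoundaries(String text) {
        TreeMap<Integer, Boolean> boundaries = new TreeMap<>();
        Matcher m = POSITIVE.matcher(text);
        while (m.find()) {
            boundaries.put(m.end(), true);
        }
        m = NEGATIVE.matcher(text);
        while (m.find()) {
            boundaries.put(m.end(), false);
        }
        return boundaries;
    }

    /** Stage 2: classify every period that stage 1 left undecided (positive or negative). */
    static void classifyPeriods(String text, TreeMap<Integer, Boolean> boundaries,
                                BiPredicate<String, Integer> classifier) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '.' && !boundaries.containsKey(i + 1)) {
                boundaries.put(i + 1, classifier.test(text, i));
            }
        }
    }

    /** Stage 3: build sentences from the positive boundaries only. */
    static List<String> sentences(String text, TreeMap<Integer, Boolean> boundaries) {
        List<String> result = new ArrayList<>();
        int begin = 0;
        for (Map.Entry<Integer, Boolean> entry : boundaries.entrySet()) {
            if (entry.getValue()) {
                String sentence = text.substring(begin, entry.getKey()).trim();
                if (!sentence.isEmpty()) {
                    result.add(sentence);
                }
                begin = entry.getKey();
            }
        }
        String tail = text.substring(begin).trim();
        if (!tail.isEmpty()) {
            result.add(tail);
        }
        return result;
    }

    static List<String> segment(String text) {
        TreeMap<Integer, Boolean> boundaries = patternBoundaries(text);
        // Stand-in for a trained classifier: a period is a boundary when it ends
        // the text or is followed by whitespace.
        classifyPeriods(text, boundaries,
                (t, i) -> i + 1 == t.length() || Character.isWhitespace(t.charAt(i + 1)));
        return sentences(text, boundaries);
    }

    public static void main(String[] args) {
        for (String s : segment("Dr. Smith arrived. Was he late? No.")) {
            System.out.println(s);
        }
    }
}
```

The one design point taken from the description is that stage 2 skips any offset stage 1 has already decided, so the classifier only ever sees the periods the patterns could not settle.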
I'm not sure what I should train the sentence segmenter on. I think using the Penn
Treebank is silly for a variety of reasons - in particular because whitespace
is not preserved. I am considering annotating sentences myself with data from
Project Gutenberg or Wikipedia - unless you have better suggestions.
Any suggestions about implementation or training data are welcome. Thanks!
Original comment by pvogren@gmail.com
on 1 Dec 2010 at 6:18
Note that the paper you reference requires tokenization to happen *before*
sentence segmentation, and says:
"First, proper tokenization is key. While there is not room to catalog our
tokenizer rules, we note that both untokenized text and mismatched train-test
tokenization can increase the error rate by a factor of 2."
The PennTreebankTokenizer assumes that it already has sentence boundaries. So
if we actually implement the referenced paper, we'd need to change the
tokenizer too.
As to corpora on which you could train, OpenANC has validated sentence
boundaries in some files:
http://www.americannationalcorpus.org/OANC/index.html#annotations
Also, you could choose to go the Treebank route, but also include the Brown
corpus part too so it's not so WSJ-ish:
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42
Or you could use both. ;-)
Original comment by steven.b...@gmail.com
on 1 Dec 2010 at 8:58
I am going to punt on this issue for a bit - at least until after I've finished
the project re-org. Instead I am going to provide a simple default sentence
annotator based on the java.text.BreakIterator that we can use for test
purposes.
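For reference, a BreakIterator-based splitter can be as small as the sketch below. This is not the annotator that ended up in ClearTK; the class and method names are made up, and it returns plain character offsets rather than creating UIMA annotations. It assumes Locale.US and trims surrounding whitespace from each span.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: sentence spans from java.text.BreakIterator.
final class BreakIteratorSentences {

    /** Returns {begin, end} character offsets (end exclusive) of each non-blank sentence. */
    static List<int[]> sentenceSpans(String text) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
        iterator.setText(text);
        List<int[]> spans = new ArrayList<>();
        int begin = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE; begin = end, end = iterator.next()) {
            // BreakIterator keeps trailing whitespace with the preceding sentence, so trim it.
            int b = begin;
            int e = end;
            while (b < e && Character.isWhitespace(text.charAt(b))) {
                b++;
            }
            while (e > b && Character.isWhitespace(text.charAt(e - 1))) {
                e--;
            }
            if (b < e) {
                spans.add(new int[] {b, e});
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "This is one sentence. This is another one.";
        for (int[] span : sentenceSpans(text)) {
            System.out.println(span[0] + "-" + span[1] + ": " + text.substring(span[0], span[1]));
        }
    }
}
```

In a UIMA annotator, each span would presumably become a sentence annotation over those offsets. No training data is needed, which is what makes this approach convenient for tests.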
Original comment by pvogren@gmail.com
on 28 Dec 2010 at 9:42
Original comment by pvogren@gmail.com
on 14 Jan 2011 at 9:19
Original comment by steven.b...@gmail.com
on 24 Jul 2012 at 8:18
Original comment by lee.becker
on 17 Feb 2013 at 5:09
Original comment by steven.b...@gmail.com
on 3 May 2013 at 8:44
Original comment by phi...@ogren.info
on 15 Mar 2014 at 5:41
Original issue reported on code.google.com by
pvogren@gmail.com
on 8 Jun 2009 at 2:00