Open bethard opened 9 years ago
Comment #1 originally posted by ClearTK on 2009-06-10T22:13:59.000Z:
<empty>
Comment #2 originally posted by ClearTK on 2010-12-01T18:18:34.000Z:
Folks, it seems that this really needs to be done as part of the cleartk project reorg that is underway, so that the example code and anything else that makes use of sentence segmentation does not have to depend on the cleartk-syntax-opennlp project.
I recently implemented a sentence boundary detector using ClearTK for another project. See [http://code.google.com/p/biomedicus/source/browse/#svn/trunk/Biomedicus/src/main/java/edu/umn/biomedicus/sentence here] for the code. I am considering either copying it over or implementing something very similar. It works in three steps. First, a pattern-based "sentence boundary" annotator goes through the text and creates positive and negative sentence boundaries. Positive sentence boundaries will later be used to create sentence annotations. Second, a classifier-based "sentence boundary" annotator classifies periods. It skips the negative sentence boundaries produced by the pattern-based annotator; in fact, it skips every sentence boundary the pattern-based annotator created, positive or negative. Finally, a sentence annotator creates "sentence" annotations from the "sentence boundary" annotations.
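To make the three steps concrete, here is a rough sketch in plain Java, with no UIMA or ClearTK types. It is not the Biomedicus code: the class name, the two regex patterns, and the toy "classifier" (a period followed by whitespace and an uppercase letter) are all assumptions made up for illustration, and a real version would plug a trained classifier in where the stand-in is.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiPredicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryPipelineSketch {

    // Hypothetical patterns: a known abbreviation never ends a sentence,
    // while '?' or '!' followed by whitespace or end of text always does.
    static final Pattern NEGATIVE = Pattern.compile("\\b(?:Dr|Mr|Mrs|vs)\\.");
    static final Pattern POSITIVE = Pattern.compile("[?!](?=\\s|$)");

    // Stand-in for a trained classifier: a period is a boundary if it ends the
    // text or is followed by whitespace and then an uppercase letter.
    static final BiPredicate<String, Integer> TOY_CLASSIFIER = (text, i) -> {
        int j = i + 1;
        if (j >= text.length()) return true;
        if (!Character.isWhitespace(text.charAt(j))) return false;
        while (j < text.length() && Character.isWhitespace(text.charAt(j))) j++;
        return j >= text.length() || Character.isUpperCase(text.charAt(j));
    };

    /** Step 1: pattern-based boundaries, keyed by the offset just after the punctuation. */
    static Map<Integer, Boolean> patternBoundaries(String text) {
        Map<Integer, Boolean> boundaries = new TreeMap<>();
        Matcher neg = NEGATIVE.matcher(text);
        while (neg.find()) boundaries.put(neg.end(), false);
        Matcher pos = POSITIVE.matcher(text);
        while (pos.find()) boundaries.put(pos.end(), true);
        return boundaries;
    }

    /** Step 2: classify only the periods that step 1 left unmarked. */
    static void classifyPeriods(String text, Map<Integer, Boolean> boundaries,
                                BiPredicate<String, Integer> classifier) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) != '.') continue;
            int offset = i + 1;
            if (boundaries.containsKey(offset)) continue; // already decided by a pattern
            boundaries.put(offset, classifier.test(text, i));
        }
    }

    /** Step 3: build sentences from the positive boundaries only. */
    static List<String> sentences(String text, Map<Integer, Boolean> boundaries) {
        List<String> result = new ArrayList<>();
        int begin = 0;
        for (Map.Entry<Integer, Boolean> e : boundaries.entrySet()) {
            if (!e.getValue()) continue; // negative boundaries never split
            String s = text.substring(begin, e.getKey()).trim();
            if (!s.isEmpty()) result.add(s);
            begin = e.getKey();
        }
        String tail = text.substring(begin).trim();
        if (!tail.isEmpty()) result.add(tail);
        return result;
    }

    static List<String> segment(String text) {
        Map<Integer, Boolean> boundaries = patternBoundaries(text);
        classifyPeriods(text, boundaries, TOY_CLASSIFIER);
        return sentences(text, boundaries);
    }

    public static void main(String[] args) {
        segment("Dr. Smith arrived. Was he late? No.").forEach(System.out::println);
    }
}
```

On the sample text, the negative boundary after "Dr." keeps the toy classifier from splitting there, which it otherwise would, since the period is followed by a space and a capital letter.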
I'm not sure what I should train the sentence segmenter on. I think using the Penn Treebank is silly for a variety of reasons - in particular because white space is not preserved. I am considering annotating sentences myself with data from Project Gutenberg or Wikipedia - unless you have better suggestions.
Any suggestions about implementation or training data are welcome. Thanks!
Comment #3 originally posted by ClearTK on 2010-12-01T20:58:16.000Z:
Note that the paper you reference requires tokenization to happen before sentence segmentation, and says:
"First, proper tokenization is key. While there is not room to catalog our tokenizer rules, we note that both untokenized text and mismatched train-test tokenization can increase the error rate by a factor of 2."
The PennTreebankTokenizer assumes that it already has sentence boundaries. So if we actually implement the referenced paper, we'd need to change the tokenizer too.
As to corpora on which you could train, OpenANC has validated sentence boundaries in some files:
http://www.americannationalcorpus.org/OANC/index.html#annotations
Also, you could go the Treebank route but include the Brown corpus part too, so it's not so WSJ-ish:
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42
Or you could use both. ;-)
Comment #4 originally posted by ClearTK on 2010-12-28T21:42:51.000Z:
I am going to punt on this issue for a bit - at least until after I've finished the project re-org. Instead I am going to provide a simple default sentence annotator based on the java.text.BreakIterator that we can use for test purposes.
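For reference, a minimal sketch of what the java.text.BreakIterator approach could look like. This is plain Java rather than the actual ClearTK annotator: the class and method names are made up for illustration, and a real annotator would create sentence annotations in the CAS from these offsets instead of returning them.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BreakIteratorSentences {

    /** Returns [begin, end) character offsets of each sentence, trimmed of surrounding whitespace. */
    static List<int[]> sentenceSpans(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<int[]> spans = new ArrayList<>();
        int begin = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; begin = end, end = it.next()) {
            // BreakIterator includes trailing whitespace in each segment, so trim it off.
            int b = begin;
            int e = end;
            while (b < e && Character.isWhitespace(text.charAt(b))) b++;
            while (e > b && Character.isWhitespace(text.charAt(e - 1))) e--;
            if (b < e) spans.add(new int[] {b, e});
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "This is one sentence. This is another one.";
        for (int[] span : sentenceSpans(text)) {
            System.out.println(span[0] + "-" + span[1] + ": " + text.substring(span[0], span[1]));
        }
    }
}
```

The locale is fixed to Locale.US here only to keep the sketch deterministic; BreakIterator is rule-based and knows nothing about domain-specific abbreviations, which is why it is only meant as a default for test purposes.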
Comment #5 originally posted by ClearTK on 2011-01-14T21:19:06.000Z:
<empty>
Comment #6 originally posted by ClearTK on 2012-07-24T20:18:26.000Z:
<empty>
Comment #7 originally posted by ClearTK on 2013-02-17T17:09:15.000Z:
<empty>
Original issue 99 created by ClearTK on 2009-06-08T14:00:05.000Z:
There was a nice paper at NAACL 2009 about sentence boundary detection that should be straightforward to implement for ClearTK. See:
http://www.icsi.berkeley.edu/pubs/speech/sbd_naacl_2009.pdf