ClearTK / cleartk

Machine learning components for Apache UIMA
http://cleartk.github.io/cleartk/

sentence boundary detector #97

Open bethard opened 9 years ago

bethard commented 9 years ago

Original issue 99 created by ClearTK on 2009-06-08T14:00:05.000Z:

There was a nice paper at NAACL 2009 about sentence boundary detection that should be straightforward to implement for ClearTK. See:

http://www.icsi.berkeley.edu/pubs/speech/sbd_naacl_2009.pdf

bethard commented 9 years ago

Comment #1 originally posted by ClearTK on 2009-06-10T22:13:59.000Z:

<empty>

bethard commented 9 years ago

Comment #2 originally posted by ClearTK on 2010-12-01T18:18:34.000Z:

Folks, it seems that this really needs to be done for the cleartk project reorg underway so that a variety of example code, etc. that makes use of sentence segmentation does not have to depend on the cleartk-syntax-opennlp project.

I recently implemented a sentence boundary detector using ClearTK for another project. The code is at http://code.google.com/p/biomedicus/source/browse/#svn/trunk/Biomedicus/src/main/java/edu/umn/biomedicus/sentence and I am considering either copying it over or implementing something very similar. It works as follows. First, a pattern-based "sentence boundary" annotator goes through the text and creates positive and negative sentence boundaries. A classifier-based "sentence boundary" annotator then classifies periods, ignoring every sentence boundary that the pattern-based annotator created, whether positive or negative. Finally, a sentence annotator creates "sentence" annotations from the positive "sentence boundary" annotations.
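To make the three stages concrete, here is a minimal plain-Java sketch of that flow. It is not the Biomedicus code and does not use the ClearTK or UIMA APIs. The class name, the `PeriodClassifier` interface, the two pattern rules (a `?` or `!` before whitespace is a positive boundary, a period between two digits is a negative one), and the toy classifier are all assumptions made up for illustration. A real implementation would plug a trained classifier in where the lambda is.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class BoundaryPipelineSketch {

    /** Hypothetical stand-in for a trained classifier over period characters. */
    interface PeriodClassifier {
        boolean endsSentence(String text, int periodOffset);
    }

    /**
     * Stage 1: pattern-based boundaries. Maps the offset of a punctuation
     * character to true (positive boundary) or false (negative boundary).
     * Both rules are illustrative assumptions, not rules from the original code.
     */
    static Map<Integer, Boolean> patternBoundaries(String text) {
        Map<Integer, Boolean> boundaries = new TreeMap<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean atEnd = i + 1 == text.length();
            if ((c == '?' || c == '!') && (atEnd || Character.isWhitespace(text.charAt(i + 1)))) {
                boundaries.put(i, true);   // "?" or "!" before whitespace or end of text
            } else if (c == '.' && i > 0 && !atEnd
                    && Character.isDigit(text.charAt(i - 1))
                    && Character.isDigit(text.charAt(i + 1))) {
                boundaries.put(i, false);  // decimal point, e.g. "3.14"
            }
        }
        return boundaries;
    }

    /** Stage 2: classify only those periods that stage 1 did not already mark. */
    static void classifyPeriods(String text, Map<Integer, Boolean> boundaries,
                                PeriodClassifier classifier) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '.' && !boundaries.containsKey(i)) {
                boundaries.put(i, classifier.endsSentence(text, i));
            }
        }
    }

    /** Stage 3: cut the text after each positive boundary, in offset order. */
    static List<String> sentences(String text, Map<Integer, Boolean> boundaries) {
        List<String> result = new ArrayList<>();
        int start = 0;
        for (Map.Entry<Integer, Boolean> entry : boundaries.entrySet()) {
            if (entry.getValue()) {
                String sentence = text.substring(start, entry.getKey() + 1).trim();
                if (!sentence.isEmpty()) {
                    result.add(sentence);
                }
                start = entry.getKey() + 1;
            }
        }
        String tail = text.substring(start).trim();
        if (!tail.isEmpty()) {
            result.add(tail);  // trailing text with no closing boundary
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Pi is about 3.14. Is that right? Yes.";
        Map<Integer, Boolean> boundaries = patternBoundaries(text);
        // Toy classifier: a period ends a sentence if whitespace or end of text follows.
        PeriodClassifier toy =
                (t, i) -> i + 1 == t.length() || Character.isWhitespace(t.charAt(i + 1));
        classifyPeriods(text, boundaries, toy);
        sentences(text, boundaries).forEach(System.out::println);
    }
}
```

Keeping the boundaries in one offset-keyed map is just one way to let the classifier stage skip anything the pattern stage already decided. In a UIMA pipeline the same information would presumably live in annotations in the CAS.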

I'm not sure what I should train the sentence segmenter on. I think using the Penn Treebank is silly for a variety of reasons, in particular because white space is not preserved. I am considering annotating sentences myself with data from Project Gutenberg or Wikipedia, unless you have better suggestions.

Any suggestions about implementation or training data are welcome. Thanks!

bethard commented 9 years ago

Comment #3 originally posted by ClearTK on 2010-12-01T20:58:16.000Z:

Note that the paper you reference requires tokenization to happen before sentence segmentation, and says:

"First, proper tokenization is key. While there is not room to catalog our tokenizer rules, we note that both untokenized text and mismatched train-test tokenization can increase the error rate by a factor of 2."

The PennTreebankTokenizer assumes that it already has sentence boundaries. So if we actually implement the referenced paper, we'd need to change the tokenizer too.

As to corpora on which you could train, OpenANC has validated sentence boundaries in some files:

http://www.americannationalcorpus.org/OANC/index.html#annotations

Alternatively, you could go the Treebank route but include the Brown corpus portion as well, so it's not so WSJ-ish:

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42

Or you could use both. ;-)

bethard commented 9 years ago

Comment #4 originally posted by ClearTK on 2010-12-28T21:42:51.000Z:

I am going to punt on this issue for a bit - at least until after I've finished the project re-org. Instead I am going to provide a simple default sentence annotator based on the java.text.BreakIterator that we can use for test purposes.
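For reference, a default splitter along those lines could look roughly like the sketch below. It only shows the `java.text.BreakIterator` part and returns character offsets. The class and method names are made up for illustration, and wrapping the offsets in ClearTK sentence annotations inside a UIMA annotator is left out.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class BreakIteratorSentences {

    /** Returns the [begin, end) offsets of each non-blank sentence in the text. */
    static List<int[]> sentenceSpans(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(locale);
        iterator.setText(text);
        List<int[]> spans = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE;
                start = end, end = iterator.next()) {
            // BreakIterator spans include trailing whitespace; trim both ends.
            int begin = start;
            int stop = end;
            while (begin < stop && Character.isWhitespace(text.charAt(begin))) {
                begin++;
            }
            while (stop > begin && Character.isWhitespace(text.charAt(stop - 1))) {
                stop--;
            }
            if (begin < stop) {
                spans.add(new int[] {begin, stop});
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "This is one sentence. Here is another one! Is this the third?";
        for (int[] span : sentenceSpans(text, Locale.US)) {
            System.out.println(span[0] + "-" + span[1] + ": " + text.substring(span[0], span[1]));
        }
    }
}
```

This needs no training data or model files, which is what makes it convenient for tests. It is rule-based, though, so it should not be expected to match a trained detector on abbreviations and similar hard cases.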

bethard commented 9 years ago

Comment #5 originally posted by ClearTK on 2011-01-14T21:19:06.000Z:

<empty>

bethard commented 9 years ago

Comment #6 originally posted by ClearTK on 2012-07-24T20:18:26.000Z:

<empty>

bethard commented 9 years ago

Comment #7 originally posted by ClearTK on 2013-02-17T17:09:15.000Z:

<empty>

bethard commented 9 years ago

Comment #8 originally posted by ClearTK on 2013-05-03T08:44:33.000Z:

<empty>

bethard commented 9 years ago

Comment #9 originally posted by ClearTK on 2014-03-15T17:41:52.000Z:

<empty>