MojoJolo / textteaser

TextTeaser is an automatic summarization algorithm.
MIT License
1.97k stars, 251 forks

Build a corpus for other languages. #15

Closed. MojoJolo closed this 9 years ago

MojoJolo commented 10 years ago

TextTeaser uses OpenNLP to split sentences, and OpenNLP requires a corpus for each language in order to split sentences properly.

Here's the list of languages that already have a corpus from OpenNLP: http://opennlp.sourceforge.net/models-1.5/

Only a limited number of languages are supported by OpenNLP's sentence detector. It would be good to support other languages too, e.g. Russian, Chinese, Japanese.

Here are the instructions for creating a sentence detector corpus: http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.sentdetect.training
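
For anyone picking this up, here is a rough sketch of preparing the training data. As far as I understand the OpenNLP manual linked above, the sentence detector trainer expects plain text with one sentence per line and an empty line between documents. The helper name `to_sentdetect_training_text`, the sample Russian sentences, and the file name `ru-sent.train` are all made up for illustration; none of this is part of TextTeaser or OpenNLP.

```python
from typing import Iterable, List


def to_sentdetect_training_text(documents: Iterable[List[str]]) -> str:
    """Render already-split documents as sentence detector training text.

    Hypothetical helper: each document is a list of sentences that a human
    (or an existing tool) has already split correctly.
    """
    blocks = []
    for sentences in documents:
        # Collapse internal whitespace so each sentence sits on exactly one line.
        lines = [" ".join(sentence.split()) for sentence in sentences]
        # Drop sentences that were empty or whitespace-only.
        lines = [line for line in lines if line]
        if lines:
            blocks.append("\n".join(lines))
    # An empty line between blocks marks a document boundary.
    return ("\n\n".join(blocks) + "\n") if blocks else ""


# Illustrative sample data only; a real corpus needs far more sentences.
sample_documents = [
    ["Это первое предложение.", "Это второе предложение."],
    ["Новый документ начинается здесь."],
]

training_text = to_sentdetect_training_text(sample_documents)
print(training_text, end="")

# To save it for the trainer (file name is an assumption):
# with open("ru-sent.train", "w", encoding="utf-8") as handle:
#     handle.write(training_text)
```

The resulting file would then be passed to the OpenNLP trainer as described in the manual. Note that languages written without spaces or with different sentence-ending punctuation (Chinese, Japanese) may need extra trainer settings that this sketch does not cover.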