MojoJolo / textteaser

TextTeaser is an automatic summarization algorithm.
MIT License
1.97k stars, 251 forks

tutorial on adding additional language support #9

Closed by jasonswearingen 10 years ago

jasonswearingen commented 10 years ago

I see there are a couple of binary corpus files, but I don't see any info on how they are generated or how to add support for additional languages.

MojoJolo commented 10 years ago

The corpus files are used by OpenNLP to split text into sentences. You can find the equivalent files for other languages here: http://opennlp.sourceforge.net/models-1.5/

And if you want to create a model yourself, here are the instructions from OpenNLP: http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.sentdetect.training

jasonswearingen commented 10 years ago

Thanks for the info. I guess that with no NLP experience I won't be able to contribute, though :(

MojoJolo commented 10 years ago

@jasons-novaleaf NLP experience is not required. :) You can contribute by gathering articles in the language you choose. Split those articles into sentences, one sentence per line. The result can then be used as a corpus. The instructions for building a model from a corpus are easy to follow; they are at the OpenNLP link I posted above.
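
For anyone who wants to try this, here is a rough sketch of the corpus-preparation step in Python. It assumes you have already split each article into sentences by hand, and it writes them one sentence per line with a blank line between articles, which is the layout the OpenNLP manual describes for sentence detector training data. The function names and the output file name are made up for illustration; they are not part of TextTeaser or OpenNLP.

```python
from pathlib import Path


def build_training_text(articles):
    """Return training text: one sentence per line, blank line between articles.

    `articles` is a list of articles, each given as a list of sentence strings
    that were already split by hand (an assumption of this sketch).
    """
    blocks = []
    for sentences in articles:
        # Collapse stray whitespace and newlines so each sentence stays on one line.
        cleaned = [" ".join(sentence.split()) for sentence in sentences]
        cleaned = [sentence for sentence in cleaned if sentence]
        if cleaned:
            blocks.append("\n".join(cleaned))
    # A blank line separates one article from the next.
    return "\n\n".join(blocks) + "\n"


def write_training_file(articles, path):
    """Write the training text to `path` as UTF-8 (hypothetical helper)."""
    Path(path).write_text(build_training_text(articles), encoding="utf-8")


if __name__ == "__main__":
    sample = [
        ["Ito ang unang pangungusap.", "  Ito ang   pangalawa. "],
        ["Bagong artikulo ito."],
    ]
    # "tl-sent.train" is only an example file name.
    write_training_file(sample, "tl-sent.train")
```

The resulting file is what you would hand to the OpenNLP sentence detector trainer described at the link above; check the manual for the exact command and options for your OpenNLP version.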