fmarten / JoSimText

A system for word sense induction and disambiguation based on the JoBimText approach

From trigrams to n-grams #5

Open alexanderpanchenko opened 7 years ago

alexanderpanchenko commented 7 years ago

Motivation

This is part of a small series of improvements that aim to increase the number of feature extractors available natively in Spark. Currently, most such extractors are part of lefex or the "classical" jobimtext project and are thus only available on Hadoop.

Currently, only trigrams can be computed. It would make sense to let users choose the size of the context window so that longer n-grams can be extracted as well.

Implementation

Add a command-line argument to https://github.com/uhh-lt/josimtext/blob/master/src/main/scala/de/uhh/lt/jst/dt/Text2TrigramTermContext.scala that specifies the size n of the left/right context. The default value n=1 corresponds to trigrams, which is the current behavior. For n=2 we obtain 5-grams, for n=3 we obtain 7-grams, etc. In general, a context size of n yields (2n+1)-grams.
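
To make the proposal concrete, here is a minimal sketch of the windowing logic in plain Scala, without Spark. It is not the existing code of Text2TrigramTermContext: the object name `NgramTermContext`, the function `termContexts`, the `@` placeholder and the `_` separator are all assumptions made for illustration. Presumably the same function could be applied per tokenized line inside a Spark `flatMap`.

```scala
object NgramTermContext {
  // Hypothetical placeholder marking where the target term sits in its context.
  val Hole = "@"

  /** Emits (term, context) pairs for every token that has n full neighbours
    * on both sides. With n = 1 each window is a trigram, with n = 2 a 5-gram,
    * and in general a (2n+1)-gram.
    */
  def termContexts(tokens: Seq[String], n: Int = 1): Seq[(String, String)] = {
    require(n >= 1, "context size n must be at least 1")
    val width = 2 * n + 1
    tokens
      .sliding(width)
      .filter(_.size == width) // drop the short window produced for short inputs
      .map { window =>
        val term = window(n) // the middle token
        val context =
          (window.take(n) ++ Seq(Hole) ++ window.drop(n + 1)).mkString("_")
        (term, context)
      }
      .toList
  }
}
```

Under these assumptions, `NgramTermContext.termContexts(Seq("a", "b", "c", "d", "e"), 2)` yields the single pair `("c", "a_b_@_d_e")`, while the default `n = 1` yields three trigram-based pairs for the same input. Tokens closer than n positions to the sentence boundary are skipped here. Whether they should be padded instead is a design choice left open in this sketch.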