Motivation
This is part of a small series of improvements aiming to increase the number of feature extractors available natively in Spark. Currently, most such extractors are part of the lefex or the "classical" jobimtext project and are thus only available in Hadoop.
At present, only trigrams can be computed. It would make sense to let users specify n in the n-gram model (the size of the context window).
Implementation
Add a command-line argument to https://github.com/uhh-lt/josimtext/blob/master/src/main/scala/de/uhh/lt/jst/dt/Text2TrigramTermContext.scala that specifies the size n of the left/right context. The default value n=1 corresponds to trigrams, which is the current behavior. For n=2 we obtain 5-grams, for n=3 we obtain 7-grams, etc. In general, the window covers 2n+1 tokens.
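To make the proposal concrete, here is a minimal sketch of how a window of n tokens on each side could be extracted. It is not the existing code of Text2TrigramTermContext: the object name `NGramContextSketch`, the function `termContexts`, and the `left_@_right` context format are assumptions made for illustration only, and the real job would apply something like this per sentence inside a Spark transformation.

```scala
object NGramContextSketch {
  // Hypothetical helper, not part of josimtext.
  // For every token that has n neighbours on both sides, emit a pair
  // (term, context), where the context is the n left tokens, a "@"
  // placeholder for the term itself, and the n right tokens, joined by "_".
  // The output format is an assumption chosen for this sketch.
  def termContexts(tokens: Seq[String], n: Int = 1): List[(String, String)] = {
    require(n >= 1, "n must be at least 1")
    val windowSize = 2 * n + 1
    tokens
      .sliding(windowSize)
      // sliding yields one shorter window if the input is too short; drop it
      .filter(_.size == windowSize)
      .map { window =>
        val left = window.take(n)
        val right = window.drop(n + 1)
        (window(n), (left ++ Seq("@") ++ right).mkString("_"))
      }
      .toList
  }
}
```

With the tokens `the quick brown fox jumps`, the default n=1 gives three trigram contexts (for example `quick` with `the_@_brown`), while n=2 gives a single 5-gram context: `brown` with `the_quick_@_fox_jumps`. Tokens closer than n positions to either end of the sentence get no context in this sketch; whether to pad the edges instead is left open.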