fmarten / JoSimText

A system for word sense induction and disambiguation based on JoBimText approach

Support of multiword expressions for the trigrams #8

Open alexanderpanchenko opened 7 years ago

alexanderpanchenko commented 7 years ago

Motivation

N-gram-based feature extraction should also support multiword expressions (MWEs). Otherwise only single-word terms can be represented, and important terms such as "ice cream" or "new york times" can never end up in a DT (distributional thesaurus).

Implementation

  1. Add an optional parameter to https://github.com/uhh-lt/josimtext/blob/master/src/main/scala/de/uhh/lt/jst/dt/Text2TrigramTermContext.scala that takes a vocabulary file as input, in the same format as https://github.com/uhh-lt/josimtext/blob/master/src/test/resources/voc-tiny.csv

  2. Generate features for all single words exactly as is done now and, in addition, generate features for every multiword expression from the input list that occurs in the text. Example:

Input text:

This ice cream is sweet.

Input MWE vocabulary:

ice cream

Features generated:

this _@_ice
ice this_@_cream
cream ice_@_is
is cream_@_sweet
ice cream this_@_is
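
The expected behaviour could be sketched roughly as follows. This is a small Python illustration of the idea, not the Scala code in Text2TrigramTermContext.scala; the function name `term_context_pairs` is hypothetical, and the boundary handling (padding a missing neighbour with an empty string) is an assumption. The MWE set is assumed to be lowercased already, with words separated by single spaces.

```python
from typing import List, Set, Tuple


def term_context_pairs(tokens: List[str], mwes: Set[str]) -> List[Tuple[str, str]]:
    """Return (term, left_@_right) pairs for single words and for known MWEs.

    Hypothetical helper: `mwes` is assumed to hold lowercased MWEs whose
    words are joined by single spaces.
    """
    toks = [t.lower() for t in tokens]
    n = len(toks)
    # Longest MWE (in words) bounds the span lengths we need to try.
    max_len = max((len(m.split()) for m in mwes), default=1)

    def context(start: int, end: int) -> str:
        # Neighbours of the span toks[start:end]; a missing neighbour
        # becomes an empty string (an assumption of this sketch).
        left = toks[start - 1] if start > 0 else ""
        right = toks[end] if end < n else ""
        return f"{left}_@_{right}"

    # Single-word features: one pair per token.
    pairs = [(toks[i], context(i, i + 1)) for i in range(n)]

    # Additional MWE features: every span whose lowercased form is in the set.
    for length in range(2, max_len + 1):
        for start in range(n - length + 1):
            candidate = " ".join(toks[start:start + length])
            if candidate in mwes:
                pairs.append((candidate, context(start, start + length)))
    return pairs
```

Called with the tokens `This ice cream is sweet` and the MWE set `{"ice cream"}`, the sketch yields the five pairs listed above. Because of the padding assumption it also emits a pair for the final word (`sweet`, `is_@_`), which the example does not list.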

Multiword expressions loaded from the input file can be lowercased, and a lowercased match in the text is sufficient: to check for a match, lowercase the text sequence and look it up in the dictionary of loaded MWEs.
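
A minimal sketch of this loading and matching step, again in Python for illustration only. The names `load_mwes` and `is_mwe` are hypothetical, and the sketch assumes one expression per line; the actual layout of voc-tiny.csv may differ. Skipping single-word entries is also an assumption, made because single words already get features.

```python
from typing import Iterable, List, Set


def load_mwes(lines: Iterable[str]) -> Set[str]:
    """Lowercase each vocabulary line and keep only multiword entries.

    Assumes one expression per line (hypothetical format).
    """
    mwes = set()
    for line in lines:
        # Lowercase and normalise whitespace so lookups are consistent.
        entry = " ".join(line.lower().split())
        if " " in entry:  # keep entries with at least two words
            mwes.add(entry)
    return mwes


def is_mwe(span: List[str], mwes: Set[str]) -> bool:
    """Check whether a token span, once lowercased, is a loaded MWE."""
    return " ".join(t.lower() for t in span) in mwes
```

With this approach the text span `Ice Cream` or `ICE cream` matches a vocabulary entry `ice cream`, regardless of how the entry was cased in the file.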