citiususc / Linguakit

Multilingual toolkit for NLP: dependency parser, PoS tagger, NERC, multiword extractor, sentiment analysis, etc.
GNU General Public License v3.0
64 stars, 22 forks

Added support for locutions #16

Closed: sdocio closed this 4 years ago

sdocio commented 4 years ago

Tokens that are part of a locution are now joined in the Splitter module. A resource file (locutions.txt) lists unambiguous locutions in the same format as the standard dictionary.
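For readers unfamiliar with the idea, here is a minimal Python sketch of joining locution tokens. It is not the Linguakit implementation. It assumes, purely for illustration, that each entry in the resource file is a whitespace-separated line whose first field is the locution with its tokens joined by `_`, and that a joined locution is emitted as a single token with `_` between its parts. The function names `load_locutions` and `join_locutions` are hypothetical.

```python
def load_locutions(lines):
    """Read dictionary-style lines into a set of token tuples.

    Assumption: the first whitespace-separated field of each line is the
    locution, with its tokens joined by '_'. Blank lines and lines that
    start with '#' are skipped.
    """
    locutions = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        form = line.split()[0]
        locutions.add(tuple(form.lower().split("_")))
    return locutions


def join_locutions(tokens, locutions):
    """Greedily merge the longest known locution starting at each position."""
    max_len = max((len(loc) for loc in locutions), default=1)
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first; only spans of 2+ tokens can merge.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in locutions:
                out.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No locution starts here: keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


if __name__ == "__main__":
    # Hypothetical entries; the tag column is illustrative only.
    sample = ["a_pesar_de a_pesar_de SPS00", "sin_embargo sin_embargo RG"]
    locs = load_locutions(sample)
    print(join_locutions("Sin embargo , vino a pesar de todo".split(), locs))
```

Matching the longest span first keeps a shorter locution from consuming the start of a longer one. Because the list holds only unambiguous locutions, no context check is needed before merging.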

Initial unit tests for the splitter are also included.