UCREL / pymusas

Python Multilingual Ucrel Semantic Analysis System
https://ucrel.github.io/pymusas/
Apache License 2.0
30 stars 13 forks source link

Wildcard single word lexicon rule matches (Auto Tag) #28

Open apmoore1 opened 2 years ago

apmoore1 commented 2 years ago

To support wildcard (*) syntax for single word lexicon files. This would also be useful for rules like all punctuation tokens, which should be labelled as the semantic category PUNCT, for punctuation.

The wildcard symbol in this syntax would mean that zero or more characters may appear after the word token and/or Part Of Speech (POS) tag. This syntax will therefore hold the same meaning between single word and Multi Word Expression files.

Example

Assuming the single word lexicon file:

lemma    pos    semantic_tags
*kg   num     N3.5
*    punc    PUNCT

In the first case it would allow tagging anything that ended with kg, e.g. 15kg to be tagged as a measurement, the N3.5 semantic tag. In the second case it would label all punctuation with the punctuation semantic tag, PUNCT.