UCREL / pymusas

Python Multilingual Ucrel Semantic Analysis System
https://ucrel.github.io/pymusas/
Apache License 2.0

Rule Based Tagger, how to assign semantic tags when no POS information and two entries exist for the same word #9

Open apmoore1 opened 3 years ago

apmoore1 commented 3 years ago

Problem

The rule based tagger applies the rules defined in the USASRuleBasedTagger class. Suppose we look up a word/lemma/token in the lexicon and either the word's POS tag does not exist in the lookup or no POS tag information is given. If the word appears twice in the lexicon, with different POS tags and different semantic tags, what should the tagger do?

At the moment the tagger assigns the semantic tags from the entry that appears last in the lexicon lookup / lexicon TSV file. This is not a deliberate choice; it is a side effect of how Python builds a dictionary / hash map: when the same key is inserted more than once, the later value overwrites the earlier one.

Below is an example of the problem:

Take the French word sauf. The USAS French semantic lexicon has two entries for this word, as it has two possible POS tags. At the moment, if the Rule Based Tagger were given this word without any POS information, or with a POS tag that is not in the lexicon, it would assign the tags [A1.8-, Z5], as those are the tags of the last sauf entry in the lexicon.
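The overwriting behaviour can be reproduced in a few lines of plain Python. This is only a minimal sketch and not the tagger's actual code: the row layout, the POS labels `pos_a` / `pos_b`, and the first entry's tag `TAG_X` are placeholders made up for illustration. Only the tags of the last entry, [A1.8-, Z5], come from the description above.

```python
# Lexicon rows as (lemma, POS, semantic tags), in file order.
# "pos_a", "pos_b" and "TAG_X" are placeholders, not real lexicon values.
rows = [
    ("sauf", "pos_a", ["TAG_X"]),
    ("sauf", "pos_b", ["A1.8-", "Z5"]),
]

# Build a lookup keyed on the lemma alone (no POS information).
lemma_only_lookup = {}
for lemma, _pos, tags in rows:
    # Assigning to an existing key replaces its value,
    # so the last row for a lemma is the one that survives.
    lemma_only_lookup[lemma] = tags

print(lemma_only_lookup["sauf"])  # ['A1.8-', 'Z5']
```

Reordering the two rows would change the result, which is why the current behaviour depends on the order of the TSV file.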

Solutions

  1. Keep it as it is.
  2. Assign all of the semantic tags from all entries for that word. If we use this solution then we need to decide the order of the tags, e.g. which semantic tag should come first in the list and therefore be treated as the most likely semantic tag? Further, we need to ensure we do not duplicate semantic tags.
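To make solution 2 more concrete, here is a rough sketch of how the tags from all entries could be merged without duplicates. It assumes one possible ordering, namely the order in which tags first appear in the lexicon file, which is only one answer to the ordering question raised above. The function name `merge_tags`, the POS labels and the tag `TAG_X` are hypothetical and used purely for illustration.

```python
from collections import defaultdict


def merge_tags(rows):
    """Collect the tags of every entry per lemma, keeping first-seen order."""
    merged = defaultdict(list)
    for lemma, _pos, tags in rows:
        for tag in tags:
            # Skip tags already collected from an earlier entry.
            if tag not in merged[lemma]:
                merged[lemma].append(tag)
    return dict(merged)


# "pos_a", "pos_b" and "TAG_X" are placeholders, not real lexicon values.
rows = [
    ("sauf", "pos_a", ["TAG_X", "Z5"]),
    ("sauf", "pos_b", ["A1.8-", "Z5"]),
]

print(merge_tags(rows)["sauf"])  # ['TAG_X', 'Z5', 'A1.8-']
```

With this ordering the first entry in the file still decides the most likely tag, so the result remains sensitive to file order; a different ranking (for example by tag frequency) would need extra information that the lexicon rows in this sketch do not carry.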