UCREL / Multilingual-USAS

Lexicons for the Multilingual UCREL Semantic Analysis System

POS tagsets used in the lexicon files #4

Open apmoore1 opened 2 years ago

apmoore1 commented 2 years ago

Some of the lexicons have POS tags, and we want to establish whether they all share the same POS tagset or use different ones. Once this is established, we shall document the POS tagset that each lexicon uses. This will also allow us to create a POS tag checker to ensure that each lexicon file only contains POS tags that are valid for the tagset associated with that lexicon.
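
A minimal sketch of what such a checker could look like. It assumes each lexicon is a tab-separated file with a header row that includes a `pos` column; the function name `find_invalid_pos_tags`, the column names, and the second sample row are hypothetical and not taken from the repository.

```python
import csv
import io
from typing import Iterable, List, Set, Tuple


def find_invalid_pos_tags(
    lines: Iterable[str], valid_tags: Set[str]
) -> List[Tuple[int, str]]:
    """Return (line_number, pos_tag) pairs for tags that are not in `valid_tags`.

    Assumption: tab-separated input with a header row containing a `pos` column.
    """
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    invalid = []
    # Line 1 is the header, so the first data row is line 2.
    for line_number, row in enumerate(reader, start=2):
        pos = (row.get("pos") or "").strip()
        if pos not in valid_tags:
            invalid.append((line_number, pos))
    return invalid


# The first data row is the Chinese entry quoted in this issue;
# the second is a made-up row with a deliberately misspelled tag.
sample = io.StringIO(
    "lemma\tpos\tsemantic_tags\n"
    "砰\tono\tW4 X3.2+ Q2.2 E3-\n"
    "example\tnouns\tZ99\n"
)
tagset = {"noun", "verb", "ono"}  # illustrative subset only
print(find_invalid_pos_tags(sample, tagset))  # [(3, 'nouns')]
```

Reporting line numbers rather than just the offending tags should make it easier to fix the lexicon file directly.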

apmoore1 commented 2 years ago

In the repository, there are 4 POS tagset files:

  1. Chinese - Subset of the USAS core tagset, apart from four POS tags:
    1. loc which represents localiser
    2. msr which represents measure word (this might map to numerals, but this POS tagset does already have a POS tag for numerals)
    3. ono which represents onomatopoeia - only one lexicon entry is marked with this in the single semantic lexicon: 砰 ono W4 X3.2+ Q2.2 E3-
    4. mark which represents marker
  2. Spanish - Subset of the USAS core tagset, apart from two POS tags:
    1. port which represents Portmanteau word. This POS tag only occurs in the MWE semantic lexicon, e.g. acercarse_fw al_port sol_noun que_pron más_adv calienta_verb S7 X5.2
    2. sys which represents symbols. However no entry in either the single or MWE lexicon uses this POS tag.
  3. Portuguese - Subset of the USAS core tagset
  4. Dutch - Subset of the USAS core tagset, apart from one POS tag, sent, which represents sentence marker. I think sent should be removed, as it does not feature in the Dutch semantic lexicon.
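
As a rough illustration of how the extra tags can be spotted, the sketch below pulls the POS tags out of an MWE template and reports those outside the USAS core tagset. It assumes each template token has the form `word_pos`, split on the last underscore, as in the Spanish example above, and that tags are lowercase in the lexicon files; `mwe_pos_tags` and `USAS_CORE_TAGS` are hypothetical names.

```python
from typing import List

# USAS core POS tags, lowercased (assumption: lexicon files use lowercase tags).
USAS_CORE_TAGS = {
    "noun", "verb", "adj", "adv", "num", "pnoun", "intj", "art", "part",
    "prep", "conj", "pron", "code", "punc", "fw", "abbrev", "lett", "xx", "det",
}


def mwe_pos_tags(template: str) -> List[str]:
    """Extract the POS tag of every token in an MWE template."""
    tags = []
    for token in template.split():
        # Split on the LAST underscore so words containing "_" still parse.
        _word, sep, pos = token.rpartition("_")
        if not sep:
            raise ValueError(f"Token without a POS tag: {token!r}")
        tags.append(pos)
    return tags


template = "acercarse_fw al_port sol_noun que_pron más_adv calienta_verb"
tags = mwe_pos_tags(template)
print(tags)  # ['fw', 'port', 'noun', 'pron', 'adv', 'verb']
print(sorted(set(tags) - USAS_CORE_TAGS))  # ['port']
```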

apmoore1 commented 2 years ago

USAS Core POS Tagset

This tagset has come from table 5 of the Towards A Welsh Semantic Annotation System paper

  1. noun
  2. verb
  3. Adj – Adjective
  4. Adv – Adverb
  5. Num – Numerals
  6. Pnoun – Proper Noun
  7. Intj – Interjection
  8. Art – Article
  9. Part – Particle
  10. Prep – Preposition
  11. Conj – Conjunction
  12. Pron – Pronoun
  13. Code – Special code, e.g. Math symbols
  14. Punc – Punctuation
  15. Fw – Foreign Word
  16. Abbrev – Abbreviation
  17. Lett – Letter
  18. Xx – Unrecognized token
  19. DET

perayson commented 2 years ago

For Chinese, I will need to discuss this with Scott, and for Spanish with Antonio since it relates to his Grampal POS tagger I expect. Portuguese is fine if it is a subset as this is mappable. For Dutch, sent can be removed or replaced with the punctuation tag.

apmoore1 commented 2 years ago

I think we can safely remove sent for Dutch as it is not in any of the semantic lexicons.

apmoore1 commented 2 years ago

I think the Lett POS tag can be removed from the USAS core POS tagset, as it does not occur in any of the semantic lexicons (single-word or Multi Word Expression (MWE)).

apmoore1 commented 2 years ago

Other POS tags that might be of interest with regards to how often they occur:

Abbr occurs in:

  1. Spanish lexicon 2 times.
  2. Finnish lexicon 409 times.

Art occurs in:

  1. Spanish 75 times
  2. Italian 659 times
  3. Dutch 2 times.

Det occurs in:

  1. Spanish 1 time.
  2. Portuguese 289 times.
  3. French 86 times.
  4. Chinese 411 times.
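
Counts like these can be reproduced with a few lines of Python. This is only a sketch under the assumption that a lexicon is a tab-separated file with a `pos` header column; `count_pos_tags` and the sample rows are made up for illustration and are not real lexicon entries.

```python
import csv
import io
from collections import Counter
from typing import Iterable


def count_pos_tags(lines: Iterable[str]) -> Counter:
    """Count how often each POS tag occurs in a tab-separated lexicon."""
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter((row.get("pos") or "").strip() for row in reader)


# Made-up rows, for illustration only.
sample = io.StringIO(
    "lemma\tpos\tsemantic_tags\n"
    "a\tdet\tZ5\n"
    "b\tdet\tZ5\n"
    "c\tnoun\tZ99\n"
)
counts = count_pos_tags(sample)
print(counts["det"], counts["noun"], counts["lett"])  # 2 1 0
```

A `Counter` returns 0 for tags that never occur, which is convenient when checking whether a tag such as Lett is used at all.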

perayson commented 2 years ago

Lett may be in the English lexicon, since ZZ1 occurs in the CLAWS C7 POS tagset. I wonder what UD POS recommends instead?

apmoore1 commented 2 years ago

UD POS has neither Lett nor Abbr/Abbrev. I think in UD it comes down to what the full form of the Lett or Abbr is, e.g. if it represents a person or company, I assume it would be assigned a noun tag; this is what I took away from reading the UD notes on the SYM POS tag. Or perhaps they would be mapped to X? I am not an expert in POS tagging, so I will leave it to your much better judgement @perayson.

perayson commented 2 years ago

Ah, so CLAWS C5 tagset has ZZ0 for alphabetical symbol (http://ucrel.lancs.ac.uk/claws5tags.html) and Lou Burnard maps this to SYM (https://github.com/COST-ELTeC/Scripts/blob/master/posPipe/udpMap.py). I'm trying to think of counter examples, but I can't imagine separating SYM from Letter would help distinguish items semantically (which is the main point of the POS column for our purposes here). Probably need a longer look at this over all the languages at some point.
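
To make the mapping idea concrete, here is a small sketch of a tag-mapping table with a fallback. Only the `lett` to `SYM` entry follows the ZZ0 to SYM precedent mentioned above; the other entries, the fallback to `X`, and the names `USAS_CORE_TO_UPOS` and `to_upos` are illustrative assumptions, not the mapping used by PyMUSAS or any other tool.

```python
from typing import Dict

# Partial, hypothetical mapping from USAS core tags to UPOS tags.
USAS_CORE_TO_UPOS: Dict[str, str] = {
    "noun": "NOUN",
    "verb": "VERB",
    "adj": "ADJ",
    "punc": "PUNCT",
    "lett": "SYM",  # mirrors the ZZ0 -> SYM choice discussed above
}


def to_upos(usas_tag: str, default: str = "X") -> str:
    """Map a USAS core tag to a UPOS tag, falling back to `default` if unmapped."""
    return USAS_CORE_TO_UPOS.get(usas_tag.lower(), default)


print(to_upos("Lett"))    # SYM
print(to_upos("abbrev"))  # X (no entry in this sketch)
```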

apmoore1 commented 2 years ago

Pull Request #12 contains the generated POS tagset for each language. The format of these generated POS tagsets, and where to find them, is best explained in the Create Pos Tagsets section of the PR's README.

apmoore1 commented 2 years ago

As a side note, in the PyMUSAS library I have renamed the tagset from UD to UPOS, to reflect that the Part Of Speech tagset used in the Universal Dependencies treebanks is the Universal Part Of Speech (UPOS) tagset. This is best seen in the POS mapping part of the PyMUSAS library: https://ucrel.github.io/pymusas/api/pos_mapper