Open apmoore1 opened 2 years ago
In the repository, there are 4 POS tagset files:
- loc, which represents localiser.
- msr, which represents measure word (this might map to numerals, but this POS tagset does already have a POS tag for numerals).
- ono, which represents onomatopoeia. Only one lexicon entry is marked with this in the single semantic lexicon: 砰 ono W4 X3.2+ Q2.2 E3-
- mark, which represents marker.
- port, which represents Portmanteau word. This POS tag only occurs in the MWE semantic lexicon, e.g. acercarse_fw al_port sol_noun que_pron más_adv calienta_verb S7 X5.2
- sys, which represents symbols. However no entry in either the single or MWE lexicon uses this POS tag.
- sent, which represents sentence marker.

The rest of the tagset is a subset of the USAS core tagset.
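To see which POS tags an MWE entry like the port example uses, something along these lines might help. It is only a rough sketch: the function name `split_mwe_entry` is made up for illustration, and it assumes that every token of the form word_pos belongs to the MWE template and that any token without an underscore is a semantic tag.

```python
from __future__ import annotations


def split_mwe_entry(entry: str) -> tuple[list[tuple[str, str]], list[str]]:
    """Split an MWE entry into (word, pos) pairs and trailing semantic tags.

    Assumption: tokens written as word_pos are part of the MWE template and
    tokens without an underscore are semantic tags.
    """
    word_pos_pairs: list[tuple[str, str]] = []
    semantic_tags: list[str] = []
    for token in entry.split():
        if "_" in token:
            # rpartition splits on the last underscore, so a word that itself
            # contains an underscore is kept whole.
            word, _, pos = token.rpartition("_")
            word_pos_pairs.append((word, pos))
        else:
            semantic_tags.append(token)
    return word_pos_pairs, semantic_tags


# The MWE entry quoted in this issue.
entry = "acercarse_fw al_port sol_noun que_pron más_adv calienta_verb S7 X5.2"
pairs, tags = split_mwe_entry(entry)
pos_tags = [pos for _, pos in pairs]
print(pos_tags)  # ['fw', 'port', 'noun', 'pron', 'adv', 'verb']
print(tags)      # ['S7', 'X5.2']
```

Collecting `pos_tags` over every entry in an MWE lexicon would give the set of POS tags that lexicon actually uses.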
For Chinese, I will need to discuss this with Scott, and for Spanish with Antonio since it relates to his Grampal POS tagger I expect. Portuguese is fine if it is a subset as this is mappable. For Dutch, sent can be removed or replaced with the punctuation tag.
I think we can safely remove sent for Dutch as it is not in any of the semantic lexicons.
USAS Core Tagset
This tagset has come from table 5 of the Towards A Welsh Semantic Annotation System paper
- noun
- verb
- Adj – Adjective
- Adv – Adverb
- Num – Numerals
- Pnoun – Proper Noun
- Intj – Interjection
- Art – Article
- Part – Particle
- Prep – Preposition
- Conj – Conjunction
- Pron – Pronoun
- Code – Special code, e.g. Math symbols
- Punc – Punctuation
- Fw – Foreign Word
- Abbrev – Abbreviation
- Lett – Letter
- Xx – Unrecognized token
- DET
I think the Lett POS tag can be removed from this list as it does not occur in any of the semantic lexicons (single and Multi Word Expression (MWE)).
Other POS tags that might be of interest with regards to how often they occur:
Abbr occurs in:
Art occurs in:
Det occurs in:
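For reference, counts like these could be produced with a small script along the following lines. This is a sketch only: `count_pos_tags` is a hypothetical helper, it assumes single word lexicon entries are whitespace-separated with the POS tag in the second field (as in the 砰 entry quoted in this issue), and every sample row apart from the 砰 entry is invented for illustration.

```python
from collections import Counter


def count_pos_tags(lexicon_lines):
    """Count how often each POS tag occurs in single word lexicon lines.

    Assumption: fields are whitespace-separated and the POS tag is the
    second field. Lines with fewer than two fields are skipped.
    """
    counts = Counter()
    for line in lexicon_lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        # Lower-case so that e.g. "Noun" and "noun" are counted together.
        counts[fields[1].lower()] += 1
    return counts


sample_lines = [
    "砰 ono W4 X3.2+ Q2.2 E3-",  # entry quoted in this issue
    "example noun Z99",           # invented row
    "another Noun Z99",           # invented row
    "",                           # blank lines are ignored
]
counts = count_pos_tags(sample_lines)
print(counts["noun"])  # 2
print(counts["lett"])  # 0, a Counter returns 0 for tags it never saw
```

A tag whose count is 0 across all lexicons of a language would be a candidate for removal from that language's tagset.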
It may be in the English lexicon since ZZ1 occurs in the CLAWS C7 POS tagset. I wonder what UD POS recommends instead?
UD POS does not have Lett nor Abbr/Abbrev. I think in UD it comes down to what the full form of the Lett or Abbr is, e.g. does it represent a person or company, in which case I assume it would be assigned a noun. This is what I took away from reading the SYM POS tag notes from UD. Or perhaps they would be mapped to X? I am not an expert in POS tagging so I will leave it to your much better judgement @perayson
Ah, so CLAWS C5 tagset has ZZ0 for alphabetical symbol (http://ucrel.lancs.ac.uk/claws5tags.html) and Lou Burnard maps this to SYM (https://github.com/COST-ELTeC/Scripts/blob/master/posPipe/udpMap.py). I'm trying to think of counter examples, but I can't imagine separating SYM from Letter would help distinguish items semantically (which is the main point of the POS column for our purposes here). Probably need a longer look at this over all the languages at some point.
Pull Request #12 contains the generated POS tagset per language. The format of these generated POS tagsets and where to find them is best explained in the Create Pos Tagsets section of the PR's README.
Just a bit of a side note, in the PyMUSAS library I have changed the name of the tagset from UD to UPOS to reflect that the Part Of Speech tagset used in the Universal Dependencies Treebank is the Universal Part Of Speech (UPOS) tagset. This can be seen best in the pos mapping part of the PyMUSAS library: https://ucrel.github.io/pymusas/api/pos_mapper
Some of the lexicons have POS tags, and we want to see whether they all share the same POS tagset or use different ones. Once this is established we shall document the POS tagset that each lexicon uses. This will also allow us to create a POS tag checker to ensure that the lexicon files only contain valid POS tags given the tagset that is associated with that lexicon.
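As a rough idea of what such a POS tag checker could look like, here is a minimal sketch. It is not the actual implementation: `find_invalid_pos_tags` is a hypothetical name, the tagset below is a made-up subset, the second sample row is invented, and it assumes whitespace-separated single word lexicon entries with the POS tag in the second field.

```python
def find_invalid_pos_tags(lexicon_lines, valid_tagset):
    """Return (line_number, pos_tag) for each entry whose POS tag is invalid.

    Assumption: fields are whitespace-separated and the POS tag is the
    second field. Line numbers start at 1.
    """
    invalid = []
    for line_number, line in enumerate(lexicon_lines, start=1):
        fields = line.split()
        if len(fields) < 2:
            # Blank or malformed line, nothing to check here.
            continue
        pos_tag = fields[1]
        if pos_tag not in valid_tagset:
            invalid.append((line_number, pos_tag))
    return invalid


# Hypothetical tagset for one lexicon; the real tagsets are the ones
# documented per language.
valid_tagset = {"noun", "verb", "ono"}
lexicon_lines = [
    "砰 ono W4 X3.2+ Q2.2 E3-",  # entry quoted in this issue
    "example lett Z99",           # invented row with a tag outside the tagset
]
problems = find_invalid_pos_tags(lexicon_lines, valid_tagset)
print(problems)  # [(2, 'lett')]
```

Returning the line numbers rather than failing on the first problem would let the checker report every invalid entry in a lexicon file in one run.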