Closed apmoore1 closed 2 years ago
It turns out that spaCy does have a mapping between the Chinese Penn Treebank POS tagset and UPOS. I did not find it earlier because I was inspecting the output of nlp.analyze_pipes, whereas the mapping lives in the AttributeRuler patterns. This means we no longer need to write a mapping of our own, but I think adding spaCy's mapping to our code base would be a good idea. What do you think, @perayson? The spaCy mapping can be found like so:
# First you will need to download the Chinese spaCy model like so:
# python -m spacy download zh_core_web_sm
import spacy
nlp = spacy.load('zh_core_web_sm')
attribute_ruler = nlp.get_pipe('attribute_ruler')
for pattern in attribute_ruler.patterns:
    print(pattern)
This will output the following:
{'patterns': [[{'TAG': 'AS'}]], 'attrs': {'POS': 'PART', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'DEC'}]], 'attrs': {'POS': 'PART', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'DEG'}]], 'attrs': {'POS': 'PART', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'DER'}]], 'attrs': {'POS': 'PART', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'DEV'}]], 'attrs': {'POS': 'PART', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'ETC'}]], 'attrs': {'POS': 'PART', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'LC'}]], 'attrs': {'POS': 'PART', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'MSP'}]], 'attrs': {'POS': 'PART', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'SP'}]], 'attrs': {'POS': 'PART', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'BA'}]], 'attrs': {'POS': 'X', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'FW'}]], 'attrs': {'POS': 'X', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'IJ'}]], 'attrs': {'POS': 'INTJ', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'LB'}]], 'attrs': {'POS': 'X', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'ON'}]], 'attrs': {'POS': 'X', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'SB'}]], 'attrs': {'POS': 'X', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'X'}]], 'attrs': {'POS': 'X', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'URL'}]], 'attrs': {'POS': 'X', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'INF'}]], 'attrs': {'POS': 'X', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'NN'}]], 'attrs': {'POS': 'NOUN', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'NR'}]], 'attrs': {'POS': 'PROPN', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'NT'}]], 'attrs': {'POS': 'NOUN', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'VA'}]], 'attrs': {'POS': 'VERB', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'VC'}]], 'attrs': {'POS': 'VERB', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'VE'}]], 'attrs': {'POS': 'VERB', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'VV'}]], 'attrs': {'POS': 'VERB', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'CD'}]], 'attrs': {'POS': 'NUM', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'M'}]], 'attrs': {'POS': 'NUM', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'OD'}]], 'attrs': {'POS': 'NUM', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'DT'}]], 'attrs': {'POS': 'DET', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'CC'}]], 'attrs': {'POS': 'CCONJ', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'CS'}]], 'attrs': {'POS': 'SCONJ', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'AD'}]], 'attrs': {'POS': 'ADV', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'JJ'}]], 'attrs': {'POS': 'ADJ', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'P'}]], 'attrs': {'POS': 'ADP', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'PN'}]], 'attrs': {'POS': 'PRON', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': 'PU'}]], 'attrs': {'POS': 'PUNCT', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'TAG': '_SP'}]], 'attrs': {'POS': 'SPACE', 'MORPH': '_'}, 'index': 0}
{'patterns': [[{'IS_SPACE': True}]], 'attrs': {'TAG': '_SP', 'POS': 'SPACE'}, 'index': 0}
From this we can see the mapping between the Chinese Penn Treebank tagset and UPOS, e.g. AS is equivalent to PART. This mapping differs slightly from the original UPOS mapping, which can be found here, because the original UPOS tagset has since been expanded through the Universal Dependencies project; the expanded UPOS tagset can be found here. The two mappings are otherwise similar, with the following differences:
Chinese Penn Treebank, spaCy Mapping, Original UPOS
IJ, INTJ, X
CS, SCONJ, CONJ
NR, PROPN, NOUN
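A comparison like the one in the table above can be reproduced with a small helper. This is only a sketch: `mapping_differences` is a hypothetical name, and the two dictionaries are short excerpts built from the rows shown in this thread (plus NN as an unchanged tag), not the full mappings.

```python
from typing import Dict, List, Tuple


def mapping_differences(ours: Dict[str, str],
                        theirs: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (tag, ours, theirs) for every shared tag whose target differs."""
    shared_tags = sorted(set(ours) & set(theirs))
    return [(tag, ours[tag], theirs[tag])
            for tag in shared_tags
            if ours[tag] != theirs[tag]]


# Excerpts only, taken from the rows in this thread; not the complete mappings.
spacy_excerpt = {'IJ': 'INTJ', 'CS': 'SCONJ', 'NR': 'PROPN', 'NN': 'NOUN'}
original_excerpt = {'IJ': 'X', 'CS': 'CONJ', 'NR': 'NOUN', 'NN': 'NOUN'}

for row in mapping_differences(spacy_excerpt, original_excerpt):
    print(row)
```

Tags that appear in only one of the two mappings are ignored here; they would need a separate set difference.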
The following tags are not in the original Chinese Penn Treebank POS tagset but are in the spaCy model, with the following mappings to UPOS:
spaCy Chinese Penn Treebank, Mapping
INF, X
URL, X
X, X
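The AttributeRuler patterns printed earlier can be collapsed into a plain tag-to-UPOS dictionary, which is easier to store in a code base. The sketch below assumes each entry has the shape shown in the output above; `patterns_to_tag_map` is a hypothetical helper name, and the sample list is a three-entry excerpt of that output so the example runs without the model installed.

```python
from typing import Any, Dict, Iterable


def patterns_to_tag_map(patterns: Iterable[Dict[str, Any]]) -> Dict[str, str]:
    """Collect TAG -> POS pairs from AttributeRuler-style pattern entries."""
    tag_to_upos: Dict[str, str] = {}
    for entry in patterns:
        pos = entry.get('attrs', {}).get('POS')
        if pos is None:
            continue
        for token_patterns in entry.get('patterns', []):
            # Keep only single-token patterns that match on a literal TAG;
            # entries such as {'IS_SPACE': True} are skipped.
            if len(token_patterns) != 1:
                continue
            tag = token_patterns[0].get('TAG')
            if isinstance(tag, str):
                tag_to_upos[tag] = pos
    return tag_to_upos


# Excerpt of the output shown above.
sample_patterns = [
    {'patterns': [[{'TAG': 'AS'}]], 'attrs': {'POS': 'PART', 'MORPH': '_'}, 'index': 0},
    {'patterns': [[{'TAG': 'NR'}]], 'attrs': {'POS': 'PROPN', 'MORPH': '_'}, 'index': 0},
    {'patterns': [[{'IS_SPACE': True}]], 'attrs': {'TAG': '_SP', 'POS': 'SPACE'}, 'index': 0},
]

print(patterns_to_tag_map(sample_patterns))
```

With the model downloaded, the same function should accept attribute_ruler.patterns from the snippet above in place of sample_patterns.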
I think that if we do add the Chinese Penn Treebank mappings to PyMUSAS, so that we have a map from the Chinese Penn Treebank tagset to the USAS core POS tagset, we should do it through the spaCy mapping, e.g. map from:
Chinese Penn Treebank -> spaCy UPOS mapping -> USAS core
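The chain above amounts to composing two dictionaries. The following is a minimal sketch, not the PyMUSAS implementation: `compose_mappings` is a hypothetical helper, and the USAS core values in `upos_to_usas_core_excerpt` are illustrative placeholders that should be replaced with the project's actual UPOS to USAS core mapping.

```python
from typing import Dict, List


def compose_mappings(tag_to_upos: Dict[str, str],
                     upos_to_usas: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each treebank tag to USAS core tags by going through UPOS."""
    missing = sorted({upos for upos in tag_to_upos.values()
                      if upos not in upos_to_usas})
    if missing:
        # Fail loudly rather than silently dropping treebank tags.
        raise KeyError(f'No USAS core mapping for UPOS tags: {missing}')
    return {tag: list(upos_to_usas[upos])
            for tag, upos in tag_to_upos.items()}


# Excerpt of the spaCy mapping shown in this thread.
cptb_to_upos_excerpt = {'AS': 'PART', 'NR': 'PROPN', 'VV': 'VERB'}
# Placeholder values (assumption): swap in the real UPOS to USAS core mapping.
upos_to_usas_core_excerpt = {
    'PART': ['part'],
    'PROPN': ['pnoun'],
    'VERB': ['verb'],
}

cptb_to_usas_core = compose_mappings(cptb_to_upos_excerpt,
                                     upos_to_usas_core_excerpt)
print(cptb_to_usas_core)
```

Raising on a missing UPOS tag is a design choice for this sketch; it makes gaps in the second mapping visible before the combined mapping is used for tagging.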
Great, this sounds good, please go ahead, and then I can ask Scott or others to sanity check the output.
The Chinese spaCy model outputs POS tags that come from the Chinese treebank tagset rather than the Universal POS tagset. To use the model's POS tagger with the USASRuleBasedTagger, and so make the most of the POS information in the Chinese USAS lexicon, we therefore need a mapping from the Chinese treebank tagset to the USAS core tagset.
One solution is to take the existing mapping from the Chinese treebank tagset to the Universal POS (UPOS) tagset, which can be found here, and swap the UPOS tags in that mapping for USAS core tags using our current UPOS to USAS core mapping.