Closed honnibal closed 8 years ago
Hi Matthew,
Can't speak for other languages and your question might be specifically targeting English, but FWIW this is what holds for Finnish:
LEMMA+XPOS+XFEAT -> POS+FEAT
and in two cases even LEMMA+XPOS+XFEAT+DEPREL -> POS+FEAT
(where X marks original data in the original treebank). Note that XFEAT is not present in the final UD file. The conversion is in the dev-ud branch of the Finnish-dep-parser GitHub project here: https://github.com/TurkuNLP/Finnish-dep-parser/tree/dev-ud/morpho-sd2ud It's not a simple lookup table, but a set of scripts which transform the output of the Finnish morphological analyzer (XPOS+XFEAT) into the UD versions POS+FEAT.
Best,
Filip
Thanks. I'm targeting multi-lingual. I'm hoping the simpler mapping will work for most languages, although I know some languages might need to work differently.
How does your lemmatization work? Do you go ORTH+XPOS -> LEMMA+XPOS+XFEAT?
Finnish lemmatization uses the two-level morphological analyzer OMorFi. That is distributed with the parser. So we go ORTH -> [morpho analyzer] -> several LEMMA+XPOS+XFEAT alternatives -> [bunch of scripts] -> several LEMMA+POS+FEAT alternatives. And then a CRF (Marmot) to disambiguate the competing readings.
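The pipeline described above can be sketched roughly as follows. This is only a toy illustration: the analyses, the tag names (`XN`, `XV`, `XSG`), and all function names are hypothetical placeholders, not OMorFi output or the actual conversion scripts, and a fixed preference table stands in for the CRF disambiguation step.

```python
# Toy stand-ins; none of this is real OMorFi or Marmot data.
TOY_ANALYSES = {
    # ORTH -> several (LEMMA, XPOS, XFEAT) alternatives
    "word": [("lemma_a", "XN", "XSG"), ("lemma_b", "XV", "XSG")],
}
XPOS_TO_POS = {"XN": "NOUN", "XV": "VERB"}      # hypothetical conversion rules
XFEAT_TO_FEAT = {"XSG": "Number=Sing"}          # hypothetical conversion rules
POS_PREFERENCE = {"NOUN": 0.7, "VERB": 0.3}     # stand-in for a trained disambiguator


def analyze(orth):
    """Stand-in for the morphological analyzer: all readings for a form."""
    return list(TOY_ANALYSES.get(orth, []))


def convert(reading):
    """Stand-in for the conversion scripts: LEMMA+XPOS+XFEAT -> LEMMA+POS+FEAT."""
    lemma, xpos, xfeat = reading
    return (lemma, XPOS_TO_POS[xpos], XFEAT_TO_FEAT[xfeat])


def disambiguate(readings):
    """Pick one reading; a real system would use context (e.g. a CRF)."""
    return max(readings, key=lambda r: POS_PREFERENCE[r[1]])


def tag(orth):
    readings = [convert(r) for r in analyze(orth)]
    return disambiguate(readings) if readings else None


print(tag("word"))
```

The point of the structure is that conversion happens on every competing reading before disambiguation, so the disambiguator only ever sees UD-style analyses.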
We were working on this for Kazakh at the TurkLang conference last week. Here is the spreadsheet we've been writing to be able to have a consistent mapping between our two standards (KNC, Apertium) and UD. Note that we conceive the mapping as unidirectional.
As Filip mentioned, in the worst case you do need (LEMMA)+XPOS+XFEAT+DEPREL (for example to make a pronoun/determiner distinction or noun/adjective).
@honnibal : These tables of yours could be quite useful and they can obviously improve over Interset. But as others mentioned, even with lemma it cannot be perfect. Beyond that, it is probably easier to write scripts than tables, and to consider the tree structure (that's what I do).
As for your specific questions, I would say that myself is Case=Acc. I would not mark case for mine because it can be used both as subject and object, without changing form. I wouldn't mark gender, number and case for you (the rule of thumb is: if your list of values contains all values available for the language (such as Number=Plur,Sing for English), then drop the feature entirely).
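As a minimal sketch of that rule of thumb: given the candidate values for each feature of a form, drop any feature whose candidates cover the language's whole inventory. The function name and the simplified English inventory below are assumptions made for illustration only.

```python
# Assumed, simplified inventory of feature values for English.
EN_INVENTORY = {
    "Number": {"Sing", "Plur"},
    "Case": {"Nom", "Acc"},
}


def drop_uninformative(feats, inventory):
    """Keep only features whose candidate values do not cover the full inventory."""
    kept = {}
    for name, values in feats.items():
        full = inventory.get(name)
        if full is not None and set(values) >= full:
            continue  # every value is possible, so the feature says nothing
        kept[name] = sorted(values)
    return kept


# "you": any number, any case -> both features are dropped.
you = {"PronType": {"Prs"}, "Person": {"2"},
       "Number": {"Sing", "Plur"}, "Case": {"Nom", "Acc"}}
# "myself": case is restricted, so it is kept.
myself = {"PronType": {"Prs"}, "Person": {"1"},
          "Number": {"Sing"}, "Case": {"Acc"}}

print(drop_uninformative(you, EN_INVENTORY))
print(drop_uninformative(myself, EN_INVENTORY))
```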
On the other hand, we have not settled on the scale between form and function as criteria for morphological (and POS) distinctions. For instance the German STTS tagset is very context-sensitive: the nouns inflect for case quite rarely (and usually the case is determined by the article), yet every noun has one of the four case values assigned. So if you are willing and able to disambiguate, you could distinguish between you that is Case=Nom and you that is Case=Acc. (In pure theory, the same could be done with Number, but I'm afraid that it would be often undecidable even for human annotators.)
Thanks all. It sounds like I'll need an additional process to add further morphological features after parsing for many languages, which I hadn't thought about.
As for your specific questions, I would say that myself is Case=Acc. I would not mark case for mine because it can be used both as subject and object, without changing form.
myself makes sense as Case=Acc, thanks. But I don't understand how mine can be a subject? I would've said it can only be used in a predicative context, or as the object of a preposition.
(the rule of thumb is: if your list of values contains all values available for the language (such as Number=Plur,Sing for English), then drop the feature entirely).
Okay, thanks. So, there's no distinction between "none of the above" and "any of the above", right?
Here's my current pronouns table for English:
https://github.com/honnibal/spaCy/blob/master/lang_data/en/morphs.json
I'm working on the auxiliaries now.
mine ... English is not my native language and it is quite possible that I am using it wrongly. I thought I could say something like: Your car is green. Mine is red. Is that ungrammatical?
Ah --- yeah, of course.
@fginter --- Thanks!!! That search engine is fantastically helpful.
For English, we have a translator from Penn Treebank trees to UPOS tags: https://github.com/stanfordnlp/CoreNLP/blob/master/data/edu/stanford/nlp/upos/ENUniversalPOS.tsurgeon . It's a Tsurgeon translation file. Unlike for Google's universal POS, there are a number of instances where a translation from XPOS just isn't possible without seeing syntactic context.
Seems like basically all questions are answered, to the extent they will be ... so closing this.
Hi all,
I've been working on moving spaCy (an NLP pipeline, http://spacy.io ) to the UD scheme for some time now. I'm currently trying to produce some mapping files, which I think might also be of use to others. Some of what I'm doing might exist already --- if so, I'd appreciate pointers to the resources :). If not, I'm looking for a little help in making sure I'm applying the annotation schemes correctly, particularly the morphological scheme.
First, as a point of terminology, is there a standard way to describe language/treebank-specific POS schemes, e.g. the VBZ, NNS etc scheme used in the PTB? For now, I'll call these tags "XPOS", for "extended POS tags". I'll reserve the term "POS tag" for one of the 17 UD POS tags.
As a second point of terminology, I'll call the text-field of an inflected token an "orthographic form" (as opposed to a lemma).
Often an XPOS tag maps to a single POS, and zero or more morphological features. I've found useful mapping tables like this in the Interset project. For my purposes, I'm formatting these as JSON files. I call this a "Tag Map": https://github.com/honnibal/spaCy/blob/master/lang_data/en/tag_map.json . Example entry:
Sometimes the XPOS doesn't map cleanly, e.g. the TO tag in the Penn Treebank. So I want to have another mapping file, which is keyed by an XPOS and an orthographic form. This is also necessary for exceptional forms in the language, which encode additional morphological features in their orthographic form that are not captured by their XPOS. Personal pronouns seem a common example. An example entry:
(Question: What's the favoured lemma for personal pronouns? I've gone with the special value "-PRON-" above, but it'd be nice to align this with what other people are doing.)
The token/XPOS will still fail to map cleanly in some cases. My idea is to simply transform the XPOS scheme given the Treebank parse, if it doesn't. The idea is to see the XPOS as an arbitrary detail of the statistical modelling: a place to jointly predict the part-of-speech and morphology, which rule-based post-processing then transforms into the UD format.
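To make the two-level lookup concrete, here is a rough sketch: a tag map keyed by XPOS, overridden by an exceptions table keyed by XPOS plus orthographic form. The entries, the dictionary layout, and the function name are hypothetical illustrations and are not copied from the actual tag_map.json or morphs.json files; only the "-PRON-" lemma comes from the description above.

```python
# Hypothetical tag map: XPOS -> POS plus default morphological features.
TAG_MAP = {
    "VBZ": {"pos": "VERB", "VerbForm": "Fin", "Tense": "Pres",
            "Number": "Sing", "Person": "3"},
    "NNS": {"pos": "NOUN", "Number": "Plur"},
    "PRP": {"pos": "PRON", "PronType": "Prs"},
}

# Hypothetical exceptions: (XPOS, lowercased orthographic form) -> extra features.
EXCEPTIONS = {
    ("PRP", "myself"): {"lemma": "-PRON-", "Person": "1", "Number": "Sing",
                        "Case": "Acc", "Reflex": "Yes"},
}


def analyse(orth, xpos):
    """Return (POS, features) for a token, letting exceptions override the tag map."""
    feats = dict(TAG_MAP[xpos])                      # copy the XPOS defaults
    feats.update(EXCEPTIONS.get((xpos, orth.lower()), {}))
    pos = feats.pop("pos")
    return pos, feats


print(analyse("dogs", "NNS"))
print(analyse("Myself", "PRP"))
```

Cases where even the XPOS plus form is not enough (those needing syntactic context) would fall outside this lookup and need the rule-based post-processing mentioned above.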
Okay! Finally, the questions. First, three general questions:
Now some specifics:
Sorry this was so long!
Thanks, Matthew Honnibal