UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Mapping tables that go (Form, Treebank POS) --> (Lemma, UD POS, Morphological features) #212

Closed honnibal closed 8 years ago

honnibal commented 9 years ago

Hi all,

I've been working on moving spaCy (an NLP pipeline, http://spacy.io ) to the UD scheme for some time now. I'm currently trying to produce some mapping files, which I think might also be of use to others. Some of what I'm doing might exist already --- if so, I'd appreciate pointers to the resources :). If not, I'm looking for a little help in making sure I'm applying the annotation schemes correctly, particularly the morphological scheme.

First, as a point of terminology, is there a standard way to describe language/treebank-specific POS schemes, e.g. the VBZ, NNS etc scheme used in the PTB? For now, I'll call these tags "XPOS", for "extended POS tags". I'll reserve the term "POS tag" for one of the 17 UD POS tags.

As a second point of terminology, I'll call the text-field of an inflected token an "orthographic form" (as opposed to a lemma).

Often an XPOS tag maps to a single POS, and zero or more morphological features. I've found useful mapping tables like this in the Interset project. For my purposes, I'm formatting these as JSON files. I call this a "Tag Map": https://github.com/honnibal/spaCy/blob/master/lang_data/en/tag_map.json . Example entry:

"NNS": {"pos": "noun", "number": "plur"}
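A minimal sketch of how such a Tag Map could be applied, assuming the JSON layout shown in the entry above. Only the "NNS" entry comes from this post; the "NN" entry and the function name `map_tag` are hypothetical, added for illustration.

```python
import json

# Hypothetical excerpt of a tag map. Only "NNS" is taken from the post;
# "NN" is an assumed entry added to make the example less trivial.
TAG_MAP_JSON = """
{
  "NNS": {"pos": "noun", "number": "plur"},
  "NN":  {"pos": "noun", "number": "sing"}
}
"""

TAG_MAP = json.loads(TAG_MAP_JSON)


def map_tag(xpos, tag_map):
    """Return (pos, features) for an XPOS tag, or (None, {}) if unmapped."""
    entry = tag_map.get(xpos)
    if entry is None:
        return None, {}
    # Everything except "pos" is treated as a morphological feature.
    feats = {key: value for key, value in entry.items() if key != "pos"}
    return entry["pos"], feats


pos, feats = map_tag("NNS", TAG_MAP)
print(pos, feats)
```

Returning `(None, {})` for an unknown tag is a design choice of this sketch; a real pipeline might prefer to raise an error so that gaps in the table are noticed.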

Sometimes the XPOS doesn't map cleanly, e.g. the TO tag in the Penn Treebank. So I want to have another mapping file, which is keyed by an XPOS and an orthographic form. This is also necessary for exceptional forms in the language, which encode additional morphological features in their orthographic form that are not captured by their XPOS. Personal pronouns seem a common example. An example entry:

 "me/PRP":  {"lemma": "-PRON-", "PronType": "Prs", "Person": "One",   "Number": "Sing",  "Case": "Acc"}
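A rough sketch of how the two tables might be combined: start from the tag-level entry, then let a "form/XPOS" entry override or extend it. The "me/PRP" entry is the one above; the "PRP" tag map entry, the lowercasing of the form, and the function name `analyse` are assumptions made for the example.

```python
# Assumed tag-level entry (not from the post).
TAG_MAP = {
    "PRP": {"pos": "pron"},
}

# Exceptions keyed by "form/XPOS". The "me/PRP" entry is copied from the post.
EXCEPTIONS = {
    "me/PRP": {
        "lemma": "-PRON-",
        "PronType": "Prs",
        "Person": "One",
        "Number": "Sing",
        "Case": "Acc",
    },
}


def analyse(form, xpos):
    """Merge the tag-level analysis with any form-specific exception."""
    analysis = dict(TAG_MAP.get(xpos, {}))
    # Assumption: exception keys are stored lowercased.
    key = f"{form.lower()}/{xpos}"
    analysis.update(EXCEPTIONS.get(key, {}))
    return analysis


print(analyse("me", "PRP"))
print(analyse("it", "PRP"))  # no exception entry, so only the tag-level analysis
```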

(Question: What's the favoured lemma for personal pronouns? I've gone with the special value "-PRON-" above, but it'd be nice to align this with what other people are doing.)

The token/XPOS pair will still fail to map cleanly in some cases. Where it doesn't, my idea is simply to transform the XPOS scheme, given the treebank parse. That is, the XPOS is treated as an arbitrary detail of the statistical modelling: a place to jointly predict the part-of-speech and morphology, which rule-based post-processing then transforms into the UD format.

Okay! Finally, the questions. First, three general questions:

  1. Any mapping tables of this form already created?
  2. Sources of equivalent information? Anything like this in the HPSG or LFG grammars?
  3. I've read the UD documentation, and looked over the Interset materials. Does it sound like I've missed some major documentation?

Now some specifics:

  1. In English, is "you" number unmarked, or is it Number=Sing,Plur? Equivalent question for case and gender. Are these clear/central applications of the annotation scheme, or are they difficult edge cases?
  2. Is "mine" Case=Acc?
  3. Is "myself" Case=Acc?

Sorry this was so long!

Thanks, Matthew Honnibal

fginter commented 9 years ago

Hi Matthew,

Can't speak for other languages and your question might be specifically targeting English, but FWIW this is what holds for Finnish:

  1. The mapping is LEMMA+XPOS+XFEAT -> POS+FEAT and in two cases even LEMMA+XPOS+XFEAT+DEPREL -> POS+FEAT (where X marks original data in the original treebank). Note that XFEAT is not present in the final UD file.
  2. The mapping is implemented in the dev-ud branch of the Finnish-dep-parser GitHub project here: https://github.com/TurkuNLP/Finnish-dep-parser/tree/dev-ud/morpho-sd2ud It's not a simple lookup table, but a set of scripts which transform the output of the Finnish morphological analyzer (XPOS+XFEAT) into the UD versions POS+FEAT.
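To illustrate the shape of a LEMMA+XPOS+XFEAT(+DEPREL) -> POS+FEAT mapping written as a script rather than a table, here is a small sketch. It is not taken from the Finnish-dep-parser scripts: the tag names ("Pron", "N", "V"), the feature names ("SG", "PL"), and every rule below are hypothetical.

```python
def convert(lemma, xpos, xfeat, deprel=None):
    """Map (LEMMA, XPOS, XFEAT[, DEPREL]) to (POS, FEAT).

    All rules here are invented for illustration. XFEAT is consumed
    but not copied to the output, mirroring the note that XFEAT is
    not present in the final UD file.
    """
    if xpos == "Pron":
        # Hypothetical DEPREL-dependent rule: pronoun vs. determiner.
        pos = "DET" if deprel == "det" else "PRON"
    elif xpos == "V":
        # Hypothetical lemma-dependent rule, for illustration only.
        pos = "AUX" if lemma == "olla" else "VERB"
    elif xpos == "N":
        pos = "NOUN"
    else:
        pos = "X"

    feats = {}
    if "PL" in xfeat:
        feats["Number"] = "Plur"
    elif "SG" in xfeat:
        feats["Number"] = "Sing"
    return pos, feats


print(convert("tämä", "Pron", {"SG"}, deprel="det"))
print(convert("tämä", "Pron", {"SG"}))
```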

Best,

Filip

honnibal commented 9 years ago

Thanks. I'm targeting multi-lingual. I'm hoping the simpler mapping will work for most languages, although I know some languages might need to work differently.

How does your lemmatization work? Do you go ORTH+XPOS -> LEMMA+XPOS+XFEAT?

fginter commented 9 years ago

Finnish lemmatization uses the two-level morphological analyzer OMorFi. That is distributed with the parser. So we go ORTH -> [morpho analyzer] -> several LEMMA+XPOS+XFEAT alternatives -> [bunch of scripts] -> several LEMMA+POS+FEAT alternatives. And then a CRF (Marmot) to disambiguate the competing readings.
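The flow described above (analyzer -> several readings -> conversion -> disambiguation) can be sketched as follows. Nothing here calls OMorFi or Marmot: `toy_analyzer` is a hard-coded stand-in that uses an English word for readability, and `disambiguate` is a trivial context rule standing in for the CRF. All names and tags are hypothetical.

```python
def toy_analyzer(orth):
    """Stand-in for a morphological analyzer: ORTH -> list of readings."""
    readings = {
        "saw": [("see", "V", {"PAST"}), ("saw", "N", {"SG"})],
    }
    return readings.get(orth, [(orth, "UNK", set())])


def to_ud(reading):
    """Convert one (LEMMA, XPOS, XFEAT) reading to (LEMMA, POS, FEAT)."""
    lemma, xpos, xfeat = reading
    pos = {"V": "VERB", "N": "NOUN"}.get(xpos, "X")
    feats = {}
    if "PAST" in xfeat:
        feats["Tense"] = "Past"
    if "SG" in xfeat:
        feats["Number"] = "Sing"
    return lemma, pos, feats


def disambiguate(candidates, prev_pos):
    """Stand-in for a statistical disambiguator: prefer NOUN after DET."""
    if prev_pos == "DET":
        for candidate in candidates:
            if candidate[1] == "NOUN":
                return candidate
    return candidates[0]


def analyse(orth, prev_pos=None):
    candidates = [to_ud(reading) for reading in toy_analyzer(orth)]
    return disambiguate(candidates, prev_pos)


print(analyse("saw", prev_pos="DET"))
print(analyse("saw"))
```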

ftyers commented 9 years ago

We were working on this for Kazakh at the TurkLang conference last week. Here is the spreadsheet we've been writing to be able to have a consistent mapping between our two standards (KNC, Apertium) and UD. Note that we conceive the mapping as unidirectional.

https://docs.google.com/spreadsheets/d/1Q4J3axzxYFFZebhfaCOvgtIy26OgpZcym-mVA_yX7FI/edit#gid=130548646

As Filip mentioned, in the worst case you do need (LEMMA)+XPOS+XFEAT+DEPREL (for example, to make a pronoun/determiner or noun/adjective distinction).

dan-zeman commented 9 years ago

@honnibal : These tables of yours could be quite useful and they can obviously improve over Interset. But as others mentioned, even with lemma it cannot be perfect. Beyond that, it is probably easier to write scripts than tables, and to consider the tree structure (that's what I do).

As for your specific questions, I would say that "myself" is Case=Acc. I would not mark case for "mine" because it can be used both as subject and object, without changing form. I wouldn't mark gender, number and case for "you" (the rule of thumb is: if your list of values contains all values available for the language (such as Number=Plur,Sing for English), then drop the feature entirely).
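The rule of thumb can be sketched as a small pruning step. The value inventory below is an assumption made for the example, not an authoritative list of the values UD uses for English, and `prune_features` is a hypothetical helper name.

```python
# Assumed inventory of feature values for the language (illustrative only).
ENGLISH_VALUES = {
    "Number": {"Sing", "Plur"},
    "Person": {"1", "2", "3"},
}


def prune_features(feats, inventory):
    """Drop any feature whose candidate values cover the whole inventory."""
    pruned = {}
    for name, values in feats.items():
        all_values = inventory.get(name)
        if all_values is not None and set(values) >= all_values:
            # Every available value is possible, so the feature says nothing.
            continue
        pruned[name] = values
    return pruned


# "you": any number, second person.
you = {"Number": {"Sing", "Plur"}, "Person": {"2"}}
print(prune_features(you, ENGLISH_VALUES))
```

Features missing from the inventory are kept unchanged, since the sketch has no way to tell whether their value list is exhaustive.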

On the other hand, we have not settled on the scale between form and function as criteria for morphological (and POS) distinctions. For instance the German STTS tagset is very context-sensitive: the nouns inflect for case quite rarely (and usually the case is determined by the article), yet every noun has one of the four case values assigned. So if you are willing and able to disambiguate, you could distinguish between "you" that is Case=Nom and "you" that is Case=Acc. (In pure theory, the same could be done with Number, but I'm afraid that it would be often undecidable even for human annotators.)
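If one did want to disambiguate by function, a post-parse step might look roughly like this. It is only a sketch: the mapping from dependency relations to case values is an assumption for illustration, not an agreed UD guideline, and `case_from_deprel` is a hypothetical name.

```python
# Assumed mapping from dependency relation to case (illustrative only).
NOMINATIVE_RELS = {"nsubj"}
ACCUSATIVE_RELS = {"obj", "iobj", "obl"}


def case_from_deprel(form, deprel):
    """Return a Case value for an ambiguous pronoun form, or None to leave it unmarked."""
    if form.lower() != "you":
        return None
    if deprel in NOMINATIVE_RELS:
        return "Nom"
    if deprel in ACCUSATIVE_RELS:
        return "Acc"
    # Unknown or uninformative relation: leave Case unmarked.
    return None


print(case_from_deprel("you", "nsubj"))
print(case_from_deprel("you", "obj"))
```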

honnibal commented 9 years ago

Thanks all. It sounds like I'll need an additional process to add further morphological features after parsing for many languages, which I hadn't thought about.

> As for your specific questions, I would say that "myself" is Case=Acc. I would not mark case for "mine" because it can be used both as subject and object, without changing form.

"myself" makes sense as Case=Acc, thanks. But I don't understand how "mine" can be a subject? I would've said it can only be used in a predicative context, or as the object of a preposition.

> (the rule of thumb is: if your list of values contains all values available for the language (such as Number=Plur,Sing for English), then drop the feature entirely).

Okay, thanks. So, there's no distinction between "none of the above" and "any of the above", right?

Here's my current pronouns table for English:

https://github.com/honnibal/spaCy/blob/master/lang_data/en/morphs.json

I'm working on the auxiliaries now.

dan-zeman commented 9 years ago

"mine" ... English is not my native language and it is quite possible that I am using it wrongly. I thought I could say something like: "Your car is green. Mine is red." Is that ungrammatical?

honnibal commented 9 years ago

Ah --- yeah, of course.

fginter commented 9 years ago

All cases in UD English are here

honnibal commented 9 years ago

@fginter --- Thanks!!! That search engine is fantastically helpful.

manning commented 9 years ago

For English, we have a translator from Penn Treebank trees to UPOS tags: https://github.com/stanfordnlp/CoreNLP/blob/master/data/edu/stanford/nlp/upos/ENUniversalPOS.tsurgeon . It's a Tsurgeon translation file. Unlike for Google's universal POS, there are a number of instances where a translation from XPOS just isn't possible without seeing syntactic context.
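As an illustration of why syntactic context can be needed, consider the Penn Treebank TO tag mentioned earlier in the thread. The sketch below is not the content of the Tsurgeon file; it assumes a simplified rule (TO directly under a PP is a preposition, otherwise an infinitive marker), and both the rule and the names `CONTEXT_FREE` and `upos_for` are hypothetical.

```python
# Tags that can be mapped without context (assumed excerpt).
CONTEXT_FREE = {
    "NNS": "NOUN",
    "VBZ": "VERB",
}


def upos_for(xpos, parent_label=None):
    """Map an XPOS tag to a UPOS tag, using the parent phrase label if needed."""
    if xpos == "TO":
        # Simplified assumption: prepositional "to" sits under PP,
        # infinitival "to" sits elsewhere (e.g. under VP).
        return "ADP" if parent_label == "PP" else "PART"
    return CONTEXT_FREE.get(xpos, "X")


print(upos_for("TO", parent_label="PP"))  # "to" in "went to school"
print(upos_for("TO", parent_label="VP"))  # "to" in "wants to go"
```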

manning commented 8 years ago

Seems like basically all questions are answered, to the extent they will be ... so closing this.