Better compatibility with OpenCorpora format: tagset restrictions.

dchaplinsky commented 9 years ago

The OpenCorpora dictionary format has an option to specify a list of restrictions on tag usage: https://github.com/kmike/pymorphy2/blob/master/dev_data/toy_dict.xml#L116

Basically, they specify which tags can (or cannot) be combined with others. For example, verbs, unlike nouns, cannot have case.

igor-tytyk commented 9 years ago

As I mentioned in the email, I created a Python dictionary where a key:value pair looks as follows: 'post': {'obligatory': [grammar_cat1, grammar_cat2], 'maybe': [grammar_cat3, grammar_cat4]}, for example 'noun': {'obligatory': ['case', 'nmbr'], 'maybe': ['gndr', 'extra']}

Using mapping.csv, it should be possible to generate those XML 'restrictions'. The mapping will be used to replace the grammatical categories with their values.

dchaplinsky commented 9 years ago

Great.

igor-tytyk commented 9 years ago

I wrote a piece of code that generates those restrictions, except for the first one (the one that says that a POST is obligatory).

However, I have some questions.
1) I wrote the restriction map myself. It definitely needs review by Ukrainian linguists (I will probably talk to Mariana about that tomorrow).
2) In the link you provided, the XML elements have a certain format:

<left>NOUN</left><right>ANim</right>

I am not sure what "left" and "right" mean. I assumed that "left" is a POS and "right" is the grammar category that it is required or allowed to have (again, except for the first restriction in the example). Basically, I copied this format.
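A minimal sketch of that generation step, under the assumption stated above (left = POS, right = grammar category). The element names `restr`, `left` and `right` and the `type` attribute follow the linked example; the wrapper element `restrictions`, the helper name `build_restrictions` and the absence of any other attributes are illustrative assumptions, not the actual LT2OpenCorpora code.

```python
import xml.etree.ElementTree as ET


def build_restrictions(restriction_map):
    """Build an XML tree from {pos: {'obligatory': [...], 'maybe': [...]}}.

    Hypothetical helper: one <restr> element is emitted per (POS, category) pair.
    """
    root = ET.Element('restrictions')  # wrapper name is an assumption
    for pos, kinds in restriction_map.items():
        for kind in ('obligatory', 'maybe'):
            for category in kinds.get(kind, []):
                restr = ET.SubElement(root, 'restr', {'type': kind})
                ET.SubElement(restr, 'left').text = pos        # assumed: the POS
                ET.SubElement(restr, 'right').text = category  # assumed: the category
    return root


example = {'noun': {'obligatory': ['case', 'nmbr'], 'maybe': ['gndr', 'extra']}}
xml_text = ET.tostring(build_restrictions(example), encoding='unicode')
print(xml_text)
```

Replacing each category with its values from mapping.csv would be a separate pass over the `right` elements; it is left out here because the layout of that file is not shown in this thread.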

dchaplinsky commented 9 years ago

Ok, create a pull request then.

@kmike, can you address the second question?

igor-tytyk commented 9 years ago

Here is the python dict with named tuples (obligatory - grammar categories that the POS must have, maybe - can have):

    from collections import namedtuple

    # obligatory: categories the POS must have; maybe: categories it can have
    GrammemeSet = namedtuple('GrammemeSet', ['obligatory', 'maybe'])

    # the variable name `restrictions` is illustrative
    restrictions = {
        'noun': GrammemeSet(obligatory=['case', 'nmbr'], maybe=['gndr', 'extra']),
        'pron': GrammemeSet(obligatory=['case', 'nmbr'], maybe=['extra']),
        'adj': GrammemeSet(obligatory=['case', 'nmbr'], maybe=['gndr', 'forms', 'extra']),
        'adjp': GrammemeSet(obligatory=['case', 'nmbr'], maybe=['gndr', 'forms', 'extra']),
        'verb': GrammemeSet(obligatory=['pers', 'nmbr', 'aspc', 'tense'],
                            maybe=['mood', 'verb_type', 'req_case', 'trns']),
        'adv': GrammemeSet(obligatory=[], maybe=['forms', 'extra']),
        'advp': GrammemeSet(obligatory=['aspc'], maybe=['extra']),
        'conj': GrammemeSet(obligatory=['conj_type'], maybe=['extra']),
        'numr': GrammemeSet(obligatory=['case'], maybe=['extra']),
        'predic': GrammemeSet(obligatory=[], maybe=['extra']),
        'insert': GrammemeSet(obligatory=[], maybe=['extra']),
        'prep': GrammemeSet(obligatory=[], maybe=['req_case', 'extra']),
        'excl': GrammemeSet(obligatory=[], maybe=['extra']),
        'part': GrammemeSet(obligatory=[], maybe=['extra']),
    }

dchaplinsky commented 9 years ago

Question: have you had a chance to check whether all forms in the dictionary conform to these restrictions?

igor-tytyk commented 9 years ago

As for the 'obligatory' ones, I checked in the 1000 excerpt that the corresponding POS has the grammeme in all cases. The 'maybe' ones are also based on observations in the excerpt and on my intuition. For example, I added 'extra' everywhere, assuming that any word can be 'obsolete' or 'dialect', and I added 'req_case' to the POS 'prep' since prepositions do require a certain case.
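A check like this could be automated once the full dictionary is available. The sketch below is only an illustration: the function name `check_form` and the representation of a form as a POS plus a list of category names are assumptions, not part of the existing code.

```python
from collections import namedtuple

GrammemeSet = namedtuple('GrammemeSet', ['obligatory', 'maybe'])


def check_form(pos, categories, restriction_map):
    """Return (missing, unexpected) category lists for one dictionary form.

    missing: obligatory categories the form lacks
    unexpected: categories that are neither obligatory nor 'maybe' for this POS
    """
    rule = restriction_map[pos]
    present = set(categories)
    allowed = set(rule.obligatory) | set(rule.maybe)
    missing = [c for c in rule.obligatory if c not in present]
    unexpected = sorted(present - allowed)
    return missing, unexpected


restriction_map = {
    'noun': GrammemeSet(obligatory=['case', 'nmbr'], maybe=['gndr', 'extra']),
    'prep': GrammemeSet(obligatory=[], maybe=['req_case', 'extra']),
}

print(check_form('noun', ['case', 'gndr'], restriction_map))  # (['nmbr'], [])
print(check_form('prep', ['tense'], restriction_map))         # ([], ['tense'])
```

Running it over every form and counting the non-empty results would show which 'obligatory' and 'maybe' guesses do not hold outside the excerpt.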

igor-tytyk commented 9 years ago

I could improve the mapping if I had access to the full dictionary.

igor-tytyk commented 9 years ago

Also, it's possible to extend the mapping with the values of the grammar categories (currently there is 'case', but no 'v_naz', 'v_dav'), but I am not sure whether this is needed. In the pymorphy example they have 'case' for all cases, but they also have 'anim' and 'inanim', whereas in our mapping.csv 'ist' is under 'extra'.

mariana-scorp commented 9 years ago

Hi guys!

A few comments here:

Igor, you are right that the 'extra' tag may be present for any part of speech.
Add ANim as an obligatory category for nouns. In the next version of the dictionary, the 'ist' tag will be renamed to 'anim', and the 'inanim' tag will be added, so these tags will have their separate category and won't need to live among 'extras'. Nouns that can be either animate or inanimate will possess both of these tags.
It's good that the 'nmbr' category is obligatory. We don't have the 's' (singular) tag for nouns yet (just the gender), but we've already discussed it, and it will be generated in the next version of the dictionary.
I think we may add some lower-level restrictions, for example, it's obligatory for a singular noun, a singular adjective and a singular adjp to have gender, etc. Would it be useful?
Additional tags for pronouns are under development now. I foresee different obligatory categories for each pronoun category. I'll keep you posted.
It's obligatory for an adjp to have aspect ('aspc'). It is not ready yet, and we are working on it. Stay tuned)
I've just noticed that we are missing the number value for the verbs of the 3rd person singular in the past. I'll make sure we fix this as soon as possible.
The obligatory categories for verbs that you wrote out work for all verb forms except for infinitives and impersonal verbs. Probably, this is the reason why they have infinitive as a separate part of speech in OpenCorpora? I'm not sure how to address this issue properly, though.
There's no person value for the verbs in the past, for example, "співала" can refer to any person. I guess the person category can also be added to obligatory tags for the verbs in the present and in the future (lower-level restrictions again).
There's no tense value for the verbs in the imperative mood, for example, "співай". I guess the tense category can also be added to obligatory tags for the verbs in the indicative mood (theoretically speaking, we can add a tag for that, too).
Gender should be added as a 'maybe' category for verbs. We have it for the verbs in the past.
Verbs don't have the 'req_case' category, and we don't have any plans to add it soon. It could be useful, though.
The category of transitivity is removed from the dictionary for now, so don't be surprised when you don't see any examples. It was removed because the source Andriy used was unreliable. Nevertheless, I've talked to a guy today who says that his students annotated 20 thousand transitive verbs. If he shares this list with us, we will add this category back)
'req_case' is obligatory for prepositions.

igor-tytyk commented 9 years ago

M: "I think we may add some lower-level restrictions, for example, it's obligatory for a singular noun, a singular adjective and a singular adjp to have gender, etc. Would it be useful?"

Overall, I think it would be useful. If the dictionary is going to be modified or extended, the restrictions will help to prevent missing grammar info (or, depending on the interface, urge the person to fill out the obligatory categories). However, it seems that the guys in pymorphy decided to condition their restrictions solely on POS tags. Therefore, the current scheme I worked out should be changed.

It would be helpful to have an exhaustive list of restrictions (or at least all types of restrictions, e.g. the ones that are conditioned not only on a POS tag, but also on grammar categories and/or their values).
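One possible shape for such lower-level restrictions, as a rough sketch only: each rule names a condition on category values and the category that becomes obligatory when the condition holds. The `Rule` tuple, the dict representation of a form, and the tense value 'pres' are hypothetical placeholders; 's' is the singular tag mentioned above.

```python
from collections import namedtuple

# condition: {category: value} pairs that must all match for the rule to apply
# required: the category that is then obligatory
Rule = namedtuple('Rule', ['condition', 'required'])

RULES = [
    # a singular noun must have gender
    Rule(condition={'post': 'noun', 'nmbr': 's'}, required='gndr'),
    # a verb in the present must have person ('pres' is a placeholder value)
    Rule(condition={'post': 'verb', 'tense': 'pres'}, required='pers'),
]


def violations(form, rules):
    """Return the required categories that a form lacks under the applicable rules."""
    missing = []
    for rule in rules:
        applies = all(form.get(cat) == val for cat, val in rule.condition.items())
        if applies and rule.required not in form:
            missing.append(rule.required)
    return missing


print(violations({'post': 'noun', 'nmbr': 's', 'case': 'v_naz'}, RULES))  # ['gndr']
```

A rule whose condition contains only 'post' reduces to the POS-only scheme, so both kinds of restriction could live in one list.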

mariana-scorp commented 9 years ago

The first variant lives here now: https://github.com/dchaplinsky/LT2OpenCorpora/blob/master/lt2opencorpora/tag_restrictions.txt

dchaplinsky commented 9 years ago

:+1:

kmike commented 9 years ago

Hi,

To clarify: pymorphy2 does not use restrictions from the dictionary; they are used by OpenCorpora UI and added to the XML export for completeness. Initially I thought it would be possible to use them to help with the inflection, but it turns out OpenCorpora restrictions are not helpful here.

It seems that in the restrictions "left" are grammemes common to all forms in a lexeme, and "right" are grammemes that can differ. In OpenCorpora tags these two groups are separated by a space, e.g. NOUN,anim,masc sing,nomn. "left" is what is in <l> elements, "right" is what is in <f> elements. I'm not sure why it is implemented like this, and I can't remember what "lemma" and "form" restrictions mean.
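To illustrate the split described above, here is a small sketch that separates a tag string into its lexeme-level and form-level grammemes. The function name `split_tag` is made up for this example; it is plain string handling and does not use any pymorphy2 API.

```python
def split_tag(tag):
    """Split an OpenCorpora-style tag into (lexeme-level, form-level) grammemes.

    The part before the first space is shared by all forms of a lexeme;
    the part after it can differ between forms.
    """
    lexeme_part, _, form_part = tag.partition(' ')
    left = lexeme_part.split(',') if lexeme_part else []
    right = form_part.split(',') if form_part else []
    return left, right


print(split_tag('NOUN,anim,masc sing,nomn'))
# (['NOUN', 'anim', 'masc'], ['sing', 'nomn'])
```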

Following the OpenCorpora format for restrictions doesn't help pymorphy2; feel free to drop this section from the dictionary. It is not even parsed by pymorphy2.

dchaplinsky / LT2OpenCorpora

Better compatibility with OpenCorpora format: tagset restrictions. #4