dchaplinsky / LT2OpenCorpora

Python script to convert a Ukrainian morphological dictionary to the OpenCorpora format. The script runs well under PyPy and also collects some stats/insights/anomalies in the dicts. Use at your own risk.
MIT License

Better compatibility with OpenCorpora format: tagset restrictions. #4

Open dchaplinsky opened 9 years ago

dchaplinsky commented 9 years ago

The OpenCorpora dictionary format has an option to specify a list of restrictions on tag usage: https://github.com/kmike/pymorphy2/blob/master/dev_data/toy_dict.xml#L116

Basically, they say which tags can (or cannot) go with other ones. For example, verbs, unlike nouns, cannot have cases.

igor-tytyk commented 9 years ago

As I mentioned in the email, I created a Python dictionary where a key:value pair looks as follows:

    'post': {'obligatory': [grammar_cat1, grammar_cat2], 'maybe': [grammar_cat3, grammar_cat4]}

For example:

    'noun': {'obligatory': ['case', 'nmbr'], 'maybe': ['gndr', 'extra']}

Using mapping.csv, it should be possible to generate those XML 'restrictions'. The mapping will be used to replace the grammatical categories with their values.
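A minimal sketch of that generation step, for illustration only. The `restr`/`left`/`right` element layout, the `type` attribute, and the contents of `CATEGORY_VALUES` (a stand-in for mapping.csv, whose real columns are not shown in this thread) are all assumptions, as is the choice to emit one restriction per value rather than per category.

```python
import xml.etree.ElementTree as ET

# Restriction map in the shape described above (one POS shown).
RESTRICTION_MAP = {
    "noun": {"obligatory": ["case", "nmbr"], "maybe": ["gndr", "extra"]},
}

# Hypothetical stand-in for mapping.csv: grammatical category -> its values.
# The values are placeholders, not the real contents of the mapping.
CATEGORY_VALUES = {
    "case": ["v_naz", "v_dav"],
    "nmbr": ["s", "p"],
    "gndr": ["m", "f", "n"],
    "extra": ["rare"],
}


def build_restrictions(restriction_map, category_values):
    """Build a <restrictions> element with one <restr> per (POS, value) pair."""
    root = ET.Element("restrictions")
    for pos, kinds in restriction_map.items():
        for kind in ("obligatory", "maybe"):
            for category in kinds.get(kind, []):
                # Fall back to the category name if the mapping has no values.
                for value in category_values.get(category, [category]):
                    restr = ET.SubElement(root, "restr", {"type": kind})
                    ET.SubElement(restr, "left").text = pos
                    ET.SubElement(restr, "right").text = value
    return root


root = build_restrictions(RESTRICTION_MAP, CATEGORY_VALUES)
xml_text = ET.tostring(root, encoding="unicode")
```

Keeping the map and the category-to-value mapping separate means the map can be reviewed by linguists without touching the XML layout.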

dchaplinsky commented 9 years ago

Great.

igor-tytyk commented 9 years ago

I wrote a piece of code that generates those restrictions, except for the first one (the one that says that a POST is obligatory).

However, of course, I have some questions.

1) I wrote the restriction map myself. It definitely needs an upgrade from some Ukrainian linguists (I will probably talk to Mariana about that tomorrow).
2) In the link you provided, the XML elements have a certain format:

<left>NOUN</left><right>ANim</right>

I am not sure what those "left" and "right" are. I assumed that left is a POS and right is the grammar category that it is required or allowed to have (again, except for the first restriction in the example). Basically, I copied this format.

dchaplinsky commented 9 years ago

Ok, create a pull request then.

@kmike, can you address the second question?

igor-tytyk commented 9 years ago

Here is the Python dict with named tuples (obligatory: grammar categories that the POS must have; maybe: categories it can have):

    from collections import namedtuple

    # obligatory: categories the POS must have; maybe: categories it can have
    GrammemeSet = namedtuple('GrammemeSet', ['obligatory', 'maybe'])

    restrictions = {
        'noun': GrammemeSet(obligatory=['case', 'nmbr'], maybe=['gndr', 'extra']),
        'pron': GrammemeSet(obligatory=['case', 'nmbr'], maybe=['extra']),
        'adj': GrammemeSet(obligatory=['case', 'nmbr'], maybe=['gndr', 'forms', 'extra']),
        'adjp': GrammemeSet(obligatory=['case', 'nmbr'], maybe=['gndr', 'forms', 'extra']),
        'verb': GrammemeSet(obligatory=['pers', 'nmbr', 'aspc', 'tense'],
                            maybe=['mood', 'verb_type', 'req_case', 'trns']),
        'adv': GrammemeSet(obligatory=[], maybe=['forms', 'extra']),
        'advp': GrammemeSet(obligatory=['aspc'], maybe=['extra']),
        'conj': GrammemeSet(obligatory=['conj_type'], maybe=['extra']),
        'numr': GrammemeSet(obligatory=['case'], maybe=['extra']),
        'predic': GrammemeSet(obligatory=[], maybe=['extra']),
        'insert': GrammemeSet(obligatory=[], maybe=['extra']),
        'prep': GrammemeSet(obligatory=[], maybe=['req_case', 'extra']),
        'excl': GrammemeSet(obligatory=[], maybe=['extra']),
        'part': GrammemeSet(obligatory=[], maybe=['extra']),
    }

dchaplinsky commented 9 years ago

Question: have you had a chance to check whether all forms in the dictionary conform to these restrictions?

igor-tytyk commented 9 years ago

As for the 'obligatory' ones, I checked in the 1000 excerpt that the corresponding POS has the grammeme in all cases. The 'maybe' ones are also based on observations in the excerpt and on my intuition. For example, I added 'extra' everywhere, assuming that any word can be 'obsolete' or 'dialect'; and I added 'req_case' to the POS 'prep', since prepositions do require a certain case.
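Such a check could be automated roughly as follows. This is only a sketch: it assumes each dictionary form has already been parsed into a POS plus the list of grammar categories it carries, and `check_form` is a hypothetical helper, not part of the converter. Only two POS entries from the map are reproduced here.

```python
from collections import namedtuple

GrammemeSet = namedtuple("GrammemeSet", ["obligatory", "maybe"])

# Two entries copied from the restriction map for illustration.
RESTRICTIONS = {
    "noun": GrammemeSet(obligatory=["case", "nmbr"], maybe=["gndr", "extra"]),
    "adv": GrammemeSet(obligatory=[], maybe=["forms", "extra"]),
}


def check_form(pos, categories, restrictions=RESTRICTIONS):
    """Return a list of violations for one word form (empty if it conforms)."""
    rule = restrictions.get(pos)
    if rule is None:
        return ["unknown POS: %s" % pos]
    present = set(categories)
    problems = []
    # Every obligatory category must be present on the form.
    for missing in sorted(set(rule.obligatory) - present):
        problems.append("%s lacks obligatory category %s" % (pos, missing))
    # Anything outside obligatory + maybe is unexpected for this POS.
    allowed = set(rule.obligatory) | set(rule.maybe)
    for extra in sorted(present - allowed):
        problems.append("%s has unexpected category %s" % (pos, extra))
    return problems
```

Running this over the full dictionary and counting the messages would show which parts of the map are too strict or too loose.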

igor-tytyk commented 9 years ago

I could improve the mapping if I had access to the full dictionary.

igor-tytyk commented 9 years ago

Also, it's possible to extend the mapping with the values of the grammar categories (now there is 'case', but no 'v_naz' or 'v_dav'), but I am not sure if this is needed. In the pymorphy example they have 'case' for all cases, but they also have 'anim' and 'inanim', whereas in our mapping.csv 'ist' is under 'extra'.

mariana-scorp commented 9 years ago

Hi guys!

A few comments here:

igor-tytyk commented 9 years ago

M: "I think we may add some lower-level restrictions, for example, it's obligatory for a singular noun, a singular adjective and a singular adjp to have gender, etc. Would it be useful?"

It would be helpful to have an exhaustive list of restrictions (or at least all types of restrictions, e.g. the ones that are conditioned not only on a POS tag, but also on grammar categories and/or their values).
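One possible way to represent such conditional restrictions, as a sketch only. `ConditionalRule`, `violated_rules`, and the value name "singular" are hypothetical; the single rule shown encodes the example from the quote (a singular noun, adj or adjp must have gender).

```python
from collections import namedtuple

# pos: set of POS tags the rule applies to
# when: category -> value pairs that must all match for the rule to fire
# requires: category that must then be present
ConditionalRule = namedtuple("ConditionalRule", ["pos", "when", "requires"])

RULES = [
    ConditionalRule(
        pos={"noun", "adj", "adjp"},
        when={"nmbr": "singular"},  # placeholder value name
        requires="gndr",
    ),
]


def violated_rules(pos, grammemes, rules=RULES):
    """Return the rules a form breaks; grammemes maps category -> value."""
    broken = []
    for rule in rules:
        if pos not in rule.pos:
            continue
        condition_met = all(
            grammemes.get(category) == value
            for category, value in rule.when.items()
        )
        if condition_met and rule.requires not in grammemes:
            broken.append(rule)
    return broken
```

For example, `violated_rules("noun", {"nmbr": "singular", "case": "v_naz"})` reports the gender rule, while the same form with a `gndr` entry passes.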

mariana-scorp commented 9 years ago

The first variant lives here now: https://github.com/dchaplinsky/LT2OpenCorpora/blob/master/lt2opencorpora/tag_restrictions.txt

dchaplinsky commented 9 years ago

:+1:

kmike commented 9 years ago

Hi,

To clarify: pymorphy2 does not use restrictions from the dictionary; they are used by the OpenCorpora UI and added to the XML export for completeness. Initially I thought it would be possible to use them to help with inflection, but it turns out OpenCorpora restrictions are not helpful here.

It seems that in the restrictions, "left" are grammemes common to all forms in a lexeme, and "right" are grammemes that can differ. In OpenCorpora tags these two groups are separated by a space, e.g. NOUN,anim,masc sing,nomn. "left" is what is in <l> elements, "right" is what is in <f> elements. I'm not sure why it is implemented like this, and I can't remember what "lemma" and "form" restrictions mean.
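The split between the two groups can be illustrated with a small helper. This is a sketch; `split_opencorpora_tag` is a hypothetical name and not a pymorphy2 function.

```python
import re


def split_opencorpora_tag(tag):
    """Split a tag into (lexeme-level grammemes, form-level grammemes).

    The first space separates the grammemes shared by the whole lexeme
    from the grammemes that vary between forms.
    """
    lexeme_part, _, form_part = tag.strip().partition(" ")
    lexeme = [g for g in lexeme_part.split(",") if g]
    # Accept commas or whitespace between form-level grammemes.
    form = [g for g in re.split(r"[,\s]+", form_part.strip()) if g]
    return lexeme, form


lexeme, form = split_opencorpora_tag("NOUN,anim,masc sing,nomn")
# lexeme == ['NOUN', 'anim', 'masc'], form == ['sing', 'nomn']
```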

Following the OpenCorpora format for restrictions doesn't help pymorphy2; feel free to drop this section from the dictionary. It is not even parsed by pymorphy2.