Open dchaplinsky opened 9 years ago
As I mentioned in the email, I created a Python dictionary where a key:value pair looks as follows:
'post': {'obligatory': [grammar_cat1, grammar_cat2], 'maybe': [grammar_cat3, grammar_cat4]}
'noun': {'obligatory': ['case', 'nmbr'], 'maybe': ['gndr', 'extra']}
Using mapping.csv, it should be possible to generate those XML 'restrictions'. The mapping will be used to replace the grammatical categories with their values.
Great.
I wrote a piece of code that generates those restrictions, except for the first one (the one that says that a POST is obligatory).
However, of course, I have some questions.
1) I wrote the restriction map myself. It definitely needs an upgrade from some Ukrainian linguists (I will probably talk to Mariana about that tomorrow).
2) In the link you provided, the XML elements have a certain format:
I am not sure what "left" and "right" mean. I assumed that "left" is a POS and "right" is the grammar category that is required or possible for it (again, except for the first restriction in the example). Basically, I copied this format.
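To make the assumption above concrete, here is a minimal sketch of how the restriction map could be turned into XML. It assumes a `<restrictions>` root with one `<restr>` element per (POS, category) pair, `<left>` holding the POS and `<right>` holding the category, as guessed in this comment. The element and attribute names are taken from memory of the linked toy dictionary and may not match it exactly; `build_restrictions` and `RESTRICTION_MAP` are hypothetical names, and the first restriction (the one saying a POST is obligatory) is not generated here.

```python
import xml.etree.ElementTree as ET
from collections import namedtuple

GrammemeSet = namedtuple("GrammemeSet", ["obligatory", "maybe"])

# Hypothetical two-entry excerpt of the restriction map, for illustration only.
RESTRICTION_MAP = {
    "noun": GrammemeSet(obligatory=["case", "nmbr"], maybe=["gndr", "extra"]),
    "adv": GrammemeSet(obligatory=[], maybe=["forms", "extra"]),
}


def build_restrictions(restriction_map):
    """Build a <restrictions> element with one <restr> per (POS, category) pair."""
    root = ET.Element("restrictions")
    for pos, grammemes in restriction_map.items():
        for kind in ("obligatory", "maybe"):
            for category in getattr(grammemes, kind):
                restr = ET.SubElement(root, "restr", {"type": kind})
                # Assumption: "left" is the POS, "right" is the category.
                left = ET.SubElement(restr, "left", {"type": "lemma"})
                left.text = pos
                right = ET.SubElement(restr, "right", {"type": "lemma"})
                right.text = category
    return root


restrictions = build_restrictions(RESTRICTION_MAP)
xml_text = ET.tostring(restrictions, encoding="unicode")
print(xml_text)
```

If "left" and "right" turn out to mean something else, only the two assignments inside the inner loop would need to change.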
Ok, create a pull request then.
@kmike, can you address the second question?
Here is the Python dict with named tuples ('obligatory' lists the grammar categories that the POS must have, 'maybe' lists the ones it can have):
{'noun': GrammemeSet(obligatory=['case', 'nmbr'], maybe=['gndr', 'extra']),
 'pron': GrammemeSet(obligatory=['case', 'nmbr'], maybe=['extra']),
 'adj': GrammemeSet(obligatory=['case', 'nmbr'], maybe=['gndr', 'forms', 'extra']),
 'adjp': GrammemeSet(obligatory=['case', 'nmbr'], maybe=['gndr', 'forms', 'extra']),
 'verb': GrammemeSet(obligatory=['pers', 'nmbr', 'aspc', 'tense'],
                     maybe=['mood', 'verb_type', 'req_case', 'trns']),
 'adv': GrammemeSet(obligatory=[], maybe=['forms', 'extra']),
 'advp': GrammemeSet(obligatory=['aspc'], maybe=['extra']),
 'conj': GrammemeSet(obligatory=['conj_type'], maybe=['extra']),
 'numr': GrammemeSet(obligatory=['case'], maybe=['extra']),
 'predic': GrammemeSet(obligatory=[], maybe=['extra']),
 'insert': GrammemeSet(obligatory=[], maybe=['extra']),
 'prep': GrammemeSet(obligatory=[], maybe=['req_case', 'extra']),
 'excl': GrammemeSet(obligatory=[], maybe=['extra']),
 'part': GrammemeSet(obligatory=[], maybe=['extra'])}
Question: have you had a chance to check whether all forms in the dictionary conform to these restrictions?
As for the 'obligatory' ones, I checked in the 1000 excerpt whether the corresponding POS has the grammeme in all cases. The 'maybe' ones are also based on observations in the excerpt and on my intuition. For example, I added 'extra' everywhere, assuming that any word can be 'obsolete' or 'dialect', and I added 'req_case' to the POS 'prep' since prepositions do require a certain case.
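A check like the one asked about could be sketched roughly as follows. This is only an illustration: it assumes each dictionary form has already been reduced to a POS plus the list of grammar categories present on it, and `check_form` and `RESTRICTION_MAP` are hypothetical names, not code from the repository.

```python
from collections import namedtuple

GrammemeSet = namedtuple("GrammemeSet", ["obligatory", "maybe"])

# Hypothetical excerpt of the restriction map, for illustration only.
RESTRICTION_MAP = {
    "noun": GrammemeSet(obligatory=["case", "nmbr"], maybe=["gndr", "extra"]),
    "prep": GrammemeSet(obligatory=[], maybe=["req_case", "extra"]),
}


def check_form(pos, categories, restriction_map):
    """Return a list of problems for one form; an empty list means it conforms."""
    if pos not in restriction_map:
        return ["unknown POS: %s" % pos]
    rule = restriction_map[pos]
    present = set(categories)
    problems = []
    # Every obligatory category must be present on the form.
    for category in rule.obligatory:
        if category not in present:
            problems.append("missing obligatory category: %s" % category)
    # Anything outside obligatory + maybe is reported as unexpected.
    allowed = set(rule.obligatory) | set(rule.maybe)
    for category in sorted(present - allowed):
        problems.append("unexpected category: %s" % category)
    return problems


print(check_form("noun", ["case", "nmbr", "gndr"], RESTRICTION_MAP))
print(check_form("noun", ["case", "tense"], RESTRICTION_MAP))
```

Running this over the full dictionary instead of the excerpt would show which 'obligatory' and 'maybe' entries need to be adjusted.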
I could improve the mapping if I had access to the full dictionary.
Also, it's possible to extend the mapping with the values of the grammar categories (right now there is 'case', but no 'v_naz' or 'v_dav'), but I am not sure if this is needed. In the pymorphy example they have 'case' for all cases, but they also have 'anim' and 'inanim', whereas in our mapping.csv 'ist' is under 'extra'.
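If the extension to values is wanted, it could look roughly like the sketch below. It assumes the mapping can be loaded into a dict from a category name to its value tags; the actual layout of mapping.csv may differ, and `CATEGORY_VALUES`, its contents, and `expand_categories` are illustrative placeholders only.

```python
# Hypothetical category-to-values table; the real one would come from mapping.csv.
CATEGORY_VALUES = {
    "case": ["v_naz", "v_dav"],
}


def expand_categories(categories, category_values):
    """Replace each category name with its known values; keep unknown names as-is."""
    expanded = []
    for category in categories:
        expanded.extend(category_values.get(category, [category]))
    return expanded


print(expand_categories(["case", "nmbr"], CATEGORY_VALUES))
```

Categories without an entry in the table pass through unchanged, so the expansion can be introduced gradually.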
Hi guys!
A few comments here:
M: "I think we may add some lower-level restrictions, for example, it's obligatory for a singular noun, a singular adjective and a singular adjp to have gender, etc. Would it be useful?"
It would be helpful to have an exhaustive list of restrictions (or at least all types of restrictions, e.g. the ones that are conditioned not only on a POS tag, but also on grammar categories and/or their values).
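A lower-level restriction of this kind, conditioned on a grammeme value and not only on the POS, could be represented roughly as below. This is a sketch under assumptions: the rule shape, the names `ConditionalRule` and `violated_rules`, and the value tags such as 'sing' are hypothetical and are not taken from mapping.csv.

```python
from collections import namedtuple

# Hypothetical rule shape: a tag with POS `pos` whose grammemes match `when`
# must also carry the category named in `requires`.
ConditionalRule = namedtuple("ConditionalRule", ["pos", "when", "requires"])

RULES = [
    ConditionalRule(pos="noun", when={"nmbr": "sing"}, requires="gndr"),
    ConditionalRule(pos="adj", when={"nmbr": "sing"}, requires="gndr"),
    ConditionalRule(pos="adjp", when={"nmbr": "sing"}, requires="gndr"),
]


def violated_rules(pos, grammemes, rules):
    """Return the rules that apply to the tag but are not satisfied.

    `grammemes` maps a category name to its value, e.g. {"nmbr": "sing"}.
    """
    return [
        rule
        for rule in rules
        if rule.pos == pos
        and all(grammemes.get(cat) == val for cat, val in rule.when.items())
        and rule.requires not in grammemes
    ]


print(violated_rules("noun", {"nmbr": "sing", "case": "nom"}, RULES))
```

The same shape would also cover rules conditioned on several values at once, since `when` can hold more than one category.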
The first variant lives here now: https://github.com/dchaplinsky/LT2OpenCorpora/blob/master/lt2opencorpora/tag_restrictions.txt
:+1:
Hi,
To clarify: pymorphy2 does not use restrictions from the dictionary; they are used by OpenCorpora UI and added to the XML export for completeness. Initially I thought it would be possible to use them to help with the inflection, but it turns out OpenCorpora restrictions are not helpful here.
It seems that in the restrictions "left" are grammemes common for all forms in a lexeme, and "right" are grammemes that can differ. In OpenCorpora tags these grammemes are separated by a space, e.g. NOUN,anim,masc sing,nomn. "left" is what is in <l> elements, "right" is what is in <f> elements. I'm not sure why it is implemented like this, and I can't remember what "lemma" and "form" restrictions mean.
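The split described above can be illustrated with a few lines of Python. This is just a sketch of the tag layout, assuming the lexeme-level and form-level parts are separated by a single space and the grammemes within each part by commas; `split_tag` is a hypothetical helper, not a pymorphy2 function.

```python
def split_tag(tag):
    """Split an OpenCorpora-style tag into lexeme-level and form-level grammemes."""
    # Everything before the first space is shared by all forms of the lexeme;
    # everything after it can differ between forms.
    lexeme_part, _, form_part = tag.partition(" ")
    lexeme_grammemes = lexeme_part.split(",") if lexeme_part else []
    form_grammemes = form_part.split(",") if form_part else []
    return lexeme_grammemes, form_grammemes


print(split_tag("NOUN,anim,masc sing,nomn"))
```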
Following the OpenCorpora format for restrictions doesn't help pymorphy2; feel free to drop this section from the dictionary. It is not even parsed by pymorphy2.
The OpenCorpora dictionary format has an option to specify a list of restrictions on tag usage: https://github.com/kmike/pymorphy2/blob/master/dev_data/toy_dict.xml#L116
Basically they are saying which tags can (or cannot) go with other ones:
For example, verbs, unlike nouns, cannot have cases, etc.