Open PolinaZulik opened 2 years ago
I suggest scraping or just copy-pasting Babenko's dictionary so we can use it locally.
@PolinaZulik, here are some results. :-)
That's it. Feel free to comment on the results!
please output the statistics: how many occurrences are classified and how many are 0? how many unique words are classified and how many are 0? what are your suggestions for adding more verbs? do you think we could identify many more verbs with automatic methods (e.g. typos, aspect), or are most of the NULLs there because the verbs are simply absent from the vocabulary, so we'd have to identify them manually?
@PolinaZulik, here is the code for calculating the statistics (it is in the same folder). The results are as follows:
Of course, ~30% NULL occurrences leaves much to be desired. As I have already mentioned, we could use Wiktionary to obtain more verbs of the other aspect, but 1) its HTML structure is too complicated and inconsistent to extract the links to the verbs of the other aspect, and 2) imho there would be no great difference between Reverso and Wiktionary. As the verbs in our files are almost all in everyday use, both of these sites can be used.
I'm not sure, but we could try correcting all the typos/colloquial forms (if any) manually and running the algorithms once more; I don't think we would see much improvement, though.
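A minimal sketch of how the requested statistics could be computed. It assumes the data has already been read into `(lemma, category)` pairs and that unclassified rows carry `0`, `"0"`, `"NULL"`, an empty string or `None`; the function name and the set of NULL markers are assumptions for illustration, not the script from the folder.

```python
from collections import Counter

# Assumed markers for "no category found"; adjust to whatever the files use.
NULL_MARKERS = {None, "", 0, "0", "NULL"}

def null_statistics(rows):
    """Count classified vs. unclassified occurrences and unique lemmas.

    rows: iterable of (lemma, category) pairs, one pair per occurrence.
    """
    occurrences = Counter()
    unique = {"classified": set(), "null": set()}
    for lemma, category in rows:
        key = "null" if category in NULL_MARKERS else "classified"
        occurrences[key] += 1
        unique[key].add(lemma)
    total = sum(occurrences.values())
    return {
        "occurrences_classified": occurrences["classified"],
        "occurrences_null": occurrences["null"],
        "unique_classified": len(unique["classified"]),
        "unique_null": len(unique["null"]),
        # Share of NULL occurrences, 0.0 for empty input.
        "null_share": occurrences["null"] / total if total else 0.0,
    }

# Toy rows, invented purely to show the call.
toy_rows = [("решить", "1.6.7"), ("решить", "1.6.7"), ("парить", "1.1.1.5"), ("verb_x", 0)]
stats = null_statistics(toy_rows)
```

Counting occurrences and unique lemmas separately matters here: a few frequent unclassified verbs can account for a large share of NULL occurrences while being only a handful of unique words.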
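Before correcting typos by hand, candidate corrections could be suggested automatically and then reviewed. The sketch below uses `difflib.get_close_matches` from the Python standard library; the function name, the toy vocabulary and the `0.8` cutoff are assumptions chosen for illustration and would need tuning on the real data.

```python
import difflib

def suggest_corrections(word, vocabulary, cutoff=0.8, n=3):
    """Return up to n vocabulary entries that look like `word`.

    A word already in the vocabulary is returned unchanged.
    """
    if word in vocabulary:
        return [word]
    return difflib.get_close_matches(word, vocabulary, n=n, cutoff=cutoff)

# Toy vocabulary standing in for the dictionary's verb list.
toy_vocabulary = ["решить", "решать", "парить"]
suggestions = suggest_corrections("решитъ", toy_vocabulary)
```

A fairly high cutoff keeps the suggestions conservative, so a misspelling is not silently mapped to the verb of the other aspect; anything returned should still be checked manually.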
please use the data in this folder. add a category column. for every verb in the lemma column, please add the category number and name from Babenko's dictionary. for example, решить: 1.6.7. Предложения, отображающие ситуацию решения (i.e. "Sentences denoting a situation of deciding"). note that some verbs are absent from the dictionary in their current aspect; please change the aspect if needed (решить->решать) with pymorphy.inflect, or wiktionary, or whatever you like. only change it for internal processing; leave my columns as they are. if there are several categories for a verb, add them in frequency order. e.g. Парить:
will have 1.1.1.5, 1.5.2.1, 2.2.2.2, 2.2.4.1. e.g. решать occurs 4 times in the dictionary, but every time with the same 1.6.7 category, so it'll only get 1.6.7.
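The ordering rule described above can be sketched as follows, assuming the dictionary has been flattened into `(verb, category)` pairs, one per occurrence. The function names and the `aspect_pairs` mapping are hypothetical; how the aspect partner is actually obtained (Wiktionary, Reverso, a hand-made list) is left open.

```python
from collections import Counter

def categories_by_frequency(entries):
    """Map each verb to its distinct categories, most frequent first.

    entries: iterable of (verb, category) pairs, one per dictionary occurrence.
    Ties keep the order in which the categories were first seen.
    """
    counts = {}
    for verb, category in entries:
        counts.setdefault(verb, Counter())[category] += 1
    return {verb: [cat for cat, _ in c.most_common()] for verb, c in counts.items()}

def lookup(lemma, categories, aspect_pairs):
    """Look a lemma up directly, then via its aspect partner (internal use only)."""
    if lemma in categories:
        return categories[lemma]
    partner = aspect_pairs.get(lemma)  # hypothetical mapping, e.g. решить -> решать
    return categories.get(partner, [])

# решать appears 4 times with the same category, as in the example above;
# "verb_x" and its categories "A"/"B" are invented placeholders.
toy_entries = [("решать", "1.6.7")] * 4 + [("verb_x", "B"), ("verb_x", "A"), ("verb_x", "A")]
toy_categories = categories_by_frequency(toy_entries)
found = lookup("решить", toy_categories, {"решить": "решать"})
```

The lemma passed to `lookup` is never rewritten, so the original lemma column stays as it is and the aspect change is used for matching only.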
you can replace my files in my folder if you like - so we don't duplicate data and waste disk space. just be careful and make sure existing columns are not changed. for that, I'd suggest testing your script on separate files first.
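One way to make sure the existing columns are not changed is to compare the rows before and after adding the new column. This is a sketch on in-memory rows (dicts, as produced by `csv.DictReader`); the function names and the `lemma`/`category` column names are assumptions based on the description above.

```python
def add_category_column(rows, categorize, column="category"):
    """Return new rows with one extra column; the input rows are not modified."""
    return [{**row, column: categorize(row["lemma"])} for row in rows]

def existing_columns_unchanged(original, updated):
    """True if every original row survives with all its original values intact."""
    if len(original) != len(updated):
        return False
    return all(
        all(new.get(key) == value for key, value in old.items())
        for old, new in zip(original, updated)
    )

# Toy rows; "sentence" is an invented column standing in for the existing ones.
toy_original = [{"lemma": "решить", "sentence": "..."}, {"lemma": "verb_x", "sentence": "..."}]
toy_updated = add_category_column(
    toy_original, lambda lemma: "1.6.7" if lemma == "решить" else "0"
)
safe_to_write = existing_columns_unchanged(toy_original, toy_updated)
```

Running the check on a copy of the files first, and overwriting the originals only when it passes, follows the suggestion to test the script on separate files.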