PolinaZulik / metaphor-psycho


Get lexico-syntactic categories of verbs #4

Open PolinaZulik opened 2 years ago

PolinaZulik commented 2 years ago

please use the data in this folder. add a category column: for every verb in the lemma column, add the category number and name from Babenko's dictionary. for example, решить: 1.6.7. Предложения, отображающие ситуацию решения ("Sentences depicting a situation of deciding"). note that some verbs are absent from the dictionary in their current aspect; please change the aspect if needed (решить->решать) with pymorphy.inflect, or wiktionary, or whatever you like. only change it for internal processing; leave my columns as they are. if a verb has several categories, add them in frequency order. e.g. Парить:

image

will have 1.1.1.5, 1.5.2.1, 2.2.2.2, 2.2.4.1. e.g. решать occurs 4 times in the dictionary, but every time with the same 1.6.7 category, so it'll only get 1.6.7.

you can replace my files in my folder if you like - so we don't duplicate data and waste disk space. just be careful and make sure existing columns are not changed. for that, I'd suggest testing your script on separate files first.
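The ordering rule described above can be sketched as follows. This is only an illustration: it assumes "frequency order" means most frequent category first (ties keep the order of first appearance), and the `entries` list with its counts is hypothetical toy data, not the real dictionary.

```python
from collections import Counter

def categories_by_frequency(entries, verb):
    """Return the distinct categories of `verb`, most frequent first."""
    counts = Counter(cat for v, cat in entries if v == verb)
    # most_common() sorts by count; equal counts keep first-seen order.
    return [cat for cat, _ in counts.most_common()]

# Hypothetical (verb, category) pairs; the counts are invented for the example.
entries = (
    [("решать", "1.6.7")] * 4
    + [("парить", "1.1.1.5")] * 3
    + [("парить", "1.5.2.1")] * 2
    + [("парить", "2.2.2.2"), ("парить", "2.2.4.1")]
)

print(categories_by_frequency(entries, "решать"))
print(categories_by_frequency(entries, "парить"))
```

With this toy data, решать gets a single category even though it occurs four times, and парить gets four categories.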

PolinaZulik commented 2 years ago

I suggest scraping or just copy-pasting Babenko's dictionary so we can use it locally.

Wheatley961 commented 2 years ago

@PolinaZulik , here are some results. :-)

  1. This folder contains all the files that were used or created.
  2. There are two sub-folders. The one with the old index is just a copy of yours. The one with the new index contains all the files with Babenko's columns added.
  3. To start with, I copy-pasted all the verbs from the site, then used the script named Обработка_скопированного... to save them in CSV/Excel format. The resulting files are named Babenko.
  4. All the main processing was done with the script named AddNewColumns, in several stages. First, I used bs4 and requests to obtain the names of Babenko's numbered categories. That's why I also created two additional files named Babenko_final, in which I just updated the values of the second column; they contain the category numbers and their names. Then I opened each of your files in turn and added Babenko's value to the last column. Unfortunately, I wasn't able to set values for some lemmata, for three reasons. The first is typos. The second is that the verb is absent from the dictionary. The third is that the verb's counterpart of the other aspect may be missing from the site I used (it was easier to apply bs4 to Reverso.net than to Wiktionary, whose HTML structure is hard to deal with). The inflect function in pymorphy2 failed to work properly. In all these cases a NULL value was assigned to the lemma.
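The lookup stage described in the list above might look roughly like this. It is a minimal sketch, not the actual AddNewColumns script: the function names, the `aspect_pairs` mapping (standing in for whatever was scraped from Reverso) and the toy rows are all hypothetical.

```python
NULL = "NULL"  # placeholder used in the thread for unmatched lemmata

def lookup_category(lemma, dictionary, aspect_pairs):
    """Look up `lemma`; fall back to its aspect partner; otherwise NULL."""
    if lemma in dictionary:
        return dictionary[lemma]
    partner = aspect_pairs.get(lemma)
    if partner is not None and partner in dictionary:
        return dictionary[partner]
    return NULL

def add_category_column(rows, dictionary, aspect_pairs):
    """Return copies of `rows` with a new 'category' key; other keys are untouched."""
    return [
        {**row, "category": lookup_category(row["lemma"], dictionary, aspect_pairs)}
        for row in rows
    ]

# Hypothetical toy data.
dictionary = {"решать": "1.6.7. Предложения, отображающие ситуацию решения"}
aspect_pairs = {"решить": "решать"}  # perfective -> imperfective partner
rows = [{"lemma": "решить"}, {"lemma": "гуглить"}]

result = add_category_column(rows, dictionary, aspect_pairs)
for row in result:
    print(row["lemma"], "->", row["category"])
```

The aspect partner is used only for the lookup; the original `lemma` value is copied through unchanged, as requested in the issue.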

That's it. Feel free to comment on the results!

PolinaZulik commented 2 years ago

please output the statistics: how many occurrences are classified and how many are 0? how many unique words are classified and how many are 0? what are your suggestions for adding more verbs? do you think we could identify many more verbs with automatic methods (e.g. typos, aspect), or are most of the NULLs there because the verbs are just absent from the vocabulary and we'd have to identify them manually?

Wheatley961 commented 2 years ago

@PolinaZulik , here is the code for calculating the statistics (it is in the same folder). The results are as follows:

  1. All the wordforms of the verbs in all the files: total number of words: 26388; total number of NULL occurrences: 7310.
  2. All the _unique wordforms_ of the verbs in all the files: total number of words: 7535; total number of NULL occurrences: 2776.
  3. All the _unique tokens_ of the verbs in all the files (mind that some lemmata in the original files are incorrect and need proofreading): total number of lemmata: 3331; total number of NULL occurrences: 1692.
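Counts of this kind could be produced along the following lines. This is a sketch only, not the script from the folder: the column names `wordform`, `lemma` and `category` and the toy rows are assumptions, and it also assumes that a given wordform or lemma always receives the same category.

```python
NULL = "NULL"

def coverage(rows, key=None):
    """Return (total, null_count) over all rows, or over unique values of `key`."""
    if key is None:
        categories = [row["category"] for row in rows]
    else:
        # One entry per distinct value of `key`.
        categories = list({row[key]: row["category"] for row in rows}.values())
    return len(categories), categories.count(NULL)

# Hypothetical toy rows.
rows = [
    {"wordform": "решил", "lemma": "решить", "category": "1.6.7"},
    {"wordform": "решил", "lemma": "решить", "category": "1.6.7"},
    {"wordform": "решит", "lemma": "решить", "category": "1.6.7"},
    {"wordform": "гуглит", "lemma": "гуглить", "category": NULL},
    {"wordform": "гуглил", "lemma": "гуглить", "category": NULL},
]

for label, key in [("occurrences", None), ("unique wordforms", "wordform"), ("unique lemmata", "lemma")]:
    total, nulls = coverage(rows, key)
    print(f"{label}: total={total}, NULL={nulls} ({100 * nulls / total:.1f}%)")
```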

Of course, roughly 28% NULL occurrences (and about half of the unique lemmata) leaves much to be desired. As I have already mentioned, we could use Wiktionary to obtain more verbs of the other aspect, but 1) its HTML structure is too complicated and inconsistent to extract the links to the verbs of the other aspect, and 2) imho there would be no great difference between Reverso and Wiktionary. As the verbs in our files are mostly in everyday use, either site can be used.

We could try to correct all the typos/colloquial forms (if any) manually and run the scripts once more, but I don't think we would see much improvement.
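If the typos are corrected by hand, one way to shortlist candidates first is fuzzy matching of the NULL lemmata against the dictionary's verb list. The sketch below uses the standard-library `difflib`; the function name, the 0.8 cutoff and the toy word lists are assumptions, and any suggestion would still need a manual check.

```python
import difflib

def suggest_corrections(unmatched, known_verbs, cutoff=0.8, n=3):
    """Map each unmatched lemma to up to `n` similar dictionary verbs."""
    return {
        lemma: difflib.get_close_matches(lemma, known_verbs, n=n, cutoff=cutoff)
        for lemma in unmatched
    }

# Hypothetical toy data: one misspelled lemma, one verb simply absent.
known_verbs = ["решать", "парить", "говорить"]
unmatched = ["гаворить", "гуглить"]

suggestions = suggest_corrections(unmatched, known_verbs)
for lemma, candidates in suggestions.items():
    print(lemma, "->", candidates or "no close match")
```

Lemmata with an empty candidate list are more likely to be genuinely missing from the dictionary than misspelled.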