brown-uk / dict_uk

Project to generate POS tag dictionary for Ukrainian language
GNU General Public License v3.0

A general question about this project #272

Closed. stjoanis closed this issue 3 years ago

stjoanis commented 3 years ago

First of all, thank you for your project. My wife is working on Ukrainian verbs and I'm trying to make things for her. I have a few questions. What is the Gradle environment? I suppose the code is in Java. I tried to get this project working on CentOS 7 but got a Java UnsupportedClassVersionError when running gradlew expand.

It's a JNI error during the task Autogen

As I'm not at ease with Java, I switched to Windows, but I could not find bin/expand_win.sh. Anyway, I launched gradlew.bat, which seems to have brought in some Java environment. But I feel totally lost, with no idea what to do or what to expect as a result.

Finally, my wife told me that she's trying to find lemmas like this one -> близнюк /n20.a.p.ke.<

Does the file out/lemmas.txt contain such information? If so, I'd like to generate it, or find a ready-made version of it. I found dict_corp_vis.txt.bz2, which seems to be comprehensive data, and my guess is that this project aims to generate language data like this, but perhaps in a custom format that is easier to use for particular cases. Many thanks if someone can point me in the right direction.

Emmanuel

arysin commented 3 years ago

Hi, Gradle is a build system for Java projects. At some point the program that generates all Ukrainian word forms was a simple script, but then it grew into a very complicated system and needed build system support. Java 11 has been required as of recently, and I am not sure whether it's available on CentOS 7. It's been a while since I did this on Windows, so I'll need to dig up that bin/expand_win.sh script (apologies).
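An UnsupportedClassVersionError generally means the running JVM is older than the version the classes were compiled for, so comparing the installed Java version against the Java 11 requirement is a reasonable first check. Below is a minimal Python sketch, assuming the usual `java -version` output with a quoted version string (for example "1.8.0_292" or "11.0.12"). The function names are hypothetical and not part of dict_uk.

```python
import re
import subprocess


def java_major_version(version_output: str) -> int:
    """Extract the major Java version from `java -version` style output.

    Assumes a quoted version string such as "1.8.0_292" (legacy scheme,
    major version is the second number) or "11.0.12" (modern scheme).
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no version string found in output")
    major = int(match.group(1))
    # Java 8 and earlier report themselves as 1.x
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def installed_java_major() -> int:
    """Ask the `java` found on PATH for its version (it prints to stderr)."""
    result = subprocess.run(
        ["java", "-version"], capture_output=True, text=True, check=True
    )
    return java_major_version(result.stderr or result.stdout)
```

Calling `installed_java_major()` on a machine with Java on the PATH and comparing the result with 11 would show whether the JVM is too old for the build.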

In general, data/dict contains the initial lemmas with inflection flags, e.g. for verbs it'd be /v[1-5], or /vr[1-5] for reflexive ones. There are also some exceptions in exceptions.lst (where the inflections are too irregular).
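As a rough illustration of how the verb flags could be picked out of a data/dict file, here is a minimal Python sketch. It assumes each entry has the layout "lemma /flags", as in the близнюк /n20.a.p.ke.< example from the question, and that verb flags start with /v[1-5] or /vr[1-5]. The helper name and the sample verb lines in the usage note are hypothetical, not taken from the repository.

```python
import re

# Verb inflection flags as described in the thread: /v1../v5, or /vr1../vr5
VERB_FLAG = re.compile(r"/vr?[1-5]")


def is_verb_entry(line: str) -> bool:
    """Return True if a dictionary line looks like a verb lemma entry.

    Assumed layout: "<lemma> <flags>", where flags start with "/".
    """
    parts = line.split(None, 1)
    if len(parts) != 2:
        return False
    # match() anchors at the start of the flags field
    return VERB_FLAG.match(parts[1]) is not None
```

With this sketch, made-up lines such as "робити /v1" or "сміятися /vr2" would be accepted, while the noun entry "близнюк /n20.a.p.ke.<" would be rejected.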

If you're looking for verb lemmas, out/dict_corp_vis.txt is the file with all the generated forms, and yes, dict_corp_vis.txt.bz2 on GitHub is a full list (maybe a couple of versions behind, but it should have over 400k lemmas anyway). In dict_corp_vis.txt you'd be looking for rows that don't start with a space (rows that do start with a space are inflections) and that have a "verb:" tag in them.
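The filtering rule described here (keep rows that do not start with a space and carry a "verb:" tag) can be sketched in Python as follows. This is only an illustration: it assumes each row is "word tags" separated by whitespace, the function names are hypothetical, and the sample rows in the usage note are made up rather than copied from the real file.

```python
import bz2
from typing import Iterable, Iterator, List


def verb_lemma_rows(lines: Iterable[str]) -> Iterator[str]:
    """Yield lemma rows (not indented) that have a tag starting with "verb:"."""
    for line in lines:
        row = line.rstrip("\r\n")
        if not row or row[0].isspace():
            continue  # empty row, or an indented inflection row
        tokens = row.split()
        # tokens[0] is the word itself; the rest are assumed to be tags
        if any(token.startswith("verb:") for token in tokens[1:]):
            yield row


def read_verb_lemmas(path: str) -> List[str]:
    """Read verb lemma rows from a plain or .bz2-compressed dictionary file."""
    opener = bz2.open if path.endswith(".bz2") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return list(verb_lemma_rows(handle))
```

For example, `read_verb_lemmas("dict_corp_vis.txt.bz2")` would return the unindented rows tagged as verbs, so a made-up row like "робити verb:imperf:inf" is kept while an indented row like "  роблю verb:imperf:pres:s:1" is skipped.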

HTH Andriy

stjoanis commented 3 years ago

Your help was useful: the Java version hint unblocked me, and I was able to build the project. I also found some of the material I was looking for in \dict_uk\data\dict. For now I'm fine and will explore all of this later. Many thanks.