label:enhancement uzbek lemma in search engines

AzamkhonKh commented 2 years ago

Hi, your research looks good, but it is poor in data. I have a kind of pet project for finding semantically similar phrases in data, and I am struggling to get a lemma dataset for the Uzbek language. Could you help me solve this problem, or do you have any suggestions on how to do it?

I was looking at the Manticore search engine with a stemming algorithm, but there is no Uzbek library for it.

Write to me if I can be helpful.

you can reach me by: mail: azamkhon.kh@gmail.com telegram: azamkhon_kh

elmurod1202 commented 2 years ago

Assalomu alaykum Azamxon,

First of all, thanks for your interest in this humble work of ours; we'd be more than happy to help you in any way we can.

As far as I understand, you need a stemming tool for Uzbek, right? If so, I would suggest the Apertium tool with the Uzbek monolingual package. For the time being, it is the best way to do morphological analysis for Uzbek, and you can easily turn it into a stemmer as well. The git repo is [here](https://github.com/apertium/apertium-uzb). Installation and usage are well explained in the repo itself. Instructions for making it a stemming tool:

You can use the --uzb-morph flag to get a morphological analysis of a given text in Uzbek, but the problem is that it sometimes produces more than one result per word, since a word can have more than one possible analysis (the ambiguity problem);
A better solution is to use the "--uzb-tagger" flag, so that the tool applies its own morphological disambiguation and produces the SINGLE best option for the given text;
The rest is as simple as string splitting: the output string contains the stem of each word with its morphemes separated, so you can split it into any shape you want.
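
The splitting step could be sketched in Python as follows. This is only a minimal sketch: it assumes the tool's output follows the usual Apertium stream format, where each word is wrapped as `^surface/lemma<tag>...$` (or `^lemma<tag>...$` without the surface form). The sample string and its tags are made up for illustration and are not real output of the Uzbek package; `extract_lemmas` is a hypothetical helper name.

```python
import re

# One lexical unit in (assumed) Apertium stream format: everything between ^ and $
LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")

def extract_lemmas(stream):
    """Return the lemma of the first analysis of every lexical unit."""
    lemmas = []
    for unit in LEXICAL_UNIT.findall(stream):
        parts = unit.split("/")
        # If a surface form is present, the first analysis is the second field
        analysis = parts[1] if len(parts) > 1 else parts[0]
        if analysis.startswith("*"):
            # Unknown word: keep the surface form, drop the '*' marker
            lemmas.append(analysis[1:])
            continue
        # The lemma is everything before the first <tag>
        lemmas.append(analysis.split("<", 1)[0])
    return lemmas

# Hypothetical sample; the tags are illustrative, not taken from apertium-uzb
sample = "^kitoblarni/kitob<n><pl><acc>$ ^o'qidim/o'qi<v><tv><past><p1><sg>$"
print(extract_lemmas(sample))
```

If the real output uses a different layout, only the regular expression and the field index should need adjusting.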

If you are just looking for a list of Uzbek lemmas, the file "apertium-uzb.uzb.lexc" in the repo contains more than 60 000 lemmas (they start somewhere around 200 lines into the file), and it is not hard to separate them out.
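
A rough way to pull the lemmas out of a lexc file could look like the sketch below. It assumes that stem entries have the common lexc shape `lemma:stem ContinuationClass ;` and that affix entries carry `%<tag%>` symbols on the left-hand side. The sample text, lexicon names and continuation classes are invented for illustration, so the pattern may need adjusting against the real file; entries written without a colon are not handled.

```python
import re

# One lexc entry of the (assumed) form "upper:lower Continuation ;"
ENTRY = re.compile(r"^(?P<upper>[^\s:!]+):(?P<lower>\S+)\s+(?P<cont>\S+)\s*;")

def lemmas_from_lexc(text):
    """Collect the left-hand side of every stem-like entry."""
    lemmas = set()
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if not match:
            continue  # LEXICON headers, comments, blank lines
        upper = match.group("upper")
        if "%<" in upper:
            continue  # tag symbols on the upper side: an affix, not a stem
        lemmas.add(upper)
    return sorted(lemmas)

# Invented sample in lexc style; not copied from apertium-uzb.uzb.lexc
sample = """\
LEXICON Nouns
kitob:kitob N1 ; ! "book"
maktab:maktab N1 ; ! "school"
LEXICON PluralSuffix
%<pl%>:%>lar CASES ;
"""
print(lemmas_from_lexc(sample))
```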

Best of luck, and let us know if we were helpful. Regards, Elmurod

AzamkhonKh commented 2 years ago

Well, thank you for the quick response. I was looking for a plug-and-play solution, like the Ukrainian lemmatizer in Manticore. Your solution requires some time to research this field and do some work (it is an interesting field for me), but if I build a solution with Apertium, I would have to write a kind of library for a real-time analyser (with a script duration of 3 seconds at most), and I guess I would not finish it in 2 months.

P.S.: I did not mention earlier that I am a newbie in this area and I do not know how things are done. The core problem is to find semantically equivalent phrases in sentences like

"Tinchlik paytda", men juda baxtli edim. ("In peacetime", I was very happy.) and "Sokin davrda", bolalar ko'chada koptok o'ynashar edi. ("In a calm period", children used to play ball in the street.)

and to extract that kind of sentence from comments or any other kind of text.

elmurod1202 commented 2 years ago

Dear Azamxon, we assume you are interested in the field, and I guess that you are doing this more out of necessity than out of mere curiosity. First of all, to answer your questions:

Sadly, there is (as far as we know) no simple plug-and-play solution for a stemmer. But the good news is that Apertium is as easy to use as the Ukrainian Manticore setup you mentioned. Since you seem to have working experience with Linux-like OSes, you won't find it hard to install Apertium on your Ubuntu/Debian system. The tool itself works through a command-line interface (Terminal/CLI/... whatever you want to call it). The instructions for installing and running Apertium are here: https://wiki.apertium.org/wiki/Installation .
Once you have Apertium installed and the Uzbek package downloaded (or cloned from git), you can easily get Uzbek sentences tagged, and the output can be piped to a simple bash script (one or two lines at most) to extract the stems of the given text (so you basically build the stemmer with a single line of bash).
Now to the rest of what you said: could you explain why exactly you are trying to find semantically similar phrases in Uzbek? Do you work in the private sector, or are you a researcher like us? Either way, if you tell us more about yourself, it will be easier for us to decide how to explain the rest: how to reach the goal the fastest way (without explaining everything), or with more scientific background for every step you follow (if you are doing research).

Furthermore, the task you are describing is called "Automatic identification and extraction of semantically similar words and phrases", and please note that this is a totally different kind of work from what you can expect in this repo. Some suggested steps to make your idea work:

Extract a corpus, or use the available one;
Tokenize the corpus into chunks of the size you want;
Split texts into word n-grams (1,2,3 or even 4 word grams would be best for your case);
Create word n-gram embeddings;
Compute the cosine similarity between all word n-grams and decide on an acceptance threshold;
Output those phrases whose cosine similarity is above that threshold;
Voila, you have created a semantically similar phrase extractor.
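
The steps above can be sketched in plain Python. Everything specific here is an assumption made for illustration: the word vectors are tiny invented numbers rather than trained embeddings, an n-gram embedding is taken to be the average of its word vectors (one simple option among many), and the 0.95 threshold is arbitrary.

```python
import math

# Invented toy vectors; real ones would come from a model trained on an Uzbek corpus
TOY_VECTORS = {
    "tinchlik": [0.9, 0.1, 0.0],
    "sokin": [0.8, 0.2, 0.1],
    "paytda": [0.1, 0.9, 0.0],
    "davrda": [0.2, 0.8, 0.1],
    "koptok": [0.0, 0.1, 0.9],
    "o'ynashar": [0.1, 0.0, 0.8],
}

def ngrams(tokens, n):
    """All contiguous word n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def embed(ngram):
    """Average the word vectors of an n-gram (an assumed, simple composition)."""
    vectors = [TOY_VECTORS[word] for word in ngram]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similar_phrases(tokens_a, tokens_b, n=2, threshold=0.95):
    """Pairs of n-grams whose cosine similarity reaches the threshold."""
    pairs = []
    for gram_a in ngrams(tokens_a, n):
        for gram_b in ngrams(tokens_b, n):
            score = cosine(embed(gram_a), embed(gram_b))
            if score >= threshold:
                pairs.append((" ".join(gram_a), " ".join(gram_b), score))
    return pairs

first = ["tinchlik", "paytda"]
second = ["sokin", "davrda", "koptok", "o'ynashar"]
for phrase_a, phrase_b, score in similar_phrases(first, second):
    print(phrase_a, "~", phrase_b, round(score, 3))
```

With these toy vectors only "tinchlik paytda" and "sokin davrda" pass the threshold. On real data the threshold would have to be tuned by inspecting the output, and words missing from the vocabulary would need separate handling.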

To conclude, when you judge someone's work, please don't rush to call it "POOR IN DATA". If we had included the data you are looking for, we would have made it a separate repo. I hope we were helpful.

We would like to hear more from you.
You can write me directly if you want. My telegram number: +34 698 37 41 59 (Elmurod)

Best of luck

AzamkhonKh commented 2 years ago

Thank you for your clarifications and suggestions. I think I understand what to do next, so I am closing this issue.

About the judgement: I apologise if I offended anyone; that was not my intention. By "poor in data" I meant in comparison with the English lemma set I found (1 962 vs 84 487).

I understand that reaching that number is a large amount of work, and I am glad to see this kind of work in open source.

I hope you understand me correctly.

Best regards, Azamkhon.

UlugbekSalaev / SimRelUz

label:enhancement uzbek lemma in search engines #3