bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
https://bnosac.github.io/udpipe/en
Mozilla Public License 2.0

Reverse Lemmatisation? #89

Open blhills opened 3 years ago

blhills commented 3 years ago

Hey Jan, thanks for the awesome work. I've been using the R package to handle lemmatisation on media corpora for multiple Central and Eastern European languages. However, I am wondering if there is a way to essentially reverse the process.

So I can run this:

library(udpipe)

udmodel <- udpipe_download_model(language = "croatian")

x <- udpipe(x = "izbori izbore izbora izborima", object = udmodel)

x

  doc_id paragraph_id sentence_id                      sentence start end term_id token_id    token lemma upos  xpos
1   doc1            1           1 izbori izbore izbora izborima     1   6       1        1   izbori izbor VERB Vmr3s
2   doc1            1           1 izbori izbore izbora izborima     8  13       2        2   izbore izbor NOUN Ncmpa
3   doc1            1           1 izbori izbore izbora izborima    15  20       3        3   izbora izbor NOUN Ncmpg
4   doc1            1           1 izbori izbore izbora izborima    22  29       4        4 izborima izbor NOUN Ncmpd

But what I would like is a way to do something like

x <- udpipe(x = "izbor", object = udmodel)

and have it return the list of "izbori, izbore, izbora, izborima"

Is this possible?

jwijffels commented 3 years ago

Hello blhills, no, the API currently offers no way to generate all inflected forms of a lemma. The lemma rules are in the C++ code, but buried deep behind the general API. Maybe we can ask about this in the MorphoDiTa GitHub repository.

jwijffels commented 3 years ago

@foxik is there a part of the MorphoDiTa C++ API that allows generating all possible inflected forms of a lemma, or can this easily be accessed through the UDPipe C++ API?

foxik commented 3 years ago

MorphoDiTa offers such functionality https://ufal.mff.cuni.cz/morphodita/api-reference#morpho_generate , but it needs a morphological dictionary (which we have only for Czech, Slovak and English). I.e., UDPipe models do not have any idea of "valid forms for a given lemma" -- they are designed only for analysis using rules like "remove -ed" (and let the tagger choose a valid result); for generation, these rules create a lot of invalid forms for a given lemma...
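To illustrate why analysis rules do not simply run in reverse, here is a toy sketch in R. The suffix list and the helpers `analyse()` and `generate()` are made up for this illustration and are not UDPipe's actual rule format or API. In the analysis direction a rule only fires when the token really ends in that suffix, and a tagger can pick among the few candidates. In the generation direction nothing restricts which suffix applies to which lemma.

```r
# Hypothetical suffix rules, standing in for rules like "remove -ed".
suffixes <- c("ed", "ing", "s")

# Analysis: strip a suffix only when the token actually ends with it.
# The token itself is kept as a candidate lemma as well.
analyse <- function(token) {
  hits <- suffixes[endsWith(token, suffixes)]
  stems <- vapply(
    hits,
    function(s) substr(token, 1, nchar(token) - nchar(s)),
    character(1),
    USE.NAMES = FALSE
  )
  unique(c(token, stems))
}

# Naive generation: invert every rule by appending each suffix.
# Without a dictionary there is no way to tell which results are real words.
generate <- function(lemma) paste0(lemma, suffixes)

analyse("walked")  # candidates: "walked", "walk"
generate("go")     # "goed" "going" "gos"
```

Of the three generated strings only "going" is a valid form of "go", which is the kind of over-generation described above.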

jwijffels commented 3 years ago

Thank you Milan.

@blhills I think the easiest approach is to lemmatise your corpus of news articles and keep the resulting token/lemma combinations.
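A minimal sketch of that approach in base R, assuming the annotation is a data.frame with `token` and `lemma` columns like the output of `udpipe()` shown earlier in this thread. The helper names `build_inflection_lookup()` and `forms_of()` are hypothetical and not part of the udpipe package, and the small `anno` data.frame is a hand-made stand-in for real annotation output.

```r
# Build a lemma -> observed forms lookup from udpipe-style annotation output.
build_inflection_lookup <- function(anno) {
  stopifnot(all(c("token", "lemma") %in% names(anno)))
  # Drop rows with a missing token or lemma; lower-case both so that
  # sentence-initial capitalisation does not create duplicate forms.
  keep  <- !is.na(anno$token) & !is.na(anno$lemma)
  token <- tolower(anno$token[keep])
  lemma <- tolower(anno$lemma[keep])
  # split() groups the tokens by lemma; each group is de-duplicated and sorted.
  lapply(split(token, lemma), function(forms) sort(unique(forms)))
}

# Look up the forms seen for one lemma; unknown lemmas give character(0).
forms_of <- function(lookup, lemma) {
  forms <- lookup[[tolower(lemma)]]
  if (is.null(forms)) character(0) else forms
}

# Stand-in for annotation output (token/lemma pairs from this thread,
# with one repeated token to show de-duplication).
anno <- data.frame(
  token = c("izbori", "izbore", "izbora", "izborima", "izbore"),
  lemma = c("izbor", "izbor", "izbor", "izbor", "izbor"),
  stringsAsFactors = FALSE
)

lookup <- build_inflection_lookup(anno)
forms_of(lookup, "izbor")
```

In practice `anno` would be the result of annotating the whole corpus. The lookup only contains forms that actually occur in that corpus, and it inherits any lemmatisation or tagging errors made by the model.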

blhills commented 3 years ago

Hmm, yeah, so just build out my own dataset of lemmas + inflections and query that dataset when I want to find the appropriate words.

It's one of those things that logically seemed pretty simple, so I thought perhaps I had overlooked a way of doing it in the package.

Anyway, thanks for the help and the great package. It is one of the best tools I have found for the work I'm doing.