Open blhills opened 3 years ago
Hello blhills, no, it is currently not possible in the API to generate all inflected forms of a lemma. The lemma rules are in the C++ code but buried deep behind the general API. Maybe we can ask about this in the MorphoDiTa GitHub repository.
@foxik is there a part of the MorphoDiTa C++ API that generates all possible inflected forms of a lemma, or can this be easily accessed through the UDPipe C++ API?
MorphoDiTa offers such functionality https://ufal.mff.cuni.cz/morphodita/api-reference#morpho_generate , but it needs a morphological dictionary (which we have only for Czech, Slovak and English). I.e., UDPipe models do not have any idea of "valid forms for a given lemma" -- they are designed only for analysis using rules like "remove -ed" (and let the tagger choose a valid result); for generation, these rules create a lot of invalid forms for a given lemma...
Thank you Milan.
@blhills I think the easiest approach is to lemmatise your corpus of news articles and keep the resulting token/lemma combinations.
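A minimal sketch of that idea in base R, for illustration only. It assumes you already have the annotation as a data.frame with `token` and `lemma` columns, as returned by `udpipe()`. The toy rows below stand in for real corpus output, and `forms_by_lemma` and `inflections()` are hypothetical names, not part of the package.

```r
# Toy stand-in for the data.frame returned by udpipe(); in practice this
# would be the annotation of the whole corpus. The rows are illustrative only.
anno <- data.frame(
  token = c("izbori", "izbore", "izbora", "Izbori", "izborima", "vlada", "vlade"),
  lemma = c("izbor", "izbor", "izbor", "izbor", "izbor", "vlada", "vlada"),
  stringsAsFactors = FALSE
)

# Keep one row per distinct (lemma, lower-cased token) pair.
pairs <- unique(data.frame(
  lemma = anno$lemma,
  form  = tolower(anno$token),
  stringsAsFactors = FALSE
))

# Invert the mapping: lemma -> character vector of forms seen in the corpus.
forms_by_lemma <- split(pairs$form, pairs$lemma)

# Hypothetical helper: look up the attested forms of a lemma,
# returning an empty vector when the lemma never occurred.
inflections <- function(lemma) {
  found <- forms_by_lemma[[lemma]]
  if (is.null(found)) character(0) else found
}

inflections("izbor")
```

Note that this only returns forms that actually occur in your corpus, and any lemmatisation errors made by the model carry over into the lookup.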
Hmm, yeah, so just build out my own dataset of lemmas + inflections and query that dataset when I want to find the appropriate words.
It's one of those things that logically seemed pretty simple, so I thought perhaps I had overlooked a way of doing it in the package.
Anyway, thanks for the help and the great package. It is one of the best tools I have found for the work I'm doing.
Hey Jan, thanks for the awesome work. I have been using the R package to handle lemmatisation on media corpora for multiple Central and Eastern European languages. However, I am wondering if there is a way to essentially reverse the process.
so I can run this:
but what I would like is a way I can do something like
x <- udpipe(x = "izbor", object = udmodel)
and have it return the list of "izbori, izbore, izbora, izborima"
Is this possible?