juanjoDiaz opened this issue 3 months ago
These are just German examples, so the affix search works; I wouldn't add it for other languages, or maybe I don't understand the question?
Depending on the language, the UD data does not include a lot of variation; real-world scenarios are different, and UD is just a way to run tests. Languages like German or Finnish will have many more words outside of the dictionary than French or Spanish, for example.
My questions are: Why is German not included in the affix search by default, like some other languages? Should we include it?
I see! There are already rules for German; I guess affixes are not included because they would harm precision, but I'll check again.
The performance is degraded in German for non-greedy searches, and the affixes don't improve the greedy mode. They also slow things down a bit. So I would be against it and in favor of adding more rules if necessary.
What dataset are you using to measure performance? I measure with the evaluation script and the latest UD Treebanks as defined in the readme, but the results were exactly the same whether affix was used or not.
What kind of rules do you think would help with my examples?
I see a difference when I add `de` to `AFFIX_LANGS` and lemmatize with `greedy=False`; the rest is the same.
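For reference, a minimal sketch of this kind of comparison using Simplemma's public `lemmatize` function. It assumes a local checkout where `de` has been added to `AFFIX_LANGS`, and `Distanzstücke` is only an illustrative compound built from the words mentioned below, not one of the original examples:

```python
# Sketch only: assumes a local Simplemma checkout where "de" has been
# added to AFFIX_LANGS so that the affix search is active for German.
import simplemma

# Illustrative compound, not one of the original examples.
token = "Distanzstücke"

# Non-greedy mode, where the difference shows up.
print(simplemma.lemmatize(token, lang="de", greedy=False))
# Greedy mode, which stays the same with or without the affix search.
print(simplemma.lemmatize(token, lang="de", greedy=True))
```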
I checked again; your examples rather hint at a failing decomposition into subwords. "Distanz" and "Stück" are in the language data, so it should be possible to break the token apart and find the right ending. I'm not sure why it happens here.
My best guess is that it's just because the decomposition strategy is missing; there can be several ways to break words apart. That being said, it could be worth implementing this:
Isn't that how the affix decomposition strategy works already?
The problem is that the strategy is not applied to German. I could simply solve this by adding German to the list of languages that use the affix decomposition strategy.
The question is also: why not enable the strategy for all languages?
Because it degrades performance on the benchmark, at least in my experiments. Currently the script only evaluates accuracy; it could look different with an F-score, but I'm not sure.
In German, the morphology of compound words can be complex, with (often 0 or 1, but) up to 3 characters between the parts of a compound. I guess it's even trickier for other morphologically rich languages. So the approach used for affixes reaches its limits there and does not entail error-free decomposition.
My best guess is that the method needed to solve your cases is a compound splitting strategy. It is not explicitly included in Simplemma (only indirectly through the affix decomposition), but it would be a nice addition to `strategies/`.
It would be the same idea as for the affixes, but with a further search until two or more valid parts (i.e. dictionary words) are found. We would need to specify how many characters are allowed between the components; I suggest doing it empirically, by testing. A rough sketch is below.
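A rough sketch of what such a compound splitting strategy could look like (this is not Simplemma code: `lookup` stands in for a dictionary lookup against the language data, and the linking elements and minimum part length are guesses that would have to be tuned empirically, as suggested above):

```python
# Sketch of a possible compound splitting strategy, not actual Simplemma code.
# `lookup` stands in for a dictionary lookup returning a lemma or None; the
# candidate linking elements ("Fugenelemente", up to 3 chars per the discussion
# above) and the minimum part length are starting guesses to be tuned by testing.
from typing import Callable, Optional

LINKING_ELEMENTS = ("", "s", "es", "n", "en", "er")


def split_compound(
    token: str,
    lookup: Callable[[str], Optional[str]],
    min_part_len: int = 4,
) -> Optional[str]:
    """Try to split `token` into two dictionary words and lemmatize the head."""
    for i in range(min_part_len, len(token) - min_part_len + 1):
        first, rest = token[:i], token[i:]
        if lookup(first) is None:
            continue
        # Allow a few linking characters between the two components.
        for link in LINKING_ELEMENTS:
            if not rest.startswith(link):
                continue
            head = rest[len(link):]
            if len(head) < min_part_len:
                continue
            # German nouns are capitalized in the dictionary, compound heads are not.
            head_lemma = lookup(head) or lookup(head.capitalize())
            if head_lemma is not None:
                # Keep the first component as written, lemmatize only the head.
                return first + head_lemma.lower()
    return None


# Example with a toy dictionary in place of the real language data:
toy_dict = {"Distanz": "Distanz", "Stücke": "Stück", "Stück": "Stück"}
print(split_compound("Distanzstücke", toy_dict.get))  # returns "Distanzstück"
```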
I see. How could we try such a new strategy if these compound words are not present in the UD Treebanks? (If they were, the precision would improve when adding the affix search.)
Yes, evaluating performance on rare words is an issue. We can implement the additional strategy and let the users decide if they want to use it.
Hi @adbar,
Recently I started noticing that some inflected words are not correctly lemmatized. However, when adding German to the list of languages that are processed by the affix decomposition strategy, most of these are solved.
Here are some examples:
Adding the affix strategy to German does increase the execution time a bit, but it doesn't change the precision numbers when executing the evaluation script against the latest UD treebanks. But, to be honest, even removing all the rules and only keeping the dictionary lookup barely changes the evaluation results 😅
So, the questions are: Why is German not included in the affix search? Should we include it?