adbar / simplemma

Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
https://adrien.barbaresi.eu/blog/simple-multilingual-lemmatizer-python.html
MIT License

Add a compound splitting strategy to improve on affix decomposition #141

Open juanjoDiaz opened 3 months ago

juanjoDiaz commented 3 months ago

Hi @adbar ,

Recently I started noticing that some inflected words are not correctly lemmatized. However, when adding German to the list of languages processed by the affix decomposition strategy, most of these are solved.

Here are some examples:

| Word | Real Lemma | Simplemma Lemma | Simplemma Lemma with affix strategy for German |
|---|---|---|---|
| Motorschütz | Motorschütz | Motorschütz ✅ | Motorschütz ✅ |
| Motorschütze | Motorschütz | Motorschütze ❌ | Motorschütz ✅ |
| Motorschützes | Motorschütz | Motorschützes ❌ | Motorschütz ✅ |
| Motorschützen | Motorschütz | Motorschützen ❌ | Motorschützen ❌ |
| Motorschützüberwachung | Motorschützüberwachung | Motorschützüberwachung ✅ | Motorschützüberwachung ✅ |
| Motorschützüberwachungen | Motorschützüberwachung | Motorschützüberwachung ✅ | Motorschützüberwachung ✅ |
| Distanzstück | Distanzstück | Distanzstück ✅ | Distanzstück ✅ |
| Distanzstücke | Distanzstück | Distanzstücke ❌ | Distanzstück ✅ |
| Distanzstücks | Distanzstück | Distanzstücks ❌ | Distanzstück ✅ |
| Distanzstücken | Distanzstück | Distanzstücken ❌ | Distanzstück ✅ |
| Durchgangsprüfung | Durchgangsprüfung | Durchgangsprüfung ✅ | Durchgangsprüfung ✅ |
| Durchgangsprüfungen | Durchgangsprüfung | Durchgangsprüfung ✅ | Durchgangsprüfung ✅ |
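
A minimal snippet to reproduce the plain Simplemma column (assuming the current `simplemma.lemmatize` API; exact outputs depend on the installed language data):

```python
import simplemma

# Without the affix strategy enabled for German, inflected compounds
# such as these pass through unchanged instead of being reduced.
for word in ["Motorschütze", "Distanzstücke", "Durchgangsprüfungen"]:
    print(word, "->", simplemma.lemmatize(word, lang="de"))
```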

Adding the affix strategy for German does increase the execution time a bit but doesn't change the precision numbers when running the evaluation script against the latest UD treebanks. But, to be honest, even removing all the rules and keeping only the dictionary lookup barely changes the evaluation results 😅

So, the questions are: Why is German not included in the affix search? Should we include it?

adbar commented 3 months ago

These are just German examples, so the affix search works there; I wouldn't add it for other languages. Or maybe I don't understand the question?

Depending on the language, the UD data does not include a lot of variation; real-world scenarios are different, and UD is just a way to run tests. Languages like German or Finnish will have many more words outside of the dictionary than, for example, French or Spanish.

juanjoDiaz commented 3 months ago

My questions were: why is German not included in the affix search by default, as some other languages are? Should we include it?

adbar commented 3 months ago

I see! There are already rules for German; I guess affixes are not included because they would harm precision, but I'll check again.

adbar commented 3 months ago

Performance is degraded in German for non-greedy searches, and the affixes don't improve the greedy mode. They also slow things down a bit. So I would be against it and in favor of adding more rules if necessary.

juanjoDiaz commented 3 months ago

What dataset are you using to measure performance? I measured with the evaluation script and the latest UD treebanks as described in the readme, but the results were exactly the same whether the affix search was used or not.

What kind of rules do you think would help with my examples?

adbar commented 3 months ago

I see a difference when I add "de" to AFFIX_LANGS and lemmatize with greedy=False, the rest is the same.

I checked again; your examples rather hint at a failing decomposition into subwords. "Distanz" and "Stück" are in the language data, so it should be possible to break the token apart and find the right ending. I'm not sure why it fails here.
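
To reproduce the difference, a comparison of both modes (assuming "de" has been added to AFFIX_LANGS in a local checkout):

```python
import simplemma

# Compare non-greedy and greedy lookups on one of the failing compounds;
# with "de" in AFFIX_LANGS, the non-greedy result is the one that changes.
for greedy in (False, True):
    print(greedy, simplemma.lemmatize("Distanzstücke", lang="de", greedy=greedy))
```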

adbar commented 3 months ago

My best guess is that it's just because the decomposition strategy is missing: there can be several ways to break words apart. That being said, it could be worth implementing the following (a rough sketch follows the list):

  1. Start from the end until a valid subword is found
  2. See if the other part of the token is in the dictionary
  3. Apply the lemmatization to the identified subword at the end
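
A rough Python sketch of these steps (hypothetical helper names `in_dictionary` and `lemmatize_subword` standing in for dictionary lookup and lemmatization against the loaded language data, not actual simplemma internals):

```python
def split_and_lemmatize(token, in_dictionary, lemmatize_subword):
    # 1. Start from the end of the token until a valid subword is found
    for i in range(len(token) - 1, 0, -1):
        head, tail = token[:i], token[i:]
        if in_dictionary(tail):
            # 2. Check that the remaining part is also in the dictionary
            if in_dictionary(head):
                # 3. Lemmatize only the identified subword at the end
                return head + lemmatize_subword(tail)
    return None  # no valid decomposition found
```
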
juanjoDiaz commented 3 months ago

Isn't that how the affix decomposition strategy works already?

The problem is that the strategy is not applied to German. I could simply solve this by adding German to the list of languages that use the affix decomposition strategy.

The question is also: why not enable the strategy for all languages?

adbar commented 3 months ago

Because it degrades performance on the benchmark, at least in my experiments. Currently the script only evaluates accuracy; it could look different with an F-score, but I'm not sure.

In German, the morphology of compound words can be complex, with a linking element of up to 3 characters (often 0 or 1) between the parts of a compound, e.g. the "s" in Durchgangsprüfung (Durchgang + s + Prüfung). I guess it's even trickier for other morphologically rich languages. So the approach used for affixes reaches its limits there and does not guarantee error-free decomposition.

My best guess is that the method needed to solve your cases is a compound splitting strategy. It is not explicitly included in Simplemma (only indirectly, through the affix decomposition) but it would be a nice addition to strategies/.

It would be the same idea as for the affixes but with a further search until two or more valid parts (i.e. dictionary words) are found. We would need to specify how many characters are allowed between the components; I suggest determining that empirically, by testing.
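
A hypothetical sketch of such a strategy (again with an assumed `in_dictionary` helper; the bound on linking characters is a guess to be tuned by testing, and case handling for German nouns is left out for brevity):

```python
MAX_LINK = 3  # assumed upper bound for linking characters, to be validated empirically

def split_compound(token, in_dictionary):
    """Try to split token into head + linking element + tail,
    where both head and tail are dictionary words."""
    for i in range(len(token) - 1, 1, -1):
        tail = token[i:]
        if not in_dictionary(tail):
            continue
        # allow 0 to MAX_LINK linking characters between the two components
        for link_len in range(MAX_LINK + 1):
            split = i - link_len
            if split <= 1:
                break
            head = token[:split]
            if in_dictionary(head):
                return head, token[split:i], tail
    return None  # no valid split found
```

For "Durchgangsprüfungen" this could yield ("Durchgang", "s", "prüfungen"), after which the tail would be lemmatized as in the affix search.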

juanjoDiaz commented 3 months ago

I see. How could we test such a new strategy if these compound words are not present in the UD treebanks? (If they were, precision would improve when adding the affix search.)

adbar commented 3 months ago

Yes, evaluating performance on rare words is an issue. We can implement the additional strategy and let users decide whether to use it.