Words with german umlauts are not rendered properly.

Prospero256 commented 1 year ago

If you create an entry for a word with a german umlaut, such as "Währung" or "Küste" and don't activate case sensitivity, it is rendered with the umlaut repeating itself at the end of the highlighted word: Like "Währungä" or "Küsteü" for example.

featdd commented 1 year ago

Hi @Prospero256,

thanks for the hint, I found the issue here, the problem seems to be caused by str_replace, it messes up the regex for the umlauts.

In my testing environment I've tested umlauts with the word "Äpfel" which worked because somehow when the umlaut is at the beginning. This whole construct was a workaround to solve this issue.

I don't know how PHP internally iterates the string, maybe I have to encode the replacements to prevent multiple replacements.

I may need some time to find out how to solve this properly before having another dirty workaround.

Greetings Daniel

siegtob commented 1 year ago

For what is $termHasUmlauts needed? If I set this fix to 0, my words with Umlaute will be rendered without problems. Testet it with multiple Umlaute, characters at the start of the word, in the middle, single Umlaute.

featdd commented 1 year ago

Hi @siegtob,

this was a workaround for the inability of regex to recognize case insensitivity for umlauts. It seems like there is no internal mapping for Ä = ä like for other letters A = a, so even you've set the term to be matched case insensitive it still would only match case sensitive.

The workaround should prevented this replacing the umlauts with a or matching for both like (Ä|ä), but in this case str_replace seems to iteratively replace the single characters while don't update its index position while doing so. So it breaks the regular expression causing this strange umlaut at the end of the term.

I try to find some time on the weekend to solve this while maintain the ability to case insensitive match of umlauts.

Greetings Daniel

ErHaWeb commented 1 year ago

I am also struggling with this problem at the moment. I noticed in this together that the current replacement of german umlauts is also not done quite correctly. For example, the $quotedTerm "Qualitätsmanagement" becomes string "Qualit((Ä|ä)|ä)tsmanagement". A double substitution takes place.

Instead of the following code

$quotedTerm = str_replace(
    [
        'ä',
        'Ä',
        'ö',
        'Ö',
        'ü',
        'Ü',
    ],
    [
        '(Ä|ä)',
        '(Ä|ä)',
        '(Ö|ö)',
        '(Ö|ö)',
        '(Ü|ü)',
        '(Ü|ü)',
    ],
    $quotedTerm
);

perhaps you can use the following code, which, taking into account the UTF-8 character set (unlike str_split), passes through each character in turn

$len = mb_strlen($quotedTerm, 'UTF-8');
$chars = [];
for ($i = 0; $i < $len; $i++) {
    $chars[] = mb_substr($quotedTerm, $i, 1, 'UTF-8');
}
$quotedTerm = '';
foreach($chars as $char) {
    if (in_array($char, ['Ä', 'ä'])) {
        $char = '(Ä|ä)';
    }
    if (in_array($char, ['Ö', 'ö'])) {
        $char = '(Ö|ö)';
    }
    if (in_array($char, ['Ü', 'ü'])) {
        $char = '(Ü|ü)';
    }
    $quotedTerm .= $char;
}

in Classes/Service/ParserService.php

All cases in which only one German umlaut occurs are already corrected by this. Unfortunately, the situation is different for words that contain more than one German umlaut. Here the problem remains.

Example: The word "Qualitätsmanagement" is output correctly after the revision The word "Übersetzungsverhältnis" still becomes "Übersetzungsverhältnisä" The word "ÄÖÜäöü" becomes "ÄÖÜäöüÖ"

I hope these explanations help you. Thank you for solving the problem.

featdd commented 1 year ago

Hi @ErHaWeb,

thanks I saw that too when debugging, that is the problem I already explained above, str_replace is working a bit unexpected here, replacing the already replaced umlauts with another matching group, breaking the index of the replacement and so on.

This whole thing was a dirty workaround before, I just need to find some time to solve this properly. (The only thing my mind came up with was to use some other variables for the matching groups and have like two replacements, but it feels like a shitty solution)

Greetings Daniel

ErHaWeb commented 1 year ago

Unfortunately I don't have a better idea for this problem at the moment. But hey, a crappy solution is better than none at all 😉. Maybe sometime someone™ has a brainwave how to solve this more elegantly.

featdd commented 1 year ago

Hi @Prospero256, @siegtob, @ErHaWeb,

Maybe I should get a bit more worried about my job, I asked ChatGPT and it came up with a quite easy solution 😅 I just optimized the resulted code and it seems to work, but I also would be happy to hear your feedback.

Solved here: 776d3785b262f4635591e3d30ae0deb96ee7ccb0

Greetings Daniel

ErHaWeb commented 1 year ago

Thanks for the quick solution! Terms that contain only one umlaut are now handled correctly. However, I still notice a problem with terms containing two umlauts. For example, the term "Übersetzungsverhältnis" still becomes "Übersetzungsverhältnisä". Do you have any idea how to solve this case as well?

featdd commented 1 year ago

Hi @ErHaWeb,

my bad, I missed to update the index with the amount of umlauts found. I quickly solved this but could you please verify this also works for you: 2d4a726c31a3274c2e914ba19aa0a0b6bc40d52a?

Want to double check before I upload to TER to quickly again 😄

Greetings Daniel

ErHaWeb commented 1 year ago

Hi @featdd,

now it works perfectly. Even terms with three German umlauts like "Ölüberschussländer" or "Geräteüberhöhung" are output correctly.

Thanks!

siegtob commented 1 year ago

Hi @featdd,

it works also for me - thanks a lot for your quick improvement!

Greetings, Tobi

featdd commented 1 year ago

Hi @ErHaWeb @siegtob,

thanks for the feedback, just pushed another update into TER.

Greetings Daniel

featdd / dpn_glossary

Words with german umlauts are not rendered properly. #193