Slugify and diacritics - Githubissues

arkadiuszwojcik commented 4 years ago

As I noticed, using slugify in recipes is quite often but there is one problem with it. Current implementation for sake of simplicity removes all diacritics from given string, but In many languages we whould like to transform them to other characters. For example in Polish: 'ę' -> 'e', 'ś' -> 's' etc. In CMS's often articles titles are slugify to form url but without proper mapped diacritics those urls looks odd. Basically it would be nice to have something simillar to: Diacritics.NET. Such code/service could be part of localisation module I guess.

hishamco commented 4 years ago

I think @neman mentioned something similar before. Am I right?

/cc @sebastienros

hishamco commented 4 years ago

This will be a nice feature - that can set on top of content localization module - for future releases if the team agrees

/cc @jptissot

neman commented 4 years ago

Yes, i mentioned #4475 where each Cyrillic char corresponds to Latin char, and it's called transliteration. It's similar but not the same, what @arkadiuszwojcik mention here. For example Ч transliterates to Č But Č can be represented as C without diacritic. Also Ћ transliterates to Ć which might be presented as C without diacritic But Ђ transliterates to Đ, which is represented as DJ (not D) So this what @arkadiuszwojcik wrote is this second level, and in Serbia we call it Cropped Latin (in polish przycięty łacina, and in serbian ošišana latinica)

Trivia Because one charachter can represent two letters there could be some words with different meanings if we clean diacritics. For example šišanje -> sisanje. First word means haircut, second means suck.

Skrypt commented 4 years ago

I think the idea is to make the url's compatible with older browsers. The idea is to avoid having HTML encoded chars like Space as '%20' or '&' as '%26'. So avoiding using diacritics in url's has always been made that way so far for that main reason and also it is easier for search engines to index with slugified url's.

sebastienros commented 4 years ago

I think it's fine to replace diacritics "by default" instead of removing them like in the current implementation. See SlugService.cs.

And maybe we should also have another service/filter like slugify that does transliteration.

hishamco commented 4 years ago

Seb I will close this one, coz you already created this one https://github.com/OrchardCMS/OrchardCore/issues/4151

arkadiuszwojcik commented 4 years ago

@hishamco is it actually the same? Looks like @sebastienros meant here 2 things. One is to include Cropped Latin as part of SlugService (this is what this issue is about) and another is to have sepparate filter for transliteration.

hishamco commented 4 years ago

Reopening this

arkadiuszwojcik commented 3 years ago

Options to explore:

Diacritics.NET
diacritical-dotnet MIT license
ICU4N or ICU4N.Transliterator see StackOverflow example
.NET 5 switched to ICU on Windows - does it help somehow?
Diacritics database project

hishamco commented 2 years ago

I think it's fine to replace diacritics "by default" instead of removing them like in the current implementation.

The current implementation replace diacritics, so 'ę' -> 'e', 'ś' -> 's' and so on

I created a unit test in #10921, @arkadiuszwojcik is there anything missing or did I missunderstood you? If Yes please break the unit test that I made, then I can see what you mean

hishamco commented 2 years ago

Seems it's already there:

https://github.com/OrchardCMS/OrchardCore/blob/1898095d0bef03a04668830bc75f5523dd5a249f/test/OrchardCore.Tests/Tokens.Content/SlugServiceTests.cs#L53-L58

arkadiuszwojcik commented 2 years ago

@hishamco - I will perform some tests today and let you know.

arkadiuszwojcik commented 2 years ago

@hishamco I can confirm that current solution works for me but with one exception. This is mapping table I was using in my solution:

Dictionary<char, char> mapping = new Dictionary<char, char>
        {
            { 'Ą', 'A' }, {'ą', 'a'},
            { 'Ć', 'C' }, {'ć', 'c'},
            { 'Ę', 'E' }, {'ę', 'e'},
            { 'Ł', 'L' }, {'ł', 'l'},
            { 'Ń', 'N' }, {'ń', 'n'},
            { 'Ó', 'O' }, {'ó', 'o'},
            { 'Ś', 'S' }, {'ś', 's'},
            { 'Ź', 'Z' }, {'ź', 'z'},
            { 'Ż', 'Z' }, {'ż', 'z'},
        };

Right now slugify do exactly same work but not for case: 'Ł' and 'ł' both translates to 'ł' instead of 'l'. So for code:

{{ "ĄĆĘŁŃÓŚŹŻąćęłńśóźż" | slugify }}

I get:

acełnoszzacełnsozz

For other languages there might be more issues/exceptions like that. In past I remember letters with diacritics were removed completly from string so for sure there is some improvment now.

arkadiuszwojcik commented 2 years ago

On Stackoverflow I found similar discussion: https://stackoverflow.com/questions/42645854/normalization-misses-polish-characters/51230541 Looks like there are some exceptions and not all characters can be normalized by unicode decomposition transformations. More on this here: http://zderadicka.eu/removing-diacritics-marks-from-strings/

hishamco commented 2 years ago

Thanks @arkadiuszwojcik for your testing, I will have a look to the links above,. Frankly I'm not sure if there's an exception the normalization method. I will reopen my PR with th example that you mentioned and trying to fix the issue

For other languages there might be more issues/exceptions like that. In past I remember letters with diacritics were removed completly from string so for sure there is some improvment now.

If I'm not wrong @jtkech did that, but may be the comment "Remove diacritics" in the code confuses us ;)

hishamco commented 2 years ago

According to the first link ł and Ł are expections and should be removed, Seems that decomposition not takes place for them, if they are the only exceptions this could be done in code, or we can make a list of exceptions that the use can add to them

arkadiuszwojcik commented 2 years ago

@hishamco - according to second link there is few languages with such exceptions. I wonder if it would be elegant solution to assume string normalization and later on search for exceptions in some predefined list. Other solution would require use of some diacritics library. Anyway looks that any solution around this problem should land in Localization module?

hishamco commented 2 years ago

Other solution would require use of some diacritics library.

We could, but for now I'm trying to make things simple. BTW if there's new APIs it will belong to OC.Localization not a module, but all the feedback are welcome

arkadiuszwojcik commented 2 years ago

We could, but for now I'm trying to make things simple. BTW if there's new APIs it will belong to OC.Localization not a module, but all the feedback are welcome

So to keep your solution simple I would ignore this issue for now and keep it open. Ultimate solution would require OC,Localization accents/diacritics lookup tables for various languages.

hishamco commented 2 years ago

IMHO we should provide an abstractions and leave the specific accent implementation to the dev. I'm working on a prototype in Orchard Core Contrib (OCC) that I will share once it's ready

hishamco commented 2 years ago

@arkadiuszwojcik I added Diacritics support with custom accent mapper in OCC, please check my PR https://github.com/OrchardCoreContrib/OrchardCoreContrib/pull/5 and more specifically the unit tests for PolishAccentMapper

/cc @neman

arkadiuszwojcik commented 2 years ago

@hishamco looks very good. It maps char to string as most solutions I saw in the wild.

hishamco commented 2 years ago

It was char to char then I realize some characters maps to more than one

hishamco commented 2 years ago

@arkadiuszwojcik is the solutions I provided in OCC suited for what you asked for? If YES we may bring this to OC

arkadiuszwojcik commented 2 years ago

@hishamco yes it is good solution, just one thing: can it be part of slugify operation (at least optional)?

hishamco commented 2 years ago

@arkadiuszwojcik could you please file an issue in OCC repo, then we could arrange with Seb if the provided solution suited to be move here

arkadiuszwojcik commented 2 years ago

@hishamco like this: OrchardCoreContrib/OrchardCoreContrib#11 ?

hishamco commented 2 years ago

@sebastienros Is it fine to support Diacritics as first class citizen in OC.Localization, then we can easily tweak SlugifyService after #11491?

If sound is good I will push the first PR tomorrow

OrchardCMS / OrchardCore

Slugify and diacritics #4874