Open arkadiuszwojcik opened 4 years ago
I think @neman mentioned something similar before. Am I right?
/cc @sebastienros
This will be a nice feature - one that can sit on top of the content localization module - for future releases, if the team agrees
/cc @jptissot
Yes, I mentioned #4475, where each Cyrillic character corresponds to a Latin character; that is called transliteration. It is similar to, but not the same as, what @arkadiuszwojcik mentions here. For example, Ч transliterates to Č, but Č can be represented as C without the diacritic. Likewise, Ћ transliterates to Ć, which can be presented as C without the diacritic. But Ђ transliterates to Đ, which is represented as DJ (not D). So what @arkadiuszwojcik wrote is this second level, and in Serbia we call it Cropped Latin (in Polish przycięty łacina, and in Serbian ošišana latinica).
Trivia: because one character can represent two letters, some words take on a different meaning once the diacritics are stripped. For example šišanje -> sisanje. The first word means haircut, the second means suck.
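To make the two levels concrete, here is a minimal sketch that treats transliteration and "cropping" as two separate lookup tables. The class and method names are hypothetical, and the tables contain only the letters mentioned in this comment, not a complete Serbian alphabet.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical sketch; not part of Orchard Core.
public static class SerbianLatinSketch
{
    // Level 1: Cyrillic -> Latin with diacritics (transliteration).
    private static readonly Dictionary<char, string> CyrillicToLatin = new Dictionary<char, string>
    {
        { 'Ч', "Č" }, { 'Ћ', "Ć" }, { 'Ђ', "Đ" },
    };

    // Level 2: Latin with diacritics -> "cropped" Latin.
    // Note that Đ becomes two characters, so the value type must be string.
    private static readonly Dictionary<char, string> LatinToCropped = new Dictionary<char, string>
    {
        { 'Č', "C" }, { 'Ć', "C" }, { 'Đ', "DJ" }, { 'š', "s" },
    };

    // Replaces every character found in the table and keeps the rest unchanged.
    private static string Map(string input, Dictionary<char, string> table)
    {
        var sb = new StringBuilder(input.Length);
        foreach (var c in input)
        {
            sb.Append(table.TryGetValue(c, out var replacement) ? replacement : c.ToString());
        }
        return sb.ToString();
    }

    public static string Transliterate(string input) => Map(input, CyrillicToLatin);

    public static string Crop(string input) => Map(input, LatinToCropped);
}
```

With these tables, SerbianLatinSketch.Crop(SerbianLatinSketch.Transliterate("ЧЋЂ")) gives "CCDJ", and SerbianLatinSketch.Crop("šišanje") gives "sisanje", which is the collision described in the trivia.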
I think the idea is to make the URLs compatible with older browsers, and to avoid percent-encoded characters such as a space as '%20' or '&' as '%26'. Avoiding diacritics in URLs has always been done for that main reason, and slugified URLs are also easier for search engines to index.
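As a small illustration of the encoding being avoided, the sketch below wraps the standard Uri.EscapeDataString call. The wrapper class name is hypothetical; the encoding behaviour is that of the .NET method.

```csharp
using System;

// Hypothetical wrapper, only to show what an unslugified URL segment turns into.
public static class PercentEncodingSketch
{
    // Uri.EscapeDataString percent-encodes every character outside the
    // unreserved set, using the UTF-8 bytes of that character.
    public static string Encode(string segment) => Uri.EscapeDataString(segment);
}
```

PercentEncodingSketch.Encode("a b") gives "a%20b", Encode("a&b") gives "a%26b", and a single 'ę' becomes "%C4%99", which is why a slug without diacritics reads better in a URL.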
I think it's fine to replace diacritics "by default" instead of removing them like in the current implementation. See SlugService.cs. And maybe we should also have another service/filter, like slugify, that does transliteration.
Seb, I will close this one because you already created https://github.com/OrchardCMS/OrchardCore/issues/4151
@hishamco is it actually the same? It looks like @sebastienros meant two things here. One is to include Cropped Latin as part of SlugService (this is what this issue is about), and the other is to have a separate filter for transliteration.
Reopening this
Options to explore:
diacritical-dotnet MIT license
ICU4N or ICU4N.Transliterator see StackOverflow example
.NET 5 switched to ICU on Windows - does it help somehow?
I think it's fine to replace diacritics "by default" instead of removing them like in the current implementation.
The current implementation replaces diacritics, so 'ę' -> 'e', 'ś' -> 's' and so on.
I created a unit test in #10921. @arkadiuszwojcik, is there anything missing, or did I misunderstand you? If so, please break the unit test that I made, then I can see what you mean.
@hishamco - I will perform some tests today and let you know.
@hishamco I can confirm that the current solution works for me, with one exception. This is the mapping table I was using in my solution:
Dictionary<char, char> mapping = new Dictionary<char, char>
{
{ 'Ą', 'A' }, {'ą', 'a'},
{ 'Ć', 'C' }, {'ć', 'c'},
{ 'Ę', 'E' }, {'ę', 'e'},
{ 'Ł', 'L' }, {'ł', 'l'},
{ 'Ń', 'N' }, {'ń', 'n'},
{ 'Ó', 'O' }, {'ó', 'o'},
{ 'Ś', 'S' }, {'ś', 's'},
{ 'Ź', 'Z' }, {'ź', 'z'},
{ 'Ż', 'Z' }, {'ż', 'z'},
};
Right now slugify does exactly the same work, except for one case: 'Ł' and 'ł' both translate to 'ł' instead of 'l'. So for the code:
{{ "ĄĆĘŁŃÓŚŹŻąćęłńśóźż" | slugify }}
I get:
acełnoszzacełnsozz
For other languages there might be more issues/exceptions like that. In the past, I remember, letters with diacritics were removed completely from the string, so there is certainly some improvement now.
On Stack Overflow I found a similar discussion: https://stackoverflow.com/questions/42645854/normalization-misses-polish-characters/51230541 It looks like there are some exceptions, and not all characters can be normalized by Unicode decomposition. More on this here: http://zderadicka.eu/removing-diacritics-marks-from-strings/
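The behaviour described in those links can be reproduced with a few lines. This is a sketch of the usual decompose-and-strip technique, not the actual SlugService code, and the class name is hypothetical.

```csharp
using System.Globalization;
using System.Linq;
using System.Text;

// Hypothetical sketch of the decompose-and-strip technique.
public static class DecompositionSketch
{
    public static string StripCombiningMarks(string input)
    {
        // FormD splits a precomposed letter into its base letter plus combining marks,
        // e.g. 'ę' becomes 'e' followed by U+0328 (combining ogonek).
        var decomposed = input.Normalize(NormalizationForm.FormD);

        // Drop the combining marks and keep everything else.
        var kept = decomposed.Where(c =>
            CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark);

        return new string(kept.ToArray()).Normalize(NormalizationForm.FormC);
    }
}
```

StripCombiningMarks("ĄĆĘŁŃÓŚŹŻ") gives "ACEŁNOSZZ": 'Ł' (and 'ł') has no canonical decomposition in Unicode, so there is no combining mark to strip and the letter passes through unchanged.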
Thanks @arkadiuszwojcik for your testing, I will have a look at the links above. Frankly, I'm not sure whether there's an exception in the normalization method. I will reopen my PR with the example that you mentioned and try to fix the issue.
For other languages there might be more issues/exceptions like that. In the past, I remember, letters with diacritics were removed completely from the string, so there is certainly some improvement now.
If I'm not wrong, @jtkech did that, but maybe the comment "Remove diacritics" in the code confuses us ;)
According to the first link, ł and Ł are exceptions and should be removed. It seems that decomposition does not take place for them. If they are the only exceptions, this could be handled in code, or we could keep a list of exceptions that the user can add to.
@hishamco - according to the second link, there are a few languages with such exceptions. I wonder whether it would be an elegant solution to apply string normalization first and then search for exceptions in some predefined list. The other solution would require the use of some diacritics library. Either way, it looks like any solution to this problem should land in the Localization module?
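A minimal sketch of that "normalize, then consult an exception list" idea follows. The class name and the list are hypothetical; only 'Ł' and 'ł' come from this thread, and other languages would need their own entries.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical sketch; not the Orchard Core implementation.
public static class DiacriticsWithExceptionsSketch
{
    // Letters that Unicode decomposition leaves untouched and that therefore
    // need an explicit replacement. Assumed list, to be extended per language.
    private static readonly Dictionary<char, string> Exceptions = new Dictionary<char, string>
    {
        { 'Ł', "L" }, { 'ł', "l" },
    };

    public static string RemoveDiacritics(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (var c in input.Normalize(NormalizationForm.FormD))
        {
            if (Exceptions.TryGetValue(c, out var replacement))
            {
                // Known exception: use the predefined replacement.
                sb.Append(replacement);
            }
            else if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                // Ordinary character: keep it. Combining marks are skipped.
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

For the test string used earlier, RemoveDiacritics("ĄĆĘŁŃÓŚŹŻąćęłńśóźż") gives "ACELNOSZZacelnsozz", so lower-casing it afterwards would yield the slug without the stray 'ł'.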
The other solution would require the use of some diacritics library.
We could, but for now I'm trying to keep things simple. BTW, if there are new APIs they will belong to OC.Localization, not a module, but all feedback is welcome.
We could, but for now I'm trying to keep things simple. BTW, if there are new APIs they will belong to OC.Localization, not a module, but all feedback is welcome.
So, to keep your solution simple, I would ignore this issue for now and keep it open. The ultimate solution would require OC.Localization accent/diacritics lookup tables for various languages.
IMHO we should provide an abstraction and leave the specific accent implementation to the dev. I'm working on a prototype in Orchard Core Contrib (OCC) that I will share once it's ready.
@arkadiuszwojcik I added Diacritics support with custom accent mapper in OCC, please check my PR https://github.com/OrchardCoreContrib/OrchardCoreContrib/pull/5 and more specifically the unit tests for PolishAccentMapper
/cc @neman
@hishamco looks very good. It maps char to string, as most solutions I saw in the wild do.
It was char to char, then I realized some characters map to more than one.
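For readers who have not opened the PR, the shape being discussed might look roughly like the sketch below. Every name here is hypothetical and is not the OCC API; the German mapper is only an illustration of why a char-to-string table is needed, since 'ß' is conventionally written as two characters.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical abstraction: each language supplies its own char -> string table.
public interface IAccentMapperSketch
{
    IReadOnlyDictionary<char, string> Mapping { get; }
}

// Illustrative implementation; the mapping follows the common German convention.
public sealed class GermanAccentMapperSketch : IAccentMapperSketch
{
    public IReadOnlyDictionary<char, string> Mapping { get; } = new Dictionary<char, string>
    {
        { 'ä', "ae" }, { 'ö', "oe" }, { 'ü', "ue" }, { 'ß', "ss" },
    };
}

public static class AccentMapperRunnerSketch
{
    // Applies a mapper to the input; unmapped characters are kept as they are.
    public static string Apply(IAccentMapperSketch mapper, string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (var c in input)
        {
            sb.Append(mapper.Mapping.TryGetValue(c, out var replacement) ? replacement : c.ToString());
        }
        return sb.ToString();
    }
}
```

AccentMapperRunnerSketch.Apply(new GermanAccentMapperSketch(), "straße") gives "strasse", which a char-to-char table could not express.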
@arkadiuszwojcik is the solution I provided in OCC suited to what you asked for? If yes, we may bring this to OC.
@hishamco yes, it is a good solution. Just one thing: can it be part of the slugify operation (at least optionally)?
@arkadiuszwojcik could you please file an issue in the OCC repo? Then we can check with Seb whether the provided solution is suitable to be moved here.
@hishamco like this: OrchardCoreContrib/OrchardCoreContrib#11 ?
@sebastienros Is it fine to support diacritics as a first-class citizen in OC.Localization? Then we can easily tweak SlugifyService after #11491.
If that sounds good, I will push the first PR tomorrow.
As I noticed, slugify is used quite often in recipes, but there is one problem with it. The current implementation, for the sake of simplicity, removes all diacritics from the given string, but in many languages we would like to transform them into other characters. For example, in Polish: 'ę' -> 'e', 'ś' -> 's', etc. In CMSs, article titles are often slugified to form a URL, but without properly mapped diacritics those URLs look odd. Basically, it would be nice to have something similar to Diacritics.NET. Such code/service could be part of the localisation module, I guess.