OpenMage / magento-lts

Official OpenMage LTS codebase | Migrate easily from Magento Community Edition in minutes! Download the source code for free or contribute to OpenMage LTS | Security vulnerability patches, bug fixes, performance improvements and more.
https://www.openmage.org
Open Software License 3.0
868 stars 436 forks source link

Product url_key generation from utf8 name #1004

Open ioweb-gr opened 4 years ago

ioweb-gr commented 4 years ago

Magento cannot handle creating a url key from UTF8 characters like Greek Characters for example. To replicate try to create a product or category and type these characters in the name field.

Δοκιμαστικός Τίτλος

And then save the product or category without putting anything in the url_key

You will notice some junk characters create with dashes and i characters

I would like to add a slugify function instead to create meaningful url keys using https://github.com/cocur/slugify which is really tested in a lot of cases.

Do you think something like this would be a BC break?

The affected function would be

\Mage_Catalog_Helper_Product_Url::format

Which would get entirely replaced by the above slugify library. It does add a dependency in composer.json though.

For what I can see it's only used twice

app/code/core/Mage/Catalog/Model/Category.php:475 app/code/core/Mage/Catalog/Model/Product/Url.php:158

jfksk commented 4 years ago

Even tho the behavior is broken, I think that there likely are stores relying on it. IMHO a change should be configurable. Maybe even in an config file, so there is no chance, a merchant can accidentally screw up his URLs.

What about using INTL, instead of introducing a library, for generating slugs?

$slug = Transliterator::createFromRules(
    ':: Any-Latin;'
    . ':: Latin-ASCII;'
    . ':: NFD;'
    . ':: [:Nonspacing Mark:] Remove;'
    . ':: NFC;'
    . ':: Lower();'
    . '[:Separator:] > \'-\''
)->transliterate('?-testing the Δοκιμαστικός Τίτλος © тест 😉');

echo preg_replace('/[^a-z0-9]/u', '-', $slug);
//results in: --testing-the-dokimastikos-titlos--c--test--
Flyingmana commented 4 years ago

url keys are only generated when it is initially created and then stored in the database, so the direct impact is in any case lower. Using a battle tested library sounds like a good thing. Using composer dependencies is not finished yet, but not a general problem. It may be something for one of our yearly Major releases.

ioweb-gr commented 4 years ago

I also like the idea of @jfksk for keeping dependencies to a minimum to be honest. I just didn't think about it because I keep relying on cocur/slugify on all my ERP / ESHOP integrations.

One thing that just came to mind is whether this will affect cases where the admin duplicates a product. Generally magento will add a suffix to the url key when it finds a duplicate but that doesn't seem to be part of the format function.

jfksk commented 4 years ago

@Flyingmana the impact for most stores might be really small or non existing, but I would not fully replace the existing logic. E.g.: I know cases where external systems were mimicking the Magento logic. It's crazy and odd but such things exist. Also who knows what people implemented because of the shortcomings of M1. That's why I'm a little reserved on that. ATM I'd prefer leaving the option to stay with the existing logic - at least for a transition.

@ioweb-gr ignoring the dependency, I'd rather use or use something that uses the ICU library (INTL), as it is kinda the standard for handling UTF-8. You can throw a lot at it and you will get something meaningful. E.g. Mandarin 你好 will correctly result in ni-hao - with the above example. There is AFAIK a lot of complexity behind transliteration and I think that the ICU lib. does it very well.

ioweb-gr commented 4 years ago

Well in that case providing an option in config under seo settings would be our best option for a while.

Something like

url transliteration --> yes|no

When you say

mimicking the way magento does it in external systems

It does mean that those people suffer from the same broken function though. If we're planning to add features we won't always be able to not cause BC breaks. As long as we document BC breaks carefully the people following should update their code and in the end the config option would be improved after a major update

As far as the method, I wouldn't mind either way. We could test with a lot of unicode strings both of them and compare the results before deciding I guess

simbus82 commented 4 years ago

Something like

url transliteration --> yes|no

This is what happen in other CMS (a good example is Joomla CMS that manage very well multilanguage, non-latin and rtl websites). image (screenshot from upcoming Joomla 4.0)

ioweb-gr commented 4 years ago

This is something different actually. It's not transliterating. It's allowing the usage of UTF-8 characters in the url instead so you'd get a URL like this when you copy it

/product-category/%ce%b1%ce%be%ce%b5%cf%83%ce%bf%cf%85%ce%ac%cf%81-2/

but which appears like this normally

/product-category/αξεσουάρ-2/