flarum / framework

Simple forum software for building great communities.
http://flarum.org/

Slug transliteration #194

Closed tobyzerner closed 3 years ago

tobyzerner commented 9 years ago

Ex: https://chanphom.com/forums/luat-choi-chan-pro.29/ from "Luật chơi Chắn Pro"

dcsjapan commented 9 years ago

Transliteration is possible for many languages, but very difficult or impossible for a few languages (like Japanese). It would be best if there were a way to enable/disable this function; or barring that, percent encoding of unicode might be preferable as a more universally applicable solution.

tobyzerner commented 9 years ago

Currently slugs are generated using only alphanumeric characters, replacing anything else with a hyphen. However we should support some degree of transliteration so non-Latin languages still get slugs. This is an area where I don't have much knowledge, and help would be appreciated.

What needs to be done:

wielski commented 9 years ago

Maybe you can use library like this one? https://github.com/ashtokalo/php-translit

Buhito72 commented 9 years ago

In Spanish, the mod_rewrite replaces all Latin characters like ñ, accents, etc. with a hyphen. To improve SEO, it would be better to rewrite them as the equivalent characters, for example: español ---> espanol (instead of espa-ol), corazón ---> corazon (instead of coraz-n). It can be done with a simple replacement of characters.

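<?php
function friendly_urls($url) {
    // mb_strtolower() also lowercases accented capitals such as "Á" or "Ñ"
    $url = mb_strtolower($url, 'UTF-8');

    // Replace accented characters with their unaccented equivalents
    $find = array('á', 'é', 'í', 'ó', 'ú', 'ñ');
    $repl = array('a', 'e', 'i', 'o', 'u', 'n');
    $url = str_replace($find, $repl, $url);

    // Double quotes are required so that "\r\n" and "\n" are real line breaks
    $find = array(' ', '&', "\r\n", "\n", '+');
    $url = str_replace($find, '-', $url);

    // Drop remaining characters, collapse repeated hyphens, strip tags
    $find = array('/[^a-z0-9-<>]/', '/[-]+/', '/<[^>]*>/');
    $repl = array('', '-', '');
    $url = preg_replace($find, $repl, $url);

    return $url;
}
?>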

ISilvaPT commented 9 years ago

Same could be said for Portuguese:

ã | â | á | à > a
ê | é | è > e
í | ì > i
õ | ô | ó | ò > o
ú | ù > u
ç > c
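The Spanish and Portuguese mappings suggested above can be combined into a single lookup table. The following is only a minimal sketch: the function name strip_iberian_accents and the exact table are assumptions made for illustration, not anything Flarum ships. It handles lowercase letters only, so a real implementation would need to lowercase the title first (for example with mb_strtolower) or extend the table with capitals.

```php
<?php
// Hypothetical helper: the name and the table contents are illustrative only.
function strip_iberian_accents(string $text): string
{
    // Lowercase letters only; capitals such as "Á" would pass through unchanged.
    $map = [
        'ã' => 'a', 'â' => 'a', 'á' => 'a', 'à' => 'a',
        'ê' => 'e', 'é' => 'e', 'è' => 'e',
        'í' => 'i', 'ì' => 'i',
        'õ' => 'o', 'ô' => 'o', 'ó' => 'o', 'ò' => 'o',
        'ú' => 'u', 'ù' => 'u',
        'ç' => 'c', 'ñ' => 'n',
    ];

    // strtr() with an array replaces each key once and never rescans its own output.
    return strtr($text, $map);
}

echo strip_iberian_accents('coração português'), "\n";
echo strip_iberian_accents('corazón español'), "\n";
```

A table like this stays small for Western European languages, but as later comments point out, it does not scale to scripts such as Chinese or Japanese.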

dcsjapan commented 9 years ago

As I mentioned above and in flarum/framework#557, transliteration isn't a complete solution. There are some languages that can't be transliterated very easily, or at all.

In the case of Japanese, as I mentioned in Stumbling block 6, it would take a lot of rather sophisticated processing to come up with reliable transliterations of words spelled using Chinese characters. And even the most sophisticated program will be reduced to guessing when it comes to things like names, which can use Chinese characters in nonstandard ways.

Japanese is clearly an extreme case, but even where the relationship between pronunciation and spelling tends to be more stable, there are still difficulties. To transliterate Chinese reliably, for example, you would need to provide a glossary of at least several thousand characters. So it's not always a matter of applying a few well-defined rules.

In regions where transliteration is impractical, there is a strong trend toward the use of unicode in URLs. Flarum will have to support that, or it will simply be irrelevant in those regions. At the same time, however, Flarum also needs to offer transliteration for regions that have adopted that approach.

My suggestion is:

Admins should be allowed to specify whether URLs should be transliterated or encoded. This could be implemented as an administrator setting, though it might be better still to have the question asked and answered during the installation process.

When an admin chooses the former, a library such as this one suggested by @FIrestarterUA could be used to transliterate all slugs, including thread titles, tag names, and usernames. (Flarum may need to check all these items and return an error whenever any non-transliteratable text is entered. Or we could leave it up to admins to tell their users: "Don't use any Chinese characters ... or else!")

When an admin chooses the latter, all URLs are encoded appropriately, with only an absolute minimum of character replacement (e.g. hyphens in place of spaces) being performed.

johannsa commented 8 years ago

Why not use the same approach as Wikipedia and allow Unicode in slugs, which is supported by modern browsers and also by part of Flarum's frontend? This way many character sets would be available.

Also, currently slugs for discussions are generated on the client which is not ideal. They should be generated on the server (and stored on the database like tag slugs are).

dcsjapan commented 8 years ago

Why not use the same approach as Wikipedia and allow Unicode in slugs, which is supported by modern browsers and also by part of Flarum's frontend?

I think that would be a great solution ... I'd just like to be sure there aren't any SEO implications for admins in regions where transliteration is the accepted approach.

franzliedke commented 8 years ago

As discussed in flarum/framework#646, we can use Stringy which gives us slugging functionality for free.

franzliedke commented 8 years ago

We might also want to truncate the slug after a certain length.
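Truncation could be done on a word boundary so the slug does not end in half a word. This is a rough sketch under the assumption that the slug is already plain ASCII with hyphens as separators; the function name truncate_slug and the default limit of 50 are made up for the example.

```php
<?php
// Hypothetical helper; assumes an ASCII slug such as "hello-world".
function truncate_slug(string $slug, int $maxLength = 50): string
{
    if (strlen($slug) <= $maxLength) {
        return $slug;
    }

    $cut = substr($slug, 0, $maxLength);

    // If the cut landed inside a word, back up to the previous hyphen.
    if ($slug[$maxLength] !== '-') {
        $lastHyphen = strrpos($cut, '-');
        if ($lastHyphen !== false) {
            $cut = substr($cut, 0, $lastHyphen);
        }
    }

    return rtrim($cut, '-');
}

echo truncate_slug('hello-world-this-is-a-long-title', 13), "\n";
```

For Unicode slugs the same idea would need the mb_* string functions, since strlen() and substr() count bytes rather than characters.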

thecotne commented 8 years ago

I want to mention here that for the Georgian language slugs are not generated at all (from "რა კაი ფორუმი წამოვჭიმეთ!" I got the slug "--"). Also, the Wikipedia approach is best for slugs.

akalongman commented 8 years ago

+1 @tobscure We need unicode slugs

dsevillamartin commented 8 years ago

This looks good for different languages: Cocur/Slugify. The only problem is that it needs the language name to be fully spelled out (english instead of en), although that is probably an easy fix. The other one I found, which doesn't need a language, is Jbroadway/urlfix, although that one is more basic, I think. Whichever is better ;)

dcsjapan commented 8 years ago

Of the transliteration options mentioned, Slugify strikes me as the most worthy of consideration. It covers a wide range of languages out of the box, can easily be customized to cover more, and is flexible when it comes to integration.

As @franzliedke said, Stringy may also be an option, especially if it can also be employed for tasks other than transliteration. One cause for concern is that it only does slugification, not true transliteration; that is, it seems to work on a fixed ruleset:

Converts the string into an URL slug. This includes replacing non-ASCII characters with their closest ASCII equivalents, removing remaining non-ASCII and non-alphanumeric characters, and replacing whitespace with $replacement.

This may not provide the best transliterations for all languages; converting ä to a would not work in a language where ae is the more commonly used transliteration. A more language-specific solution would give better results vis-a-vis both SEFiness and human readability.

I'm wondering whether it would be possible to use Stringy, but insert language-specific rulesets (like the ones used by Slugify) when available. We could put the ruleset file right in the language pack, as we've done with Moment.js translations. When the admin sets the forum's slugification style to "transliteration" (as opposed to "UTF-8") Flarum would grab the ruleset for the forum's default language and slugify based on that. If the language pack is lacking a ruleset, it could fall back to standard Stringy slugification.

Would something like this be possible?

EDIT: It would be best to have Stringy treat the language-specific ruleset as overrides, so it can default to its own slugification rules when it encounters a character that's not covered in the ruleset being used. That would allow it to cope with situations involving characters not included in the ruleset for the default language ... such as a topic about Søren Kierkegaard in a French forum.

This solution would be best suited to single-language forums. Handling of thread titles (etc.) in more than one language would tend to be hit-and-miss. And in cases where a forum includes languages requiring different slugification methods ... Russian and Japanese, for example ... the admin will be forced to use UTF-8 slugs. The only way around that would be to make Flarum truly multilingual, i.e. assign a locale value to each thread.
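The override idea can be sketched without committing to any particular library. In the example below, DEFAULT_RULES stands in for a library's built-in table and LANGUAGE_RULES for rulesets that language packs might ship; both constants, their contents, and the function name slugify_with_ruleset are assumptions made purely for illustration.

```php
<?php
// Hypothetical default table, standing in for a slug library's built-in rules.
const DEFAULT_RULES = [
    'ä' => 'a', 'ö' => 'o', 'ü' => 'u', 'ß' => 'ss', 'ø' => 'o', 'é' => 'e',
];

// Hypothetical per-language overrides, as a language pack might provide them.
const LANGUAGE_RULES = [
    'de' => ['ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue'],
    'fr' => ['œ' => 'oe'],
];

function slugify_with_ruleset(string $title, string $locale): string
{
    // Array union keeps the left-hand entries, so overrides win and anything
    // they do not cover falls back to the default rules.
    $rules = (LANGUAGE_RULES[$locale] ?? []) + DEFAULT_RULES;

    // Without mbstring only ASCII capitals are lowercased.
    $text = function_exists('mb_strtolower')
        ? mb_strtolower($title, 'UTF-8')
        : strtolower($title);

    $text = strtr($text, $rules);
    $text = preg_replace('/[^a-z0-9]+/', '-', $text);

    return trim($text, '-');
}

echo slugify_with_ruleset('Käse für alle', 'de'), "\n";      // German overrides apply
echo slugify_with_ruleset('Søren Kierkegaard', 'fr'), "\n";  // ø falls back to the defaults
```

A locale with no ruleset at all simply gets the default table, which matches the proposed fallback to standard slugification.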

yihui commented 8 years ago

As a Chinese speaker, I'd just want a simple option to disable slugs of posts. I don't want either transliteration or Unicode characters in the URLs. Personally I also prefer shorter URLs like example.com/d/12345 instead of example.com/d/12345-hello-world. Having Unicode Chinese characters in the URL will make it horribly long and messy like https://zh.wikipedia.org/wiki/Portal:%E6%96%B0%E8%81%9E%E5%8B%95%E6%85%8B when you copy the URL from the address bar of the browser (e.g. Chrome). That is not human readable, so such slugs will be useless. I think disabling transliteration is much easier to implement and more useful to Chinese users.

dcsjapan commented 8 years ago

Safari and Firefox are able to copy the URL in human-readable format. When I open the URL you linked above and copy it from the Safari address bar, I get this:

https://zh.wikipedia.org/wiki/Portal:新聞動態

So this should probably be considered a deficiency of Chrome ... or of your OS, perhaps. That said, a third option to disable slugs altogether shouldn't be too hard to implement, and may be wanted by enough site admins that it would be worth adding.
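For reference, the two forms of the URL quoted above are the same string, once percent-encoded and once decoded. A quick way to see this in PHP, assuming the source file is saved as UTF-8:

```php
<?php
$title = '新聞動態';

// rawurlencode() percent-encodes each UTF-8 byte of the title.
$encoded = rawurlencode($title);
echo $encoded, "\n"; // %E6%96%B0%E8%81%9E%E5%8B%95%E6%85%8B

// rawurldecode() restores the human-readable form.
$decoded = rawurldecode($encoded);
echo $decoded, "\n"; // 新聞動態
```

Four characters become 36 bytes of URL text, which is why copied Unicode URLs look so long in browsers that put the encoded form on the clipboard.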

believer-ufa commented 8 years ago

Hello guys :) Have you heard about the PHP Intl Transliterator extension?

For example, you can use this snippet of code to transliterate any string to Latin characters (even Japanese characters, as far as I know):

<?php
$rules = 'Any-Latin; Latin-ASCII; [\u0080-\uffff] remove';

echo transliterator_transliterate($rules,'Какая-то строка, которая нуждается в транслитерации');
// Kakaa-to stroka, kotoraa nuzdaetsa v transliteracii

echo transliterator_transliterate($rules,'新聞動態');
// xin wen dong tai

echo transliterator_transliterate($rules,'რა კაი ფორუმი წამოვჭიმეთ');
// ra kai porumi tsamovchimet

You can find more info about these transliterator functions in the source of the Yii 2 framework, for example.

believer-ufa commented 8 years ago

Also, on the page describing the Intl extension you can find a message from one of the PHP developers that gives one possible way to transform a string into a correctly transliterated URL:

<?php
function slugify($string) {
    $string = transliterator_transliterate("Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC; [:Punctuation:] Remove; Lower();", $string);
    $string = preg_replace('/[-\s]+/', '-', $string);
    return trim($string, '-');
}

echo slugify("Я люблю PHP!"); // a-lublu-php
echo slugify('რა კაი ფორუმი წამოვჭიმეთ'); // ra-kʼai-porumi-tsʼamovchʼimet
echo slugify('新聞動態'); // xin-wen-dong-tai
?>

I think we need to test this on a number of strings to choose the more correct method :)

franzliedke commented 8 years ago

@believer-ufa Thanks for pointing it out, we'll take a look.

However, since this requires the intl extension, we probably have to use another approach (library).

believer-ufa commented 8 years ago

@franzliedke, you already use the gd and mysql extensions. Why is the use of this extension a problem? On any Linux OS it's a problem that is resolved by one command, like sudo apt install php7.0-intl.

You most likely will not be able to do an equally good transliteration with some other library, since most of these libraries are intended only for certain languages.

franzliedke commented 8 years ago

Well, you will probably agree that we can be reasonably certain that MySQL is installed everywhere. (And even if not, Flarum can not function without it.)

But yeah, I'm open to the idea. Does anybody know some place with PHP extension installation stats?

believer-ufa commented 8 years ago

I don't quite understand you. The Flarum installation guide tells users that they need SSH access and PHP 5.5+ with the following extensions: mbstring, pdo_mysql, openssl, json, gd, dom, fileinfo. It's a common situation: you install some PHP extensions to be able to run some framework. You just need to install one more extension to have correct transliterations in your forum :)

dcsjapan commented 8 years ago

@believer-ufa Not every Flarum admin will have the access necessary to install the extension. One of the devs' goals is to keep Flarum easy to install on shared hosting plans. Every extension added can limit the number of providers that will be able to support Flarum. I think that's why @franzliedke is asking about extension installation stats; it's a decision that can't be made too casually.

believer-ufa commented 8 years ago

Okay, but it's a really nice extension :) Look at the discussion on the Flarum forums; one of the participants is already convinced by this approach.

You can also write the code so that it does not require the Intl extension, but uses it if it is available. I think that will be the right solution: it avoids problems with bad hosting and still solves this problem.

jordanjay29 commented 8 years ago

Maybe @believer-ufa's method is better as an extension, regardless of who makes it. Then composer can check if the proper PHP extension is available and refuse to install if not. Being so dependent on an additional PHP module, if it's not widely installed, may hurt Flarum's ability to be widespread more than lacking this feature.

believer-ufa commented 8 years ago

jordanjay29, you can write code that uses Intl if it exists; if it does not, Flarum can still work, just without nice, full-language URL transliteration. Read my comment above.

franzliedke commented 8 years ago

Well, not using the Intl extension does not mean we can't implement transliteration. There are enough libraries out there.

Still, I kinda like the idea of using Intl when it's available, and only falling back to another implementation if not.
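Roughly, "use Intl when available, fall back otherwise" could look like the sketch below. The function names and the tiny fallback table are assumptions for illustration; a real fallback would more likely delegate to one of the libraries mentioned in this thread. transliterator_transliterate() is the Intl function shown earlier, used here with the compound ID 'Any-Latin; Latin-ASCII; Lower()'.

```php
<?php
// Hypothetical helper: prefers Intl, otherwise uses a deliberately tiny table.
function transliterate_for_slug(string $text): string
{
    if (function_exists('transliterator_transliterate')) {
        $result = transliterator_transliterate('Any-Latin; Latin-ASCII; Lower()', $text);
        if ($result !== false) {
            return $result;
        }
    }

    // Fallback path: covers only a handful of lowercase Spanish characters.
    $fallback = ['á' => 'a', 'é' => 'e', 'í' => 'i', 'ó' => 'o', 'ú' => 'u', 'ñ' => 'n'];

    return strtolower(strtr($text, $fallback));
}

function slug_with_optional_intl(string $title): string
{
    $text = transliterate_for_slug($title);

    // Anything that is still not a-z or 0-9 becomes a single hyphen.
    $text = preg_replace('/[^a-z0-9]+/', '-', $text);

    return trim($text, '-');
}

echo slug_with_optional_intl('Corazón español'), "\n";
```

With this shape, hosts without Intl still get working (if cruder) slugs, and the quality improves automatically wherever the extension is installed. One caveat is that the same title can then produce different slugs on different servers.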

dcsjapan commented 8 years ago

Still, I kinda like the idea of using Intl when it's available, and only falling back to another implementation if not.

That sounds promising. 😀

hgtucel commented 8 years ago

We use Turkish characters in titles, but the resulting links do not look good for SEO.

Example:
Title: Türkçe Deneme Asğşiçü
Link: trkce-deneme-as

Turkish characters: İ ı ş ç ğ ü

How can I solve this problem?

(Sorry for my bad English.)

dcsjapan commented 8 years ago

@hgtucel Please see my comments in your forum thread.

saggel commented 8 years ago

I'm here to add Greek to the table too, as I pointed out on the forum. Happy to help with the mapping if needed!

jordanjay29 commented 8 years ago

Referencing a new extension by @avatar4eg that offers a potential solution.

HLFH commented 8 years ago

@jordanjay29 Does not work for me. https://github.com/Avatar4eg/flarum-ext-transliterator/issues/1

HLFH commented 8 years ago

@jordanjay29 Edit: it works, but you have to manually fix the URLs of old forum pages by renaming them twice; then the flarum-ext-transliterator extension does its job. For newly created pages, the URLs are OK.

jordanjay29 commented 8 years ago

@HLFH I'm not the extension author, please report this bug on the extension thread at Flarum.org, or on the author's github.

HLFH commented 8 years ago

@jordanjay29 Already done. https://github.com/Avatar4eg/flarum-ext-transliterator/issues/1

firegurafiku commented 8 years ago

Let me support the idea which was proposed by @yihui: there should be an option to either disable slugs completely, or set them manually. Or, better, both of them.

Forcing everyone to use machine-transliterated slugs is a huge hurt, as many languages just cannot be romanized well enough, or, at least, unambiguously. For them the result is just a confusing meaningless mess of letters.

@believer-ufa

The library you proposed seems to do only the simplest table-based substitutions. Let me comment on your example:

Какая-то строка, которая нуждается в транслитерации
Kakaa-to stroka, kotoraa nuzdaetsa v transliteracii

Or maybe: kakaya, kotoraya, nuzhdaetsya. According to your nickname, you should know that Russian has a bunch of different transliteration schemes. Even the government cannot decide which one to use.

新聞動態
xin wen dong tai

But how about reading this in Japanese: shinbun dotai? Or maybe Korean reading? Unicode does not distinguish between Chinese, Japanese and Korean graphemes.

Even Latin-based scripts cannot be reliably transliterated.
Moreover, what if user wants title translation, not a transliteration in their URLs?

dcsjapan commented 8 years ago

Unicode does not distinguish between Chinese, Japanese and Korean graphemes.

Even Latin-based scripts cannot be reliably transliterated.

Just so!

Moreover, what if user wants title translation, not a transliteration in their URLs?

That might be worth investigating as an idea for a third-party extension. For now, I think it would be sufficient if Flarum could offer a robust system to provide for both transliteration and unicode, with enough configuration options to allow admins in any region to tweak its behavior to their liking.

yihui commented 8 years ago

a robust system to provide for both transliteration and unicode

plus an option to disable slugs completely please... :)

dcsjapan commented 8 years ago

plus an option to disable slugs completely please... :)

I don't see why that couldn't be added. Compared to everything else, it would be _easy._ 😄

Incidentally,

Having Unicode Chinese characters in the URL will make it horribly long and messy like https://zh.wikipedia.org/wiki/Portal:%E6%96%B0%E8%81%9E%E5%8B%95%E6%85%8B when you copy the URL from the address bar of the browser (e.g. Chrome).

I don't experience this sort of thing when using Safari (though I have seen it when using Firefox). One would hope that the other browsers could get with the program and make it possible to copy and paste properly encoded URLs so the result would be human readable ... 🙄


EDIT: See my comment below.

believer-ufa commented 8 years ago

Forcing everyone to use machine-transliterated slugs is a huge hurt, as many languages just cannot be romanized well enough, or, at least, unambiguously. For them the result is just a confusing meaningless mess of letters.

Interesting logic, but I believe that you are making too much of an issue out of this topic. We just need URLs that carry some info about the conversation. After all, nothing terrible will happen if the URL is slightly incorrect. It is better to have at least something: it gives search engines additional information about the page for better SEO.

franzliedke commented 8 years ago

On the other hand I'm not sure what search engines do with nonsensical information (such as from a wrong transliteration) in the URL. Thanks for bringing it up, @yihui and @firegurafiku!

dcsjapan commented 7 years ago

Scratch that ... I just copied and pasted a Google URL with Safari and ended up with a string of very non-human-readable percent encodings in it. I had been thinking that Safari fixes percent-encoded URLs when copying to the clipboard, but that doesn't seem to be the case after all.

So the issue raised by @yihui is definitely something we need to think about.

aethior commented 7 years ago

I'm not a developer, but I want to share my opinion as a user and webmaster. Why not copy the slug method of WordPress (the most used CMS)?

WordPress uses lowercase Latin letters, without symbols or marks, and you have the option of using characters from other alphabets. I also find the option of short URLs without the post title (as a setting in the admin panel) interesting.

In any case, I want to register my opposition to a method similar to Wikipedia's. I'm Spanish and my language uses a lot of symbols and marks, and the Wikipedia URLs are annoying when you want to share Wikipedia links.

I think the URL method should be simple, with complex transliteration added by extensions (WordPress has different plugins for that).

sijad commented 7 years ago

Neither transliterator_transliterate nor Slugify is suitable for the Persian language.

believer-ufa commented 7 years ago

@sijad, if we are talking about Slugify, you can easily add your own rules for your language.

thecotne commented 7 years ago

what if we use github issue like urls? (id only no slug no transliteration) and then some plugins may change urls ....

aethior commented 7 years ago

what if we use github issue like urls? (id only no slug no transliteration) and then some plugins may change urls ....

Those URLs are neither SEO-friendly nor human-friendly. Your suggestion was discussed here: https://github.com/flarum/core/issues/1140#issuecomment-284613976

firegurafiku commented 7 years ago

@believer-ufa

if we are talking about Slugify, you can easily add your own rules for your language.

How about easily adding support for Chinese or Japanese? Languages are hard and nobody should rely on automatic romanization. Instead, there should be options to disable slugs altogether, or to set them manually.

sijad commented 7 years ago

@believer-ufa In Persian, people usually do not use diacritics in text, so Slugify is not an option. For Persian (and Arabic?), using Unicode plus a few filters (remove diacritics, non-alphanumerics, spaces, etc.) is the best option.
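The "Unicode plus a few filters" approach can be sketched with PCRE's Unicode character classes. The function name unicode_slug is hypothetical, and this is only a rough illustration: it strips combining marks, keeps letters and digits from any script, and joins the rest with hyphens. It does not decompose precomposed accented letters, and it treats the zero-width non-joiner common in Persian text as a separator, which a real implementation would probably want to handle differently.

```php
<?php
// Hypothetical helper: keeps the original script instead of transliterating.
function unicode_slug(string $title): string
{
    // Remove combining marks (category Mn), e.g. Arabic-script vowel signs.
    $text = preg_replace('/\p{Mn}+/u', '', $title) ?? '';

    // Replace every run of characters that are not letters or digits with a hyphen.
    $text = preg_replace('/[^\p{L}\p{N}]+/u', '-', $text) ?? '';
    $text = trim($text, '-');

    // Lowercasing only matters for scripts that have case; without mbstring
    // only ASCII letters are lowercased.
    return function_exists('mb_strtolower')
        ? mb_strtolower($text, 'UTF-8')
        : strtolower($text);
}

echo unicode_slug('Hello, World!'), "\n";
echo unicode_slug("س\u{064E}لام دنیا!"), "\n"; // input contains a fatha (U+064E)
```

The resulting slug would still be percent-encoded on the wire, so the concerns raised earlier about copied URLs apply to this approach as well.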