danielstjules / Stringy

A PHP string manipulation library with multibyte support
MIT License

Moved the Cyrillic letters `ь` and `ъ` from the Latin `b` to `y`. #164

iipavlov closed 7 years ago

iipavlov commented 7 years ago

In Bulgarian, the Cyrillic ъ is a vowel pronounced like the u in "cut". ь appears only in the combination ьо, pronounced like the yo in "yo-yo". Both exist only in some Cyrillic alphabets, but I don't think they are pronounced like the Latin b in any of them. The bg-specific transliteration follows https://en.wikipedia.org/wiki/Romanization_of_Bulgarian

danielstjules commented 7 years ago

Thanks for the PR, the second commit looks good! As for the first, it looks like you're right - I can't find any romanization/transliteration guidelines that would convert them to the latin b. :)

However, should ъ/ь actually default to "/' (quotes and apostrophes being the transliteration of prime ′ & double prime ″) or empty strings? https://en.wikipedia.org/wiki/Romanization_of_Russian

iipavlov commented 7 years ago

@danielstjules, ъ and ь are probably the most confusing letters in Cyrillic: they don't exist in all alphabets, and where they do exist they have different sounds, or rather a different influence on the surrounding sounds (see https://en.wikipedia.org/wiki/Cyrillic_script). Anyway, my reason for not representing them as quotes and apostrophes (as proposed in the Russian transliteration) was that I wanted to keep the representation alphanumeric, so the romanization could be used in URLs, a common scenario for SEO.

danielstjules commented 7 years ago

> Anyway, my reason to not represent them as quotes and apostrophes (as proposed in the Russian transliteration) was that I wanted to keep alphanumeric representation, so the romanization could be used in URLs - a common scenario for SEO.

Makes sense! But `toAscii` really only performs transliteration/romanization to the ASCII range, which would include apostrophes. For use with URLs, I'd recommend `slugify`, which would strip those special chars.
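To make the distinction concrete, here is a minimal sketch of the two steps. It does not use Stringy itself: `sketchToAscii`, `sketchSlugify`, and the tiny character map are hypothetical stand-ins, and the map assumes the Russian-style convention discussed above (ъ as a double quote, ь as an apostrophe).

```php
<?php
// Hypothetical stand-ins for illustration only, not Stringy's implementation.

/** Transliterate with a tiny assumed map; the result is ASCII but may contain punctuation. */
function sketchToAscii(string $str): string
{
    // Assumption: ъ and ь map to a double quote and an apostrophe
    $map = [
        'н' => 'n', 'о' => 'o', 'в' => 'v', 'е' => 'e', 'б' => 'b',
        'я' => 'ya', 'л' => 'l', 'и' => 'i',
        'ъ' => '"', 'ь' => "'",
    ];
    // The array form of strtr replaces whole UTF-8 sequences, longest key first
    return strtr($str, $map);
}

/** Build a URL slug: transliterate first, then drop everything that is not URL-friendly. */
function sketchSlugify(string $str): string
{
    $ascii = strtolower(sketchToAscii($str));
    // Remove anything that is not a letter, digit, whitespace, or hyphen
    $clean = preg_replace('/[^a-z0-9\s-]/', '', $ascii);
    // Collapse whitespace runs into single hyphens and trim stray hyphens
    return trim(preg_replace('/\s+/', '-', $clean), '-');
}

echo sketchToAscii('новое объявление'), "\n"; // novoe ob"yavlenie
echo sketchSlugify('новое объявление'), "\n"; // novoe-obyavlenie
```

The transliteration step keeps the quote because it is a valid ASCII character; only the slug step removes it.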

iipavlov commented 7 years ago

Actually, ъ is rather a big deal in Bulgarian. It appears in the names of the country and the language themselves (България, български), and there are words like ъгъл (corner) which would look very strange if ъ were replaced with an empty string or an apostrophe. As far as I know this letter is very rare (or nonexistent) in the other languages using Cyrillic, so the risk of misrepresentation is lower there.

danielstjules commented 7 years ago

But the transliteration for Bulgarian would be correct when supplying ->toAscii('bg')? I'm only suggesting we change the default, if that makes sense.

However, that does bring up the point that slugify is missing a $language param!
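The idea of a default map plus a per-language override, and of passing the language through to the slug step, could look roughly like the sketch below. Everything here is an assumption made for illustration: the function names, the constants, and the abbreviated maps are hypothetical and are not taken from Stringy's source. The `bg` entry uses ъ → a, as in the streamlined Bulgarian romanization linked earlier.

```php
<?php
// Hypothetical sketch: names and maps are assumptions, not Stringy's code.

// Default map, with ъ and ь both mapped to y as proposed in this PR
const SKETCH_DEFAULT_MAP = [
    'ъ' => 'y', 'ь' => 'y',
    'г' => 'g', 'л' => 'l', 'о' => 'o', 'с' => 's', 'и' => 'i', 'н' => 'n',
];

// Language-specific overrides; only the entries that differ from the default
const SKETCH_LANGUAGE_MAPS = [
    'bg' => ['ъ' => 'a'],
];

function sketchTransliterate(string $str, string $language = 'en'): string
{
    // With array union (+) the left operand wins, so overrides beat defaults
    $map = (SKETCH_LANGUAGE_MAPS[$language] ?? []) + SKETCH_DEFAULT_MAP;
    return strtr($str, $map);
}

// A slug helper that forwards the language instead of always using the default
function sketchSlugifyWithLanguage(string $str, string $language = 'en'): string
{
    $ascii = strtolower(sketchTransliterate($str, $language));
    $clean = preg_replace('/[^a-z0-9\s-]/', '', $ascii);
    return trim(preg_replace('/\s+/', '-', $clean), '-');
}

echo sketchTransliterate('ъгъл'), "\n";                   // ygyl
echo sketchTransliterate('ъгъл', 'bg'), "\n";             // agal
echo sketchSlugifyWithLanguage('син ъгъл', 'bg'), "\n";   // sin-agal
```

Keeping the overrides small means a language only has to list the letters where it disagrees with the default.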

iipavlov commented 7 years ago

> However, that does bring up the point that slugify is missing a $language param!

> As far as I know this letter is very rare (or nonexistent) in the other languages using Cyrillic, so the possible risks of misrepresentation are less there.

This is why I proposed it in the default map. It would fix it backward. Languages like Serbian and Macedonian would never notice it, as the letter is not used in them. In Russian, words with it are very rare, and y as a replacement would make sense there too, as it is a no-sound letter there. Or they could use the language param later.

danielstjules commented 7 years ago

Seems reasonable to me, thanks again! :)