Closed amilenkov closed 4 years ago
Thank you @amilenkov 👍
I have filed a PR with your file. I see that 0x044D => 'e' was removed as per your comment in https://github.com/backdrop/backdrop-issues/issues/1544#issuecomment-180761933. The one minor modification on my end was to add a period to the end of the comment in order to adhere to coding standards.
PS: May I ask what your source was for the Google implementation? I would like to do the same for the Greek transliteration file.
Thanks @klonos,
Bulgaria has an official, legally approved transliteration into Latin, but it is used mainly in official documents, on road signs, in street names and so on, not everywhere. People use various other systems, including substituting digits for Bulgarian letters that have no Latin equivalent.
For example, for the Bulgarian letter "ш" they write the digit 6 instead of the official "sh", because it is shorter to type and because the Bulgarian word for the number 6 begins with the letter "ш" (шест).
You will see in the quoted documents that Google has learned to recognize even words transliterated this way (using digits), because many people write them like that.
People are very inventive, especially when using computers or smartphones without a Cyrillic keyboard.
That is why Google does not follow the officially accepted transliteration, but the one most common on web pages.
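To make the difference concrete, here is a minimal Python sketch (not Backdrop's PHP code). The table names `OFFICIAL` and `INFORMAL` and the `transliterate` helper are hypothetical, and the tables cover only the four letters of "шест", using the "sh" versus 6 example above.

```python
# Hypothetical tables covering only the letters of "шест" (six).
OFFICIAL = {"ш": "sh", "е": "e", "с": "s", "т": "t"}
# Informal variant: same table, but "ш" is typed as the digit 6.
INFORMAL = {**OFFICIAL, "ш": "6"}


def transliterate(text, table):
    """Replace each character found in the table; pass others through."""
    return "".join(table.get(ch, ch) for ch in text)


print(transliterate("шест", OFFICIAL))  # shest
print(transliterate("шест", INFORMAL))  # 6est
```

Both outputs stand for the same Bulgarian word, which is why a search engine has to recognize more than one scheme.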
I've tried to find Google sources on how to transliterate correctly, but I couldn't find any. Perhaps that is because they do not treat any single transliteration as the correct one, but accept a few commonly used ones and follow what people actually do.
The sources for the scheme I propose are recommendations from large Bulgarian companies specializing in SEO and the practice on my own sites. I see in practice that URLs transliterated this way are correctly recognized by Google.
Here are two of the sources I have used (but they are in Bulgarian):
http://www.seo-bg.com/seo-google-transliteration-transliteracia.php
There are other sources from SEO professionals, and they recommend the same transliteration rules.
I have used this transliteration scheme for two years, manually replacing the bg.php file with my version in each new version of Backdrop. I do the same for Drupal sites.
And I see in practice that the pages are indexed more successfully.
Thanks for taking the time to respond @amilenkov. There is a similar situation in Greek, where we have invented a method of input called Greeklish. There are variations of this method, depending on whether people prefer to be phonetically correct (the argument being that phonetic is simpler, and when non-natives read the words they sound more accurate) or orthographically correct (the argument being similar to the reason behind this joke).
I was hoping that there would be some publicly accessible, "Google-approved" list of transliteration lists, but it seems that this has been the product of empirical work 👍 ...oh well.
It is easy to find out which system works better with Google.
For example, take a search for the phrase "web site development", which in Bulgarian is written "разработка на сайт".
In this transliteration, the problematic letter is "й", because it has no Latin equivalent and is transliterated in different ways, such as "j", "y" or "i". Much depends on which language the person has studied and on their level of education.
Those who have studied English would use any of the three variants, which are equally understandable phonetically. But those who have studied Spanish would never use "j", because in Spanish it corresponds to an entirely different Bulgarian sound, one that in English may sound like "dzh" or "h".
So, if you search Google for "development of a site" in Bulgarian:
You'll see that most of the results have "site" transliterated as "sait" or "sayt".
But very few of the pages shown on the first pages of Google results, if any, will have "sajt" in their URL.
This illustrates how, in the course of everyday work and search engine optimization, you can find out in practice which transliteration system works.
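As a rough illustration of the three spellings discussed above, the following Python sketch generates the candidate Latin forms of "сайт" by varying only the rendering of "й". The `BASE` table and the `variants` function are hypothetical names for this example and are not part of Backdrop.

```python
import unicodedata

# Hypothetical table for the unambiguous letters of "сайт" (site).
BASE = {"с": "s", "а": "a", "т": "t"}


def variants(word, options=("i", "y", "j")):
    """Return one transliteration of `word` per candidate rendering of "й"."""
    # Normalize so that "й" is a single code point, not "и" plus a breve.
    word = unicodedata.normalize("NFC", word)
    return [
        "".join(opt if ch == "й" else BASE.get(ch, ch) for ch in word)
        for opt in options
    ]


print(variants("сайт"))  # ['sait', 'sayt', 'sajt']
```

Searching for each generated form and comparing how often it shows up in the URLs of the top results is the manual check described above.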
This looks like an easy win. Adding milestone candidate for the next bug-fix release.
That's great, thank you!
Since I opened this issue on October 13, 2018, I have had to manually edit the core/includes/transliteration/bg.php file with every core update to get the desired transliteration.
Since then I have carefully observed the transliteration used by other Bulgarian sites, and I am convinced that the proposed transliteration is the most common one and is acceptable to both site visitors and search engines.
I have developed and maintain more than 20 Backdrop CMS sites, and since the beginning of this year I have been working only with this CMS.
I've done a code review. Kind of hard to test but seems safe to include.
Looks like we have a failing test:
fail първа статия is correctly transliterated to pyrva statija (actual: parva statiya) in bg langcode. transliteration.test:60
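The failure above is what you would expect when the table changes but the test still asserts the old output. A minimal Python sketch of the two mappings, inferred only from the expected and actual strings in this test message (the names `OLD_TABLE` and `NEW_TABLE` are hypothetical, and the real bg.php tables contain many more entries):

```python
# Letters rendered the same way in both outputs of the test message.
COMMON = {"п": "p", "р": "r", "в": "v", "а": "a",
          "с": "s", "т": "t", "и": "i"}

# The two letters whose rendering differs between expected and actual.
OLD_TABLE = {**COMMON, "ъ": "y", "я": "ja"}
NEW_TABLE = {**COMMON, "ъ": "a", "я": "ya"}


def transliterate(text, table):
    """Replace each character found in the table; pass others through."""
    return "".join(table.get(ch, ch) for ch in text)


phrase = "първа статия"
print(transliterate(phrase, OLD_TABLE))  # pyrva statija
print(transliterate(phrase, NEW_TABLE))  # parva statiya
```

So the expectation in transliteration.test presumably needs updating to the new output, assuming the new table is the intended behavior.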
It looks like this issue was replaced by https://github.com/backdrop/backdrop-issues/issues/1604, which states a similar purpose and was also vetted by @amilenkov. Please reopen if I've misunderstood. I've merged the associated PR from that issue. Thanks @amilenkov and @klonos!
Some time ago I sent you a bg.php file for transliterating Bulgarian, and you put it in core\includes\transliteration. In this old file, the Latin transliteration is based on one of several different methods used in Bulgaria for transliterating Cyrillic into Latin.
Now I am sending a new bg.php file, which is based on the transliteration rules for Bulgarian that Google implements. This would help search engines better understand Bulgarian keywords and text written in Latin in URLs, and accordingly improve the SEO of Bulgarian web pages.
I suggest that you replace the old file with the new one in core\includes\transliteration in the next version of the core.
bg.zip
PR by @klonos: https://github.com/backdrop/backdrop/pull/2321