elementary / website

The elementary.io website
https://elementary.io
MIT License

Serbian Ijekavian localization returns 404 error #444

Closed tomicakorac closed 8 years ago

tomicakorac commented 9 years ago

I'm not sure what the best way to fix this problem is.

a) Transifex's code for the Ijekavian dialect of Serbian is sr@Ijekavian.
b) eOS's beta site is configured to use the localized URL http://beta.elementaryos.io/sr@Ijekavian/ in this case.

I'm guessing that the '@' in the URL is causing the URL to fail. If I am right, these are the questions:

If I am not right (if the '@' in the URL isn't the cause of this problem), does anyone else have an idea why this URL is failing?


emersion commented 9 years ago

Hi,

Yes, it's possible to change the directory name while keeping in sync with Transifex: http://docs.transifex.com/developer/client/config#language-map

Anyway, is it a good idea to rename it? Or is it better to keep the current name and to update the router?

lewisgoddard commented 9 years ago

According to this RFC (2.2. Reserved Characters), @ is a reserved character, like ':' and '/', and shouldn't be used for anything other than its specified purpose, which in this case is denoting a user under a domain.

Seeing as that's not how we're using it, aliasing sr_Ijekavian seems to be the recommended way to go.
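The aliasing idea can be sketched in Python. This is only an illustration, not the site's actual code: the `LOCALE_ALIASES` table and the `url_segment` helper are hypothetical names. It shows that `@` falls outside the characters `urllib.parse.quote` leaves untouched, while the aliased name passes through unchanged.

```python
from urllib.parse import quote

# Hypothetical alias table: Transifex language code -> directory/URL name.
LOCALE_ALIASES = {"sr@Ijekavian": "sr_Ijekavian"}


def url_segment(transifex_code: str) -> str:
    """Return the URL path segment to use for a Transifex language code."""
    local_name = LOCALE_ALIASES.get(transifex_code, transifex_code)
    # quote() keeps letters, digits and "_.-~" as-is and percent-encodes
    # everything else when no extra safe characters are allowed.
    return quote(local_name, safe="")


print(quote("sr@Ijekavian", safe=""))  # the "@" is percent-encoded
print(url_segment("sr@Ijekavian"))     # the alias needs no encoding
```

With an alias in place, the directory name and the URL segment can stay identical, so no encoding step is needed at all.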

tomicakorac commented 9 years ago

That does sound like the way to go. I would just suggest, while we're at it, that we lose the capitalization, especially since it will be a part of a URL, and make it sr_ijekavian

emersion commented 9 years ago

So far, we have kept capitalization in locale names (e.g. en_US), so we should keep it for this locale too.

tomicakorac commented 9 years ago

In the case of regional marks such as _US, _GB, _DE, _FR etc., in my opinion we're not talking about capitalization but about abbreviations, which is not the same as capitalizing only the initial letter, especially because Ijekavian is a dialect and not a territorial determinant. I would agree to have it in all caps if it were an abbreviation, e.g. _IJE. But in this case, if the parent language code is in lowercase, I think the dialect should be in lowercase too, especially bearing in mind that it would be used in a web address.

emersion commented 9 years ago

I think it's better to stick to the original language code (sr@Ijekavian) as closely as possible. Anyway, that's not really very important.

emersion commented 9 years ago

See this PR: #449

tomicakorac commented 9 years ago

http://beta.elementary.io/sr_Ijekavian/ returns 403 Forbidden now.

emersion commented 9 years ago

Yes, @fabianthoma needs to update the server config now.

emersion commented 9 years ago

Now returns the homepage, but still untranslated.

lewisgoddard commented 9 years ago

This appears to be still ongoing, but do we need sr_Ijekavian when we have sr ? We've gotten rid of all the other localized languages?

tomicakorac commented 9 years ago

I'm pretty sure that the vast majority of visitors need neither sr nor sr_Ijekavian. However, those few that do need one of them really do. I see there is an Arabic localization, and then there is Arabic (Sudan). There is Norwegian Nynorsk, and then there is Norwegian Bokmal. There is Portuguese (Portugal), and then there is Portuguese (Brasil). There is Chinese (Traditional), and then there is Chinese (Simplified). While it's true that whoever understands Serbian Ijekavian will also understand Serbian Ekavian (the normalized standard), there are two points that I do not understand:

  1. Why is it that Serbian (Ijekavian) is the only localization out of the several dozen existing ones that we are having difficulty implementing?
  2. Why is Serbian Ijekavian the only localization whose necessity is being questioned?

lewisgoddard commented 9 years ago

Most of those localizations don't appear on the actual site, only in transifex. So, for your second question, we've already discussed the need for secondary localization and decided that in most cases it is not necessary. As I said, we've previously gotten rid of such languages. We did say, however, that it would be reviewed on a case by case basis.

For your first question, no standards-compliant locale has characters like @ in their identifier. Locales should be identified by a two letter code, like en, or two sections of identical code, separated by an underscore, as in en_GB. Because @ is a reserved character, typically used for identifying users in a domain, most browsers do not take kindly to it being used in URLs.

tomicakorac commented 9 years ago

I'm not sure what you mean by "Most of those localizations don't appear on the actual site, only in transifex", since all the localizations I've mentioned do exist and are available to choose at http://elementary.io.

About the first issue, I've mentioned before that the notation which Transifex chose to use in this case is an obvious cause of the problem, but at the same time there is no particular reason to follow that notation. No one has ever used that exact marking, and it's still unclear to me why they did so. I might bring this up with Transifex, if that would resolve our problem, but I also believe that even if Transifex notation remained, there would still be a way to fix this bug.

lewisgoddard commented 9 years ago

I don't see the Norwegian localization on the site, but considering language names are translated, it's likely just me. As for the notation, we are indeed inheriting it from Transifex and every time we update the translations it will be pulled down into that location. I think we've tried to alias it but right now the nginx.conf has a regex designed solely for two letter language codes. Is there an equivalent for sr@Ijekavian in the format en_GB ?
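The limitation described here can be checked with a short Python sketch. It assumes the nginx pattern is the one in the rewrite rule quoted later in this thread; `extract_lang` is a hypothetical helper, and nginx uses PCRE, which may differ from Python's `re` in details that do not matter for this pattern.

```python
import re

# Pattern copied from the rewrite rule quoted later in this thread:
# two lowercase letters, optionally followed by "_" and two uppercase letters.
TWO_LETTER = re.compile(r"^/([a-z]{2}(?:_[A-Z]{2})?)/(.*)$")


def extract_lang(path):
    """Return the language prefix of a URL path, or None if none is recognised."""
    match = TWO_LETTER.match(path)
    return match.group(1) if match else None


for path in ("/sr/", "/en_GB/index", "/sr_Ijekavian/", "/sr@Ijekavian/"):
    print(path, "->", extract_lang(path))
```

Under this pattern neither spelling of the Ijekavian locale is recognised as a language prefix, which is consistent with the page falling through to the untranslated site.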

tomicakorac commented 9 years ago

The two Norwegian dialects are listed as just Bokmal and Nynorsk, but I can see they're both there. As well as all the other double localizations I've mentioned.

I am not aware of any two-letter code for Ijekavian, unfortunately. But, again in my opinion, there are two facts we should have in mind here:

  1. The en_GB notation has never been standardized anywhere to this date. It's just something that once seemed right to someone, and then everyone just kinda went along with it (with slight modifications here and there, as we can see in the example of Transifex). Although for most of the 'big' languages out there the flaws of this kind of notation are not easily noticeable, or even do not exist, there are several strong arguments against it in general.
    • I have personally been a witness of my country changing its name 5 times in less than 10 years, so it becomes hard enough even for the natives to keep track of which abbreviation means what. At the same time, there have also been certain changes in what is actually being considered as 'Serbian language'. So neither part of the sr_RS abbreviation would be persistent, and the meaning of both has changed in time, so what has been sr_RS 20 years ago is not the same as today, and expecting it to change again in the future is also justified. I really don't want to go into details as to why or how it happened with Serbian, but I know dozens of other languages with the same or similar problems of mostly political nature (which, I hope we can agree, must not be of any concern to simple translators).
    • The en_GB notation presupposes that a single language will be strictly confined within the territory of a single country, and/or vice versa. In reality this is a mere exception reserved for the several most developed and politically stable countries. As for the rest, it just won't do. sr_RS does not cover Serbian language in Serbia, as there are two major dialects of Serbian being spoken in Serbia. On the other hand, there are two major dialects of Serbian being spoken in Bosnia and Herzegovina, but at the same time they're identical to their respective equivalents in Serbia. In short, territorial boundaries do not have any role here, and will just bring in confusion and probably even political controversy. Again, even though I've illustrated my point with the example most visible to me, I've come across a great number of languages with the same problems, and only a handful of those which are lucky enough not to be affected.
  2. As per the only internationally accepted language code standard, there are no country codes in language codes, which historically led developers to come up with a number of non-standard localization codes, one example being the en_GB notation. Having all that in mind, I see no reason whatsoever for us to come up with a new non-standard localization code (at least for Serbian) which would suit our needs.

I can suggest the following:

emersion commented 9 years ago

If we want to keep the xx_YY format for languages, we have to choose a free country code from here: https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2#Decoding_table

I would recommend using user-assigned codes, to avoid using a code that could be assigned to a country in the future. But since this second code is related to the country and we are speaking about dialects, this does not make sense...

@lewisgoddard Right now, Serbian Ijekavian has been renamed to sr_Ijekavian in the Transifex config file. Do you think it's possible to allow [lang]-[country]_[dialect] in language codes in the Nginx config? Or would you prefer not to change the config and to select a custom country code for Serbian, preventing page names like aa_bcd from being mistaken for language codes?

emersion commented 9 years ago

BTW, we should update the Nginx config to allow three-letter language codes (e.g. Chinese (Min Bei) (mnp)): https://www.transifex.com/languages/

lewisgoddard commented 9 years ago

I had written a piece, but I've been having power-related issues at work. Effectively, I'd prefer to keep the regex simple and modify this one outlier. Very rarely is more complex regex better. sr_JK seems like the best option.

As for three letter codes, I am not overly familiar with regex, but changing it like this might do it.

rewrite "^/([a-z]{2}(?:_[A-Z]{2})?)/(.*)$" /$2?lang=$1 last;
rewrite "^/([a-z]{2}(?:_[A-Z]{2-3})?)/(.*)$" /$2?lang=$1 last;

emersion commented 9 years ago

The correct regex for three-letter codes is:

rewrite "^/([a-z]{2,3}(?:_[A-Z]{2})?)/(.*)$" /$2?lang=$1 last;

Okay, then let's change the language code mapping.
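A rough Python simulation of this rule, for illustration only: `rewrite` is a hypothetical helper that mimics the `/$2?lang=$1` substitution, and Python's `re` stands in for nginx's PCRE. The second pattern is the `{2-3}` variant proposed above; in Python's `re` that sequence is not a quantifier and is matched as literal text.

```python
import re

# Pattern from the rule above: a two- or three-letter language code,
# optionally followed by "_" and a two-letter uppercase code.
LANG_PREFIX = re.compile(r"^/([a-z]{2,3}(?:_[A-Z]{2})?)/(.*)$")

# The earlier "{2-3}" proposal: here "{2-3}" is treated as literal characters.
BROKEN = re.compile(r"^/([a-z]{2}(?:_[A-Z]{2-3})?)/(.*)$")


def rewrite(path):
    """Mimic the nginx substitution /$2?lang=$1; unmatched paths pass through."""
    match = LANG_PREFIX.match(path)
    if match is None:
        return path
    lang, rest = match.groups()
    return f"/{rest}?lang={lang}"


for path in ("/mnp/docs", "/sr_JK/", "/sr_Ijekavian/", "/store/cart"):
    print(path, "->", rewrite(path))

print(BROKEN.match("/en_GB/"))  # no match: the optional group cannot apply
```

One consequence worth noting: with three-letter codes allowed, any three-letter lowercase top-level page name would also be read as a language code, so such page names would need to be avoided or excluded.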

lewisgoddard commented 9 years ago

@emersion Currently, /lang/ has sr and sr_Ijekavian. What are we doing moving forward?

emersion commented 9 years ago

We can move /lang/sr_Ijekavian/ to /lang/sr_JK/ and change the mappings in the Transifex config file, and this issue should be solved.