Reinmar commented 7 years ago

EDITED: See https://github.com/ckeditor/ckeditor5/issues/387#issuecomment-271600724

Process

We code the features using the t() functions in which we define the English value of a string (and, optionally, a context). E.g. t( 'Bold' ) or t( 'Button [context: clothing]' ) (the [context: clothing] will be automatically removed upon build on, in the dev mode, on runtime). Each context must be defined in lang/contexts.json of a package which uses it or in the ckeditor5-core/lang/contexts.json (for common strings).
From time to time, we run a tool which scans code for all t() usages, extract strings from it, builds a contexts map (based on all lang/contexts.json files) and checks whether all used strings have defined contexts. Then builds a PO file with English strings and contexts and upload it to Transifex.
Then we run a tool which downloads all PO files for all defined language and put them back in lang/ directories.
When building an editor, a specific bundler read PO files and create translation map(s) for chosen language(s) based on them and bundles those as defined in the next section.

Implementation

`ckeditor5-core/src/editor~Editor#constructor`

I propose to merge the ctx name into the string in order to keep the same t() params in every possible environment t( str, values ). If it was a separate param, then in the build version (in which we replace strings with ids) it would have to be left unused, or, we'd need to change the implementation of t() on the fly.

The CKE_LANG will be defined by the bundler. It will be the editor class that's going to set it which means that it will only need to be set when bundling for the browser environment. Or we could go a bit further than that and define utils/global object which would retrieve the global scope depending on the env. In the browser that would be a window, in Node.js that would... something which is global there. That would allow this code to work without the bundler needing to preset this value.

PS. we already have utils/dom/global, but I think that it makes sense to keep them separated.

this.config.define( 'lang', global.CKE_LANG || 'en' );

this.locale = new Locale( lang );

const t = this.locale.t;

// Usage:
t( 'OK' );
t( 'button' );
t( 'button [context: clothing]' );

`ckeditor5-*/lang/contexts.json`

Contexts for used messages.

Examples: https://github.com/ckeditor/ckeditor-dev/tree/master/dev/langtool/meta

ckeditor5-core/lang/contexts.json:

{
    "OK": "Label for OK button in a dialog."
}

ckeditor5-form/lang/contexts.json:

{
    "button": "Name of a clickable form control, e.g. 'OK button'."
}

ckeditor5-tailor/lang/contexts.json:

{
    "button [context: clothing]": "Button as used in clothes."
}

The button is first defined in the ckeditor5-form package without a context, because, e.g. historically, we could've used it there without a context (cause, in our case, button is a UI button most of the time). Then, while working on the CKEditor 5 Tailor plugin we realised that button is already used, but in a different context, so we can't use t( 'button' ) as it will point to a wrong context definition (contexts are global for all packages). Instead, we'll use t( 'button [context: clothing] ' ) and add its own definition in the ckeditor5-tailor/lang/contexts.json.

`ckeditor5-utils/src/locale`

No magic here – uses the translate() function of the utils/translation-service module to get translated string and replaces placeholder values ($1, $2, etc.) with passed args.

import { translate } from './translation-service';

export default class Locale {
    constructor( lang ) {
        this.lang = lang;
    }

    t( str, ...values ) {
        const translatedString = translate( this.lang, str );

        // ... do the rest (values replacement)
    }
}

`ckeditor5-utils/src/translation-service`

The goal of this module is to encapsulate the whole "translations repository" logic from the rest of the code.

It may be dirty with some preprocessor rules or may need to be generated on the fly depending on the build type (e.g. with or without code splitting – i.e. with separate or built in language file). However, it would be good if it was defined in the repository so for dev purposes we wouldn't have to do anything. It'd just return the passed string.

Development mode implementation:

export function translate( lang, str ) {
    // Remove the ` [context: ...]` suffix.
    return str.replace( / \[context: [^\]]+\]$/, '' );
}

For the bundles, we have two cases:

where only one language is needed,
where there are multiple languages.

In case of just one language, it'll be best to simply replace the str param of t() calls with the proper value (without the ctx now). This will allow for code splitting and hot-loading plugins without any tricks. It may happen, though, that some strings will then repeat multiple times – e.g. the ones from the core package. While this is going to make an un-gzipped package bigger, there shouldn't be a difference in case of a gzipped code. Besides, this will be just a few strings anyway and we'll save some space having such a simple implementation too.

export function translate( lang, str ) {
    return str;
}

In case of multiple languages we need to have some registry. The translate() implementation will be again simple:

export function translate( lang, str ) {
    return translations[ lang ][ str ];
    // Let this be objects, not maps, because we control the values of lang and str
    // and it will be a bit easier to generate a source of an object programatically.
    // Objects may also be better in terms of code size.
}

What's the str in this case? In order to ensure that we don't have name collisions (important for bundle splitting) I'd say that this should be either:

A totally unique string – like typical uid, but preferably shorter, using a wider range of unicode characters. A 5 chars long string using a range of all Unicode chars will give us comparable complexity to utils/uid:
```
> Math.pow( Math.pow( 2, 16 ), 5 )
1.2089258196146292e+24
> Math.pow( 35, 16 )
5.070942774902497e+24
```
The thing which worries me is whether there won't be any issues with encoding – we've seen people sourcing CKEditor in some weird encodings and it was always blowing up. But one could use a minifier which encodes special characters and it was be working again.

Another thing is ability to debug such code. With unreadble ids it may be tricky.

Finally – creating objects in which these uids are keys may be tricky. I can't find now whether there are any characters which needs to be escaped (other than the closing quote).

Anyway, this solution may create unpredictable and stupid issues.
Therefore, we may just use sequential ids from a short range (e.g. [a-z]). What about code splitting? There are two cases:
- Code splitting was done when building the whole setup – then there won't be any problems in using non-conflicting ids, because it's done by one bundler at one time.
- Someone, first built e.g. ckeditor5-preset-article-editor and then, separately (might be a different person), built ckeditor5-image. In this case, the preset bundle will use normal, short, ids (because it's the main bundle) and the image feature bundle will use prefixed ids (e.g. ckeditor5-image/<id>). The idea is that a developer releasing his/her package will use a special bundler setup which configures CKEditor plugin for Webpack to use prefixed ids.

Anyway, this is nothing we have to worry today, because, most likely, we'll work on releasing standalone package bundles after 1.0.0.

Another thing to notice is that support for multiple languages and code splitting is an optional feature, so we can implement it for just one bundler, i.e. Webpack.

Let's say, that we want to split a package to some preset X and packages Y and Z. This will make for 3 files – x.js, y.js, z.js.

The idea is that each files will have an accompanying language(s) files defining a translations needed for this file. So there will be:

x.js, x-pl.js, x-en.js, ...
y.js, y-pl.js, y-en.js, ...
z.js, z-pl.js, z-en.js, ...

In order to run an the X preset with plugins with Polish translations one will need to load: x.js, y.js, z.js, x-pl.js, y-pl.js, z-pl.js.

The language files will be built using entry points like this:

import { define } from '@ckeditor/ckeditor5-utils/src/translation-service';

define( 'pl', {
    'a': 'OK',
    'b': 'Anuluj',
    // ...
} );

And the define() function will merge these translations into other that it already has.

Tadam!

PS. PO file generation

We've been changing the idea which part of t() call is msgctxt, msgid and msgstr twice already, so let's clarify this:

For the following t() calls and contexts.json:

t( 'OK' );
t( 'button' );
t( 'button [context: clothing]' );

ckeditor5-core/lang/contexts.json:

{
    "OK": "Label for OK button in a dialog."
}

ckeditor5-form/lang/contexts.json:

{
    "button": "Name of a clickable form control, e.g. 'OK button'."
}

ckeditor5-tailor/lang/contexts.json:

{
    "button [context: clothing]": "Button as used in clothes."
}

The en.po file would look like this:

msgid "OK"
msgstr "OK"
msgctxt "Label for OK button in a dialog."

msgid "button"
msgstr "button"
msgctxt "Name of a clickable form control, e.g. 'OK button'."

msgid "button"
msgstr "button"
msgctxt "Button as used in clothes."

In this case the msgstr can be empty because then gettext will use msgid. In fact, in the samples I found in https://github.com/ckeditor/ckeditor5/issues/387#issuecomment-271349641 it's empty.

Regarding https://github.com/ckeditor/ckeditor5/issues/387#issuecomment-271349641:

As msgid we use the id of that string, not the "Bold" string. Why? The reasons were explained in the post I quoted – the msgid has limited length and, more importantly, you lose all the translations if you change the id. So, if we'd use the real string "Bold" as msgid, if we decided that for some reason it should be "Bold text", then translations would need to repeat their work and for some time we wouldn't have this string translated at all. This was pointed out by @wwalc and it happened (the English text was changed) in the past in CKE4 multiple times.

We ignore this issue. It's very rare situation.

jodator commented 7 years ago

Do you plan to sue whole API from gettext? Like plural handling and stuff?

Reinmar commented 7 years ago

Nope, I don't think so. Maybe I'm wrong, cause finally I haven't checked that precisely, but I think that it's too much work and we have very little use cases for those things, so this would be a low priority.

I mean – knowing CKE4 – there were just a couple of places where we had to format a plural, which you can always somehow workaround, even if not very gracefully. So this wasn't high on the priority list.

OTOH, perhaps we could dive into this, and if it's possible, predict a full gettext support. I wouldn't like to work on it now, but if it would be possible to implement it in the future, then we'd be safe. Or it may turn out that it's simpler than I think... I'll better check it :D.

Reinmar commented 7 years ago

From what I understand from https://www.gnu.org/software/gettext/manual/gettext.html#Translating-plural-forms we need to pass a count number to the t() function, so it knows which translation to use.

ngettext [option] [textdomain] msgid msgid-plural count

So it'd be like:

t( 'file-upload/uploaded-message: Uploaded a file. | Uploaded %0 files.', fileCount, fileCount );

Why fileCount is twice? Because the first one is the count, the second is the %0. It's a bit ugly, I know. The first param gets very complicated as well... perhaps these would need to be 3 separate strings then or an object (better, because t() won't change its number of params in the built code). Actually, in this case the count could be moved to that object too:

t( {
    context: 'file-upload',
    id: 'uploaded-message',
    singular: 'Upladed a file.',
    plural: 'Uploaded %0 files',
    count: fileCount
}, fileCount );

Now, thet() function needs to be able to understand which of the plural forms it needs to use. So we'd need to include in the bundles the function which represents a plural formulas, the translate() function will know it and return the proper string based on the passed count. Then, the normal %0 substitution will take place.

If I don't miss anything, I think that we should be able to add support for plural forms in the future. It will require more a complicated logic for t() extractions and getting the plural formula functions somehow...

Reinmar commented 7 years ago

BTW, I haven't mentioned one thing here – the use of external libraries. Obviously, there are libraries which parse and write PO files and we'll use one. However, for the string resolution logic (so what the translate() function should do) I don't think that we'd be able to use any library. It needs to be perfectly integrated with our code and bundling process, so I don't see much space for any 3rd party helpers here.

wwalc commented 7 years ago

Looks great. It would be nice if the tool could list some errors like:

entries without context,
translations of the same context which overwrite each other,
using context which is not described in any contexts.json file,
context definitions that overwrite each other (the same context defined across multiple contexts.json more than once),
finding a pipe in the text and no vars defined, which would mean that the pipe was used as a part of the text (some people will create language files based on existing files, they will not read the full spec to understand the special meaning of a pipe)
other?

ma2ciek commented 7 years ago

@Reinmar Maybe it's more an edge-case, but words in many languages can have more than one plural form. All in all you have written a very clean view of what we need to implement.

jodator commented 7 years ago

@ma2ciek gettext has it already defined and its standard is used in many places.

Checkout this: https://www.gnu.org/software/gettext/manual/html_node/Plural-forms.html

It handles polish irregular plural forms.

Reinmar commented 7 years ago

entries without context,

Yep.

translations of the same context which overwrite each other,

Yep.

using context which is not described in any contexts.json file,

Yep.

context definitions that overwrite each other (the same context defined across multiple contexts.json more than once),

Yep.

finding a pipe in the text and no vars defined, which would mean that the pipe was used as a part of the text (some people will create language files based on existing files, they will not read the full spec to understand the special meaning of a pipe)

You mean, in the plural usage form? I don't want to use "|" there, that's why I proposed the object format, as more explicit and less error prone. OTOH, the pipe format could be an option and totally satisfactory in most of the case. But we won't work on this now anyway, so we can discuss this in the future.

Reinmar commented 7 years ago

@Reinmar Maybe it's more an edge-case, but words in many languages can have more than one plural form. All in all you have written a very clean view of what we need to implement.

The idea is that en has 2 forms, so our t() usage needs to predict only that + the count param. The translate() function must support multiple forms, because only it will have to worry about non-eng languages.

wwalc commented 7 years ago

I don't want to use "|" there, that's why I proposed the object format

👍

Reinmar commented 7 years ago

I found a quick tutorial how gettext workflow looks like and it's a bit different than how I imagined it: http://www.labri.fr/perso/fleury/posts/programming/a-quick-gettext-tutorial.html

You write code with e.g. _("Hello World\n").
You run xgettext on this code and create a .pot file (PO template) with such a content:
```
msgid "Hello World\n"
msgstr ""
```
You run msginit to create a .po file for a concrete language.
Now you add translations of this text in that .po file.
Finally, you run msgfmt to create a binary for that .po file, which equals our build step.

It's also important that when you run xgettext again, you run it with the -j (--join-existing) option which... I don't precisely know what it's doing :D. Especially that then you run msgmerge to merge these changes with the concrete .po files.

It would be worth checking how this merging process look and how sending this to Transifex looks like too. E.g. can we just generate en.po files from the code and send them to Transifex skipping the whole update and .pots generation process?

For me the process could look like this, but perhaps I miss something:

Generate en.pos out of the code.
Send them to Transifex.
Get .po files for all other languages back.

jodator commented 7 years ago

AFAIR the .pot is the file which is used as the source of currently available msgs. In time some messages are added and some are removed. Current state of app's messages are in current .pot file.

The msgmerge will update translation files (.po files) so obsolete messages are removed and new strings are added.

This workflow is nicely supported by some tools (namely poedit ) in which you can run .pot file against .po file to update translations and create updated .po file.

So in such workflow the .pot file is needed for updating the .po files in future.

I think that the en.po would be the same as ckeditor5.pot. But I don't know if creating en.po makes sense (dunno how transifex works in that matter).

Reinmar commented 7 years ago

Also, I found https://lingohub.com/blog/2013/07/php-internationalization-with-gettext-tutorial/#What_form_of_msgids_should_be_used which clarifies how the msgids should be created:

As previously mentioned, msgid is used by gettext to identify the msgstr that should be displayed in its place. If multiple equal msgids are present, msgctxt is used to tell them apart. There are several more important roles that msgid plays:

msgid is what the programmer sees in the code

msgid is what is being displayed by default if the proper msgstr in the current locale can not be found (as a fall back)

in traditional gettext usage, the translator translates from msgid to the target locale, instead of from source msgstr to target msgstr; so, msgid is what he sees and what gives him information what should be translated

Therefore it is important to chose the form of the msgid carefully. The value of the msgid can be any string. The gettext manual suggests that the original string in the source (or English) language should be used. That might pose certain problems, for example if msgstr of the original string is changed and if the programmer decides to synchronize the msgid with a new msgstr, he would have to do it in all places in the code where it is used as well as in all other PO files. Also, there is no limit to msgid or msgstr length. Some lengthy msgids would make the code less readable.

Some developers propose that the msgids should be constructed in a more structured way [5] [6], where msgid describes its role in the application or role in the code (in which view, partial, container are they located) instead of its content. It is a great way to organize the strings from the developer’s perspective, but it might pose problems to the translator, since the translator no longer has the a source string in the msgid to translate from.

This reminded me, what I forgot to describe and what have been confusing for you – where in the t() calls do we have msgctxt, msgid and msgstr. So, to clarify this. For the following t() call and contexts.json:

t( 'basic-styles/bold: Bold' );

basic-styles/lang/contexts.json (note, I was previously using a deeper structure for the contexts.json file, but it doesn't make sense – both levels (package name and the id can be together)):

{
  "basic-styles/bold": "Bold button label."
}

The en.po file would look like this:

msgid "basic-styles/bold"
msgstr "Bold"
msgctxt "Bold button label."

So:

As msgid we use the id of that string, not the "Bold" string. Why? The reasons were explained in the post I quoted – the msgid has limited length and, more importantly, you lose all the translations if you change the id. So, if we'd use the real string "Bold" as msgid, if we decided that for some reason it should be "Bold text", then translations would need to repeat their work and for some time we wouldn't have this string translated at all. This was pointed out by @wwalc and it happened (the English text was changed) in the past in CKE4 multiple times.
So, if the msgid of the string is not the string itself, then why defining the contexts at all? We could have e.g. this:
```
t( 'core/button: Button' );
// in other context:
t( 'tailor/button: Button' );
```
Theoretically, the msgid clearly differentiates the use. However, as @wwalc pointed out again, translators on Transifex don't know CKEditor that well, so they might not be able to translate strings well enough based on short msgids. Hence, we need textual, human readable contexts just like we had in CKEditor 4 (e.g. https://github.com/ckeditor/ckeditor-dev/blob/master/dev/langtool/meta/ckeditor.plugin-basicstyles/meta.txt):
Finally, by requiring specifying the msgid in every t() usage, we'll be sure that the developer thought about what context it is, we'll be able to reuse translations from the core package, we'll know whether a certain text was defined in some other package already (we can have a tool to print msgstrs which appear with more than one msgid), etc. It's safe and pretty concise and useful at the same time.

Reinmar commented 7 years ago

Thanks @jodator. TBH, I have still troubles in understanding the real purpose of pot. Perhaps there are situations in which it allows resolving some unclear combinations of which string is defined in which po file. Or maybe it's also related to the topic I've just described – what is msgid? If you don't use real en strings for msgids, then the .pot file contains a real template of all other po files. Even to create the (base) en.po file, you need to add the msgstrs for all the entries, because, AFAICS, the _() function is used only with msgid. So in this case en.pot !== en.po.

But, if we accept what I described above, then en.pot === en.po, so we can generate just the en.po file and, if anyone wants it, the en.po can be used as en.pot.

Reinmar commented 7 years ago

Anyway, @ma2ciek will need to dig into Transifex docs to learn about the proper process.

franciscop commented 7 years ago

Just wanted to chip in to say that, with the new and awesome ES6 string templates, there is a lot of new functionality that can be applied here. For example, I just made a demo in 5 min (jsfiddle):

let translations = {
  'I am $1 $2!': {
    en: 'I am $1 $2!',
    es: '¡Soy $1 $2!',
    jp: (name, lastname) => `私は${lastname.toUpperCase()} ${name.toUpperCase()}です`
  }
};

let name = 'Francisco';
let lastname = 'Presencia';

let translation = t`I am ${name} ${lastname}!`;

Even in this simple example this looks more natural than:

let translation = t('I am $1 $2', name, lastname);

Of course you could modify many things and improve others. The point is, with template strings the syntax is closer to natural English while giving you more flexibility than normal functions.

Mavrin commented 7 years ago

Hi all. I implemented internationalization service for targetprocess. And I talk you how we made it. Maybe it will be useful for you. We use Transifex as well. This is my talk at one conference https://mavrin.github.io/i18n-pres (Russian). I try explain the main idea. We don't use syntactic keys. I mean something like that __("basic-styles/bold") . We use original English phrase __("Bold button label.") . What it gives us? We see in development mode good string and we don't need dictionary for 'en' lang. Function __ have implementation like this

const currentLang ='en'; //this variable we get from user settings
const defaultLang = 'en';
const dictionary = { // it load dynamically from separate bundle
'ru': {
  'Bold button label.': 'Толстая кнопка'
}
}
const translate =() => {/*function which apply plural format or come variable*/} // we use https://github.com/yahoo/intl-messageformat for it
cost __ = (string, params) => {
  if(currentLang === defaultLang) {
    return translate(string, params);
  } else {
    return  translate(dictionary[currentLang][string] || string, params);
  }
}

How we made lang files for others. We put all dictionary to github repo.

We run tool on our CI which parse all code base and make files with dictionary. This tool support context for folder or single file, just specific comment to file or add meta file to folder . You could look it in test https://github.com/TargetProcess/tau-extract-gettext/tree/master/test
We push this dictionary to Transifex with help https://github.com/TargetProcess/tau-transifex/blob/master/index.js#L136-L152
We run sync between Transifex and Github with help https://github.com/TargetProcess/transifex-github-integration. It get all dictionaries from Transifex and push to GitHub and create tag. We created web service wich we can install in your server and call update route. This service allows you to see the full version of the dictionary.

Reinmar commented 7 years ago

Of course you could modify many things and improve others. The point is, with template strings the syntax is closer to natural English while giving you more flexibility than normal functions.

Hi! Thanks for chipping in :) The use of templates was already proposed: https://github.com/ckeditor/ckeditor5-design/issues/136#issuecomment-191770415 and we came to a conclusion that it's not going to work in our case.

We need to synchronise translations through a service like Transifex which means that they need to be plain text strings. Also, with the t() function we're able to pass an object there for a more complicated cases (see https://github.com/ckeditor/ckeditor5/issues/387#issuecomment-271249275).

EDIT:

@franciscop, I took a bit closer look at your demo and it's really interesting. I see that in the translation repository there are plain text strings and that you can change the order of placeholders in a translation and everything works fine:

en: 'I am $1 $2!',
es: '¡Soy $2 $1!',

If we were able to ensure that in the future we'd be able to implement the plural forms (see https://github.com/ckeditor/ckeditor5/issues/387#issuecomment-271249275), then using template strings could be an option. It doesn't sound particularly tricky to have the t() function supporting templates and objects at the same time, so it seems feasible.

Reinmar commented 7 years ago

I edited my previous comment because it didn't make a lot of sense initially and I created https://github.com/ckeditor/ckeditor5-utils/issues/117 – it'll be worth considering adding support for template strings to the t() function.

ma2ciek commented 7 years ago

So I'm working now on the very basic support without plural forms and template strings which we'll add later. I'll be pushing changes to the pull request for the https://github.com/ckeditor/ckeditor5-utils/issues/116

After small searching for the phrase t( ... ), which we want to find, collect and export to the .po file, I have found linkformview.js which is using generic form of t() that crashes a little bit this idea. Are these generic forms more?

Reinmar commented 7 years ago

I have found linkformview.js which is using generic form of t() that crashes a little bit this idea. Are these generic forms more?

That's a bug. t() should always be used with a string. You should throw an error if it's not.

Reinmar commented 7 years ago

I talked with @fredck and we decided to simplify the t() calls by removing the msg id from them. It has some downsides (most of the things listed in https://github.com/ckeditor/ckeditor5/issues/387#issuecomment-271349641), but it has one important upside – the t() calls will be shorter. Since the downsides will occur in rare situations, it's better to focus on something which most of us see – the t() calls.

I updated the spec in the issue description and for you to be able to understand what I changed I created https://gist.github.com/Reinmar/2ca4400d5fb724fd9be88fd903aee8eb/revisions.

Reinmar commented 7 years ago

@Mavrin

Hi all. I implemented internationalization service for targetprocess. And I talk you how we made it. Maybe it will be useful for you. We use Transifex as well.

Thanks for the tips. I can see that we're planning to implement the same solution (especially after simplifying the t() calls), so we're going in the right direction :)

ma2ciek commented 7 years ago

So for the clarification @Reinmar @wwalc I'm going to implement errors after spec update for the following cases:

t() call parameters without relevant translation key in package/lang/contexts.json and core/lang/contexts.json,
translation keys in contexts.json without relevant parameter t() call,
context definitions that are defined across multiple contexts.json more than once for the same translation key.

Reinmar commented 7 years ago

Sounds good to me.

ma2ciek commented 7 years ago

I have created working version of collecting translations from contexts and t() calls with error handling.

Script requires branches https://github.com/ckeditor/ckeditor5/tree/t/387 and https://github.com/ckeditor/ckeditor5-dev/tree/t/44. Also it requires latest masters of other ckeditor5-* packages.

Usage:

gulp translations:collect

Expected result: following logs and ckeditor5.pot with English strings and contexts.

[11:25:33] Using gulpfile ~/data/www/ckeditor/ckeditor5/gulpfile.js
[11:25:33] Starting 'trans:collect'...
[11:25:34] Created file: /home/maciek/data/www/ckeditor/ckeditor5/build/.transifex/ckeditor5.pot
[11:25:34] Finished 'trans:collect' after 528 ms

Reinmar commented 7 years ago

Just please call the task translations:collect ;)

jodator commented 7 years ago

Another proposal: i18n:collect.

Reinmar commented 7 years ago

i18n is about more than translations, so it's not very precise.

ma2ciek commented 7 years ago

Ok. Thanks. It's fixed. now there is a task called translations:collect. And the output file is now ckeditor5.pot, not english.po.

luzfcb commented 7 years ago

please, use a standard to define the language files e translations files, like ISO639-1 (or variations), IETF BCP47

Reinmar commented 7 years ago

@luzfcb, we'll use formats supported by gettext. This is a very common standard so there will be no problem with interoperability if that's what you meant.

ma2ciek commented 7 years ago

@luzfcb https://www.gnu.org/software/trans-coord/manual/gnun/html_node/PO-Header.html - there is even utf-8 in the header example .

fredck commented 7 years ago

I think @luzfcb is talking about the name of the language files. We'll definitely use ISO 639-1 for that (e.g. English is "en"). I was also scared about the previous "english.po" name :)

fredck commented 7 years ago

Actually, to be more precise, it'll be IETF with ISO 639-1.

luzfcb commented 7 years ago

@Reinmar @ma2ciek Sorry if I was not clear enough. The standard I'm talking about is how to name the translation files as @fredck mentioned.

I had a problem with select2, because select2 does not use a known standard for naming translation files. This made it difficult for me to set up an equivalence table between the standard used by Django Framework and the file names used by select2, to force select2 use the language selected by user.

Django uses gnu gettext and store the translations files (.po and .mo) inside a "locale" directory, like https://github.com/django/django/tree/master/django/contrib/admin/locale These files are dynamically found by the translation functions.

https://docs.djangoproject.com/en/1.10/topics/i18n/ https://docs.djangoproject.com/en/1.10/topics/i18n/translation/

ma2ciek commented 7 years ago

@luzfcb Thanks for the clarification. We'll use ISO 639-1 standard.

I'm working now on the 3. and 4. step (download and translation maps). I have chosen https://github.com/smhg/gettext-parser to parse PO files (project has 40 stars, but It's still maintained and the most popular po to js parser I have found). And now I'm testing the Transifex API on the simple project to automate 3.

ma2ciek commented 7 years ago

We decided with @Reinmar to upload one file per package to get easy download process. Otherwise there is a problem with splitting one big translation file into many packages accordingly to their contexts.json files.

It's important to remember that all translation keys are unique across the contexts.json files, so they won't be duplicated on Transifex.

Reinmar commented 7 years ago

We have a problem with implementing the plugin for Webpack which would be able to generate modules with translations on the fly, based on discovered (imported) modules. We asked a question on StackOverflow if you want to help: http://stackoverflow.com/questions/41869498/how-to-write-webpack-plugin-which-adds-modules-to-the-bundle-on-the-fly-based-on.

Any help would be greatly appreciated!

ma2ciek commented 7 years ago

Ok. So for now there will be only single language support. We'll add the multi-language feature in the future when the problem will be resolved.

There are exported 3 main tasks in the ckeditor5 package:

gulp translations:collect
gulp translations:upload -u user-p password
gulp translations:download -u user -p password

First task creates .pot files from the t() calls for each package that contains 'contexts.json', checks for the errors and saves these .pot files in the build/.transifex directory. Second task uploads .pot files on the Transifex. It creates resource if the package translations are uploaded for the first time, otherwise it updates existing resource. Third task only downloads all translations in .po file format from the Transifex and saves them in the [packageName]/lang/translations/[language_code].po format.

After translation downlaod, the CKEditor5WebpackPlugin will be able to take languages option:

plugins: [
    new CKEditorWebpackPlugin( {
        languages: [ 'pl' ]
    } )
]

During the compilation plugin is replacing first parameter of the t() calls with the translated strings in the bundled source of ckeditor5.

It should be noted, that using function named define() throws errors because of the https://github.com/ckeditor/ckeditor5-dev/issues/59 issue.

Scripts require https://github.com/ckeditor/ckeditor5-utils/tree/t/116, https://github.com/ckeditor/ckeditor5/tree/387 and https://github.com/ckeditor/ckeditor5-dev/issues/44.

Reinmar commented 7 years ago

Great to see progress on this :)

A couple of questions:

Third task only downloads all translations in .po file format from the Transifex and saves them in the [packageName]/lang/translation/[language_code].po format.

I'm not sure about this path. The spec said translations should be placed in lang/ directly. I guess you wanted to separate those po files from contexts.json and en.pot, I guess it makes sense. However, then the correct path would be lang/translations/.

Scripts require https://github.com/ckeditor/ckeditor5-utils/tree/t/116, https://github.com/ckeditor/ckeditor5/tree/387 and ckeditor/ckeditor5-dev#44.

Is any of these on review two? Please make PRs in every repo if those branches are supposed to be reviewed together.

Reinmar commented 7 years ago

I've been checking the state of the env utils and I found this: https://github.com/ckeditor/ckeditor5-dev/compare/t/44#diff-71629f5ae239e9d11642ad3fec3911c1R33.

This is not going to work in practice. The t() call may be formatted in various ways and used in at least two contexts (after space and after tabulation), but that happens only in our code style. If different code style is used, there are many more variations. Besides, it's going to break with a simple string containing ")" character.

Let's keep this code for now but it must be improved. It must be using a proper JS parser.

ma2ciek commented 7 years ago

EDITED

I'm not sure about this path. The spec said translations should be placed in lang/ directly. I guess you wanted to separate those po files from contexts.json and en.pot, I guess it makes sense. However, then the correct path would be lang/translations/

Yes, I made a typo. Translations are saving under the lang/translations/ path.

Is any of these on review two? Please make PRs in every repo if those branches are supposed to be reviewed together.

I'll fix few issues and I'll create a pull request within an hour.

I've been checking the state of the env utils and I found this: https://github.com/ckeditor/ckeditor5-dev/compare/t/44#diff-71629f5ae239e9d11642ad3fec3911c1R33. This is not going to work in practice. The t() call may be formatted in various ways and used in at least two contexts (after space and after tabulation), but that happens only in our code style. If different code style is used, there are many more variations. Besides, it's going to break with a simple string containing ")" character.

I had to misunderstand implementation of this part. I hope it's not going be hard to add babel or similar parser here to find t() call occurrences. There is already a Webpack parser that solves this issue: https://github.com/webpack-contrib/i18n-webpack-plugin/blob/master/index.js, but it won't be able to share code logic across Webpack and Rollup plugins.

ma2ciek commented 7 years ago

After a week I improved the t() calls find and replace using the Acorn parser and modified version of Escodegen generator (which master do not use latest version of Estraverse and throws errors during export default declaration parsing). So for now ckeditor5-dev-packages Uses the fork of the Escodegen. I'll report that issue.

I changed the way the .po files are being imported from the packages, but it turned out, that this part has to be explicitly declared in the CKEditorWebpackPlugin options. The ckeditor5-core package which is the default translation package must be resolved first to provide translations that other packages may use.

It took quite a lot of time compare to the effects, mainly because of the researching in out of date webpack plugins and webpack documentation. I hope it'll change soon.

Reinmar commented 7 years ago

I changed the way the .po files are being imported from the packages, but it turned out, that this part has to be explicitly declared in the CKEditorWebpackPlugin options. The ckeditor5-core package which is the default translation package must be resolved first to provide translations that other packages may use.

How does the process look like? Is it split into two parts:

discovering paths to packages,
modifying the files

or not?

If it is, then I don't see a problem in sorting the translations in a way that the core will always be first. Besides, why does it have to be first in the first place? There should be no conflicts between translations provided by packages anyway.

Reinmar commented 7 years ago

After a week I improved the t() calls find and replace using the Acorn parser and modified version of Escodegen generator (which master do not use latest version of Estraverse and throws errors during export default declaration parsing). So for now ckeditor5-dev-packages Uses the fork of the Escodegen. I'll report that issue.

How's that possible that Escodegen isn't supporting all ES6 yet? How's Webpack working with ES6? Is it always pre-transpiling the modules to CJS or some custom format?

ma2ciek commented 7 years ago

How does the process look like? Is it split into two parts: discovering paths to packages, modifying the files or not?

The process isn't split. For each js file in ckeditor5-* the po file of the package is being added (if wasn't added earlier) and then the file is being translated. So the package can't use ckiedtor5-core translations which may be defined later.

Reinmar commented 7 years ago

The process isn't split.

Can't we split it? ;< Or somehow locate and add ckeditor5-core automatically?

ma2ciek commented 7 years ago

We can add manually ckeditor5-core. But we have to know where the package physically is. So this should be written explicitly in CKEditorWebpackPlugin options.

ma2ciek commented 7 years ago

Can't we split it?

The loader is used just after resolver for each file. So this part can't be separate if we don't know paths to the po files in the packages first.

ckeditor / ckeditor5

Implement translation service (i18n) #387

Process

Implementation

`ckeditor5-core/src/editor~Editor#constructor`

`ckeditor5-*/lang/contexts.json`

`ckeditor5-utils/src/locale`

`ckeditor5-utils/src/translation-service`

PS. PO file generation