Use AST for extraction - Githubissues

TLDR of changes:

Use AST to extract translations, instead of xgettext-template/xgettext
Generate POT file from combined extracted translations from js & hbs instead of separately

TLDR of improvements:

Massively improved extraction time - in a big example app, extraction time was down from ~27 minutes to 8 seconds!
Extraction from template literals now works - e.g. l10n.t(`translate this`)
Extraction will now throw on invalid syntax, instead of silently ignoring it
Since extraction for .hbs & .js is now combined, it will correctly keep comments for both file types for a single translation
We now also keep column position of translations, in addition to file name & row

Improved extraction

We used to use xgettext-template & xgettext to extract translations. While this worked, it was very slow - as we needed to do a lot of file read & write operations as well as executing shell commands.

Instead, this PR manually extracts translations through AST (Abstract Syntax Trees). These allow us to parse the code and find translations, while maintaining context of where/how they are used.

This has two major benefits: First, it is massively faster, since we now only use 2 file write operations, no matter how big the app. This leads to extraction time reduction from possibly thousands of seconds to a few seconds.

The other benefit is that the AST gives us more information about where and how code is executed, making is possible to write nice validation for our translation invocations.

I added a bunch of tests, trying to cover as much ground as possible to make sure everything still works as expected. As far as I could tell, it should handle all cases. But we should probably still put this in an upcoming major release, as it is a pretty big change with potential side effects.

Note that we now also save the column where a string was used in the .pot file - since the AST gives us that information anyhow, I figured we might as well also store it. I guess this could help with syncing changes back as well.

Additionally, the order of the extracted messages is now defined by: first file name, then line number. This might lead to a slightly different ordering of extracted messages, but it should be stable with every consecutive extraction.

Also, since we now collect all translations for .js & .hbs files into one " bucket", we can now also show occurrences across these file types. Where previously, if a string would occur in both a .js as well as a .hbs file, we would end up with a messages.pot file like this:

#: tests/dummy/app/services/l10n.js:7:12
#: tests/unit/services/l10n-test.js:169:6
msgid "en"
msgstr ""

Here, the .hbs file comments are not correctly merged together. With the new extraction, we now get a correct list:

#: tests/dummy/app/services/l10n.js:7:12
#: tests/dummy/app/templates/test-page.hbs:19:0
#: tests/unit/services/l10n-test.js:169:6
msgid "en"
msgstr ""

HBS extraction

We use the Glimmer VM to generate an AST of a .hbs file, and parse the translation strings out from there.

JS extraction

We use esprima to do AST extraction for the JavaScript modules.

Note that since we now use AST based extraction, it will only extract member functions. This means that the following will NOT be extracted anymore:

t('text'); // <-- not extracted
this.t('text'); // <-- extracted

Validation

Previously, the extractor would simply ignore invalid gettext syntax. While this was very error tolerant, it meant that you might never even know that certain parts of your application are not correctly translated.

To improve this, we now instead throw errors during extraction time. Here are some things that we validate:

HBS validation

{{! translation string needs to be a string }}
{{t myVariable}}
{{t (concat 'test' 'other')}}

{{! translation helpers need to have correct number or arguments }}
{{t }}
{{n 'singular'}}
{{pt 'test'}}
{{pn 'test' 'plural' 3}}

JS validation

// translation string needs to be a string
l10n.t(myVar);
l10n.t(isTrue ? 'yes' : 'no')

// translation methods need to have correct number of arguments
l10n.t();
l10n.n('test');
l10n.pt('test');
l10n.pn('test', 'plural', 3);

// While template literals are allowed, they may not contain placeholders
l10n.t(`translate this to ${ isTrue ? 'yes' : 'no' }`);

(Breaking) changes

This should mostly be backwards compatible. The runtime itself has no changes - all changes only affect running ember l10n:extract.

While the extraction itself should cover everything the same way it did before, it is a quite massive change, and could theoretically have unintended side effects. I added a whole bunch of tests to try to cover everything, but it's hard to be 100% sure.

I also ran the new extraction on a pretty big app of ours, and everything seemed to work just fine. I did find some validation related issues that I had to resolve to continue extraction, but for me that's not really "breaking", since it only actually shows invalid implementation code anyhow, that previously was just silently ignored.

Also, the order of messages in your .pot files will change - but this should not affect any code, and the order should be considered volatile anyhow.

The one really potentially "breaking" change is that we stop extracting plain t() function, and only parse member functions anymore:

t('text'); // <-- not extracted
this.t('text'); // <-- extracted

If you previously relied on this (somewhat hacky) behavior to force the parser to pick up translations, you will need to change that code to use a member function instead.

Cropster / ember-l10n

Use AST for extraction #43