Closed SlexAxton closed 5 years ago
Hi Alex, it would be great to collaborate on this and other i18n stuff.
We evaluated whether we could use your MessageFormat lib, but realized what we wanted to do would mean it would be a full rewrite. I did however start with your PEG parser, but made some changes to create a more descriptive and expressive AST. We also wanted the parser to work stand-alone, as its own package.
It would be great to begin a collaboration around the parser: https://github.com/yahoo/intl-messageformat-parser
As for the message format runtime, we first reviewed the strawman proposals for message formatting and your MessageFormat lib. We knew we wanted an API that had a similar shape to the other Intl APIs in ECMA-402. And we wanted the library to be optimized for repeated calls to format() (as you did with your library).
Beyond gender selection and pluralization, we also wanted to have built in Number and DateTime formatting happening within messages so we're better aligned with the ICU message pattern "spec". I noticed that you had a TODO for leveraging NumberFormat — this is something we needed since the current i18n tools we are using at Yahoo already format messages assuming this feature.
Our experience helping developers use Handlebars led us to realize that the whole precompiled template stuff is onerous for people to use or even grok. Therefore we wanted to move away from that and simply have the library work with message strings which are parsed, once, at runtime, so that the IntlMessageFormat instance can be reused for making repeated calls to its format() function. We also needed to make sure that the library could fully operate in a CSP-restricted environment and not use direct or indirect eval.
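A minimal sketch of the "parse once at runtime, then reuse the instance" shape described above, with no eval or new Function involved. SimpleMessage is a hypothetical stand-in for illustration, not the IntlMessageFormat implementation or API, and it only understands plain {name} placeholders:

```javascript
// Hypothetical stand-in for illustration only; NOT the IntlMessageFormat API.
class SimpleMessage {
  constructor(pattern) {
    // Parse once: splitting on a capturing group yields literal text at
    // even indices and argument names at odd indices.
    this.parts = pattern.split(/\{(\w+)\}/);
  }

  format(values = {}) {
    // No eval / new Function: just walk the pre-parsed parts.
    return this.parts
      .map((part, i) => (i % 2 === 1 ? String(values[part]) : part))
      .join('');
  }
}

// The same instance is reused for repeated format() calls.
const greeting = new SimpleMessage('Hello, {name}!');
console.log(greeting.format({ name: 'Eric' }));   // Hello, Eric!
console.log(greeting.format({ name: 'Caridy' })); // Hello, Caridy!
```

The parsing cost is paid in the constructor, so repeated format() calls only do a map and a join.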
We also realized that we wanted to build more things on top of message format, like relative time, and also integrate with template/component libs like Handlebars and React — since this is the place where developers actually need to output formatted strings. In order to accomplish this sort of bundling up of libs that work on the client and server, and to be future focused, we wrote the source of the library (and our other i18n libs) as ES6 Modules.
That's the gist of our thinking behind why we wanted to implement a message format library instead of suggesting a bunch of major changes to your existing library. That said, we do have an end goal, which is that we want message formatting to eventually become a part of ECMAScript. It would be great to have the two libraries share common pieces and ideas, and also collaborate with you on moving message formatting to JavaScript itself.
@SlexAxton shame on me for not doing the proper follow up on the thread that we started via email back in January, I should have tried harder to align on these efforts. But I guess we were just trying to figure things out, now we know better, now we can take the proper steps.
As @ericf mentioned, we can start with the parser, I think that's the most important part, then we can jump into the CLDR data compilation and consumption part, which can be shared between the two components. Since these two pieces are not going to affect the end users of the libraries, we can get them in sync, while only the sugar API and the basic logic to apply the CLDR rules will be the custom parts.
The other side of the coin, as @ericf mentioned, is to push hard to get some of this into ECMA 402. We need all the help we can get :)
(This is a scatterbrained top-post, but hopefully it's not too hard to follow)
Yea, parser definitely makes sense as a good starting point. It looks pretty much the same with a few changes to character nodes, so I could probably merge changes with little to no trouble, and we can work from the same core.
Re: ECMA 402 - Totally agree this is the direction to go. I totally would have done this from the get-go, but the messageformatty stuff in the ecma globalization spec didn't really exist when I was building messageformat.js
I think the precompile steps and concerns that I've placed in my library, are a good enough reason to keep two separate run-times, but my gut is that they should be almost entirely interchangeable as far as input goes.
I'm on the verge of adding the newer ICU MessageFormat stuff with things like selectordinal and ListFormatter. I'd love it if our collaboration allowed that to be easily added to your lib as well.
Re: data compilation and consumption, my gut is that we should hop on to something like https://github.com/rxaviers/cldrjs .
<tangent>
You guys have any idea how the @decimals stuff works in the plurals info for CLDR? I'd love to get that fixed but I can't believe that it's as dumb as I currently understand it.
en: { 'other': '... @decimal 0.0~1.5' }
As far as I know, that means that 0.0, 0.1 to 1.4, 1.5 all hit the other keyword, but specifically that exact amount of decimals (aka only those 15 specific numbers). That seems so hopelessly useless, that I can't believe I'm understanding it correctly.
I'd love to get decimal support in there.
</tangent>
re: numberformat/dateformat. I think this is a good idea, but the ICU peeps seem to think you should format this stuff prior to sending it into the string. I hate to think I know more than them, but my gut is that we should still do it. My blocker to date has been coming up with a future looking syntax for adding this stuff. You seem to imply that ECMA402 talks about this a bit, so I'll read that again.
Lastly, I agree that integration with templating and module libraries is key, I don't know if there are as many places to collaborate here, but I figured I'd explain some of the fun ways I've done these things to see if any are interesting enough. I started work on Handlebars integration that I think is really nice (I have a not as cool version in production at Bazaarvoice, though that one 'worked' as they say.): https://github.com/slexaxton/mfbars
Essentially, people already compile their templates, so I just piggy back on that phase. Write your messages directly into your templates, and we can pull them out easily enough and generate data files for translation vendors. And we can sub in the translated messages just as easily.
You still end up needing some messages directly in your code. I made a require.js plugin for this internally as well that's super simple, but that's not really a global solution.
require(['mf!somejsonfile'], function(mfpack){ alert(mfpack.key({data:1})); });
It would compile on-the-fly in dev mode, and precompile to modules in built mode.
My gut is that I should see if I can't get your branch of my peg to work directly with messageformat.js as a first step.
And then maybe a good next step would be setting up a JS MessageFormat Dialect Standard/Spec that we both follow. Ideally it would stay as close as possible to the Java and C implementations (at least be fully compatible with that subset), but then standardize more things like formatters, and how to handle blank keys and undefined keys, etc.
Once we could come up with a spec that we all liked for the additions, we could work on adding them to our shared parser and create a test suite (in the nature of the Promises/A+ suite) that all implementations should strive to pass.
I think if there's one big win here, it's getting enough mind-share to make an actual specification so there can be tooling interoperability, rather than the current world of looking at the docs for the Java library, and hoping they're clear. (seriously, how is there no formal specification for this stuff?)
Thoughts? Additions? Concerns?
:thumbsup: (wanted to let you know I read through what you wrote, seems like we're all on the same page! I'll respond later with some more thoughts.)
> re: numberformat/dateformat. I think this is a good idea, but the ICU peeps seem to think you should format this stuff prior to sending it into the string. I hate to think I know more than them, but my gut is that we should still do it. My blocker to date has been coming up with a future looking syntax for adding this stuff. You seem to imply that ECMA402 talks about this a bit, so I'll read that again.
<tangent>
we have had a lot of discussions around this topic, including the usage at the higher level (template level), and something struck me while doing that: why should I format the value if I don't know whether the value will be used or not? basically, consider something like this:
COMMENTS: |
{num, plural,
=0 {no comments}
=1 {one comment}
other {# comments}
}
When num is 1000, for example, in English we should get 1,000 comments, but in French we should get 1 000 comments.
The problems with pre-formatting that numeric value are:
1. you need the raw value to apply the plural logic, but also the pre-formatted value for presentation.
2. the pre-formatted value is probably not going to be used if num is lower than 2.
these two problems add a lot of complexity that we don't want/need. we are up for this battle with the ICU folks.
on the other hand, some people think this is crazy, and not needed, but we do have a perfect example of this use case: if you use Yahoo Mail today, you will probably see +999 emails in your inbox as a workaround to avoid formatting the value. LOL
</tangent>
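To make the point concrete, here is a rough sketch of the COMMENTS message where plural selection uses the raw number and the locale-aware formatting only happens in the branch that actually displays it. formatComments is a hypothetical helper (not part of any library mentioned here), the strings stay in English as in the example above, and the exact group separator depends on the locale data available to the runtime's Intl.NumberFormat:

```javascript
// Hypothetical helper mirroring the COMMENTS message above.
function formatComments(num, locale) {
  // Plural selection works on the raw number...
  if (num === 0) return 'no comments';
  if (num === 1) return 'one comment';
  // ...and the number is formatted only where it is displayed.
  return `${new Intl.NumberFormat(locale).format(num)} comments`;
}

console.log(formatComments(0, 'en'));
console.log(formatComments(1, 'en'));
console.log(formatComments(1000, 'en')); // 1,000 comments
console.log(formatComments(1000, 'fr')); // separator comes from the runtime's fr locale data
```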
Haha. I'm totally on board for doing it, just wanna get it right. I re-read my sentence and realized it was ambiguous:
> I hate to think I know more than them, but my gut is that we should still do it.
To be clear, my gut is that we should add the formatting stuff directly in.
> You guys have any idea how the @decimals stuff works in the plurals info for CLDR? I'd love to get that fixed but I can't believe that it's as dumb as I currently understand it.
> en: { 'other': '... @decimal 0.0~1.5' }
> As far as I know, that means that 0.0, 0.1 to 1.4, 1.5 all hit the other keyword, but specifically that exact amount of decimals (aka only those 15 specific numbers). That seems so hopelessly useless, that I can't believe I'm understanding it correctly.
@lwelti, do we use decimals in any form today?
where did you get 0.0~1.5 ?
(1) is the use case to convert 1450.7 to: 1,450.7 or (2) is the use case to convert from 1,450.7 to literal? like: one thousand four hundred fifty point seven ?
(2) is not totally supported in many languages.
This is from the plurals.json file in the CLDR download. It's for, I think, (3) pluralizing words after a decimal number.
1 dollar
1.25 dollars
correct, the pattern for a lot of locales is:
one: {0} {1}
other: {0} {1}
where {0} will be the formatted currency and {1} the text
except for "arabic" where you have: zero, one, two, few, many.
so for English, if it is 1 then one.
but I am a little bit intrigued by the 0.0~1.5, is that inside the plurals.json file? do you mind pointing me to that file?
because I just downloaded the latest version of the CLDR JSON and I don't see that.
I understand the patterns for integers well. I'm specifically looking at pluralizing decimals.
I uploaded the specific file here: http://slex.me/code/0y191t2b4347
This type of stuff is mostly viewable on their site though: http://www.unicode.org/cldr/charts/latest/supplemental/language_plural_rules.html
I think I see what I'm getting wrong. Those are strictly examples -- not anything useful for an implementation. I guess that leads me to believe that decimals should Just Work™.
I'll see what I can do about testing that assumption on weird languages.
that seems correct :)
those are just examples after the rule
what you care about is the rule, for example in Arabic: n = 0 (zero), n = 1 (one), n = 2 (two), then (other) for the rest...
here is the doc: http://www.unicode.org/reports/tr35/tr35-numbers.html#Samples
"Samples are provided if sample indicator (@integer or @decimal) is present on any rule. (CLDR always provides samples.)"
So specifically on the french rule, it seems like the examples don't line up with the data:
{
  "fr": {
    "pluralRule-count-one": "i = 0,1 @integer 0, 1 @decimal 0.0~1.5",
    "pluralRule-count-other": " @integer 2~17, 100, 1000, 10000, 100000, 1000000, … @decimal 2.0~3.5, 10.0, 100.0, 1000.0, 10000.0, 100000.0, 1000000.0, …"
  }
}
The rule for one is i = 0,1. i is defined as "integer digits of n." (and n is defined as "absolute value of the source number").
How does it then follow that 0.0~1.5 would be in the one grouping?
i = 0,1 means that "one" covers the following possible values: 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9
Hmm maybe something got cut off?
from my side, nop, that was the full post.
which means "i" is 1 or 0, and the number may or may not have a decimal part
Ahhhh. Thank you soo much.
> Integer digits of n.
That was confusing me. It literally means like a Math.floor on the number.
Perfect. Thanks!
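For anyone else tripping over the operands, a small sketch of the fr rule exactly as quoted in this thread (one: i = 0,1; everything else: other). frPluralCategory is a hypothetical helper, not a library function, and it only encodes the rule text quoted here; current CLDR data may differ:

```javascript
// The fr rule as quoted in this thread: one -> i = 0,1; otherwise -> other.
// Hypothetical helper for illustration, not a library function.
function frPluralCategory(source) {
  const n = Math.abs(source); // n: absolute value of the source number
  const i = Math.floor(n);    // i: integer digits of n
  return i === 0 || i === 1 ? 'one' : 'other';
}

// 0, 0.5, 1, 1.5 and 1.9 all have i = 0 or 1, so they select "one";
// 2 and 3.5 select "other".
console.log([0, 0.5, 1, 1.5, 1.9, 2, 3.5].map(frPluralCategory));
```

This is why the @decimal samples 0.0~1.5 sit under "one": only the integer digits take part in the rule.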
take a look in this section: http://www.unicode.org/reports/tr35/tr35-numbers.html#Operands
related to ListFormatter, the implementation is not so complex, I did one in php for internal use, so I removed some specific yahoo methods, but mostly if you can read the separators, etc from CLDR then you are on the other side.
reference: https://github.com/lwelti/i18n_etc/tree/master/lists
Yep. Both ListFormatter and selectordinal seem pretty easy to do, just wanna make sure we do em the same, and that it's roughly equivalent to what ICU4J does.
well ordinal not so easy because not all the languages are covered. for example in German they won't say 1st, 2nd, they will say it literal: erste, zweite, or in Spanish they may have: 2º and of course let's not forget singular and plural.
Related to List Format, the example I did on php, covers two cases: First Case: Alex, Caridy and Eric like soccer. Second Case: Alex and 2 people like soccer.
one is a list of items with a separator and the second case is 1 item and number of rest of elements.
> for example in German they won't say 1st, 2nd, they will say it literal: erste, zweite, or in Spanish they may have: 2º
It's my understanding that in MessageFormat this would be covered with literals (just like a lot of none languages)
You are the {rank, selectordinal, =1{first} =2{second} one{#st} two{#nd}} person in the list!
I doubt that @ericf likes football (the real one, not the american thing...) LOL <trolling>
The second case would seem to be covered under the offset extension of PluralFormat nicely:
{num_people, plural, offset:1
=0 {No one likes soccer.}
=1 {Just {first_person} likes soccer.}
one {{first_person} and one other person like soccer.}
other {{first_person} and # other people like soccer.}
}
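A hand-written sketch of what that message evaluates to for English, to show how offset behaves: in ICU PluralFormat the explicit =N cases are matched against the value before the offset is subtracted, while the keyword cases (one, other) and # use the value minus the offset. likesSoccer is a hypothetical function written for illustration, not output from either library:

```javascript
// Hand-expanded equivalent of the offset:1 message above, English only.
// Hypothetical function for illustration.
function likesSoccer(numPeople, firstPerson) {
  const offset = 1;
  // Explicit =N cases compare against the raw value (before the offset).
  if (numPeople === 0) return 'No one likes soccer.';
  if (numPeople === 1) return `Just ${firstPerson} likes soccer.`;
  // Keyword cases and # use the value minus the offset.
  const others = numPeople - offset;
  return others === 1
    ? `${firstPerson} and one other person like soccer.`
    : `${firstPerson} and ${others} other people like soccer.`;
}

console.log(likesSoccer(0, 'Alex')); // No one likes soccer.
console.log(likesSoccer(1, 'Alex')); // Just Alex likes soccer.
console.log(likesSoccer(2, 'Alex')); // Alex and one other person like soccer.
console.log(likesSoccer(3, 'Alex')); // Alex and 2 other people like soccer.
```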
Jumping back into this thread. I recently had to set up a new internationalizable app in Ember, and it made so much sense to just use ember-intl. It's got date and number/currency stuff in there as well as message formatting. I've had a pretty good go at it, and it's working well. Thanks!
That said, I still kind of wanted to use the messageformat.js implementation of MessageFormat for some of our newer features/speed/precompilation-ability/i-wrote-it-so-it-feels-weird. I of course didn't, because that'd be ridiculous, but it made me wonder if you folks were interested in consuming it downstream?
We're happy to allow a relicense/sign-something to make yahoo-legal folks happy, if that's a concern. We are gaining tests on edge-cases and added explicit support for selectordinal/decimals/shorthand since our two implementations have existed. We've got a good system for pulling in latest CLDR data automatically with make-plural. And we've got the beginnings of tools to help with translation (and would like to focus on building more things in this direction). We're also happy to rebase on top of the messageformat-parser stuff (but would probably require some updates for our newest features).
I just always dislike situations when there are multiple good implementations of the same thing with slight variances in support/updates/compatibility. If there's any way we can join forces, I think it'd be better for everyone, as well as making sure the 402 standardization process has maximum implementation testers. I know globalize.js is pulling us in to do a similar thing, so we're happy to play more of a dependency role like this (even if intl-messageformat still explicitly existed, pulled in messageformat as a dependency, modified the API in the way that was best, and shipped the wrapper).
What would you guys need to see from messageformat.js to make this desirable for format (I know precompiling was a turn-off, for instance)? Totally cool if you'd rather keep on with the current setup, I just figured I'd throw this out for collaboration's sake. Also happy to talk more privately about it over email or something if github issues is not ideal.
> Also happy to talk more privately about it over email or something if github issues is not ideal.
I think we should just have a higher bandwidth conversation in person or over video to discuss this, and then we can put the notes back into this thread. Let's try to get something going for this week since @caridy will be going on a long vacation soon.
Happy to do so: alexsexton@gmail.com - free most days during the day, with rare exceptions. Send me an invite for something that works for ya'll?
I was just comparing the 2 libs to figure out which one to use in my upcoming (node) project. I'm SO GLAD to see that there is a desire to collaborate and that this thread is still active (even tho it started about 7 months ago).
Really looking forward to this :tada: :smile:
Any news on the collaboration?
@glen-84 the focus is collaborating at the standards level. Recently we've been working to get the lower level pieces in place like exposing abstract locale operations, plural rules, and formatToParts: https://github.com/tc39/ecma402
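For reference, a short sketch of two of those lower-level pieces as they are exposed in current engines, Intl.PluralRules and formatToParts. This assumes a runtime that ships en-US locale data:

```javascript
// Plural category selection (cardinal and ordinal) via Intl.PluralRules.
const cardinal = new Intl.PluralRules('en-US');
const ordinal = new Intl.PluralRules('en-US', { type: 'ordinal' });

console.log(cardinal.select(1), cardinal.select(2)); // one other
console.log([1, 2, 3, 4].map((n) => ordinal.select(n))); // [ 'one', 'two', 'few', 'other' ]

// formatToParts exposes the formatted pieces instead of one opaque string.
const parts = new Intl.NumberFormat('en-US').formatToParts(1000);
console.log(parts.map((p) => `${p.type}:${p.value}`)); // [ 'integer:1', 'group:,', 'integer:000' ]
```

The ordinal categories are what a selectordinal message would branch on.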
That's cool. It'll be great to have these APIs available in browsers (and in Node).
> I'm SO GLAD to see that there is a desire to collaborate and that this thread is still active (even tho it started about 7 months ago).
I'm glad to see collaboration happening too! I'm about to begin an Ember.js project and plan on using ember-intl. One of the concerns my team has with using ember-intl is a missing precompile step at build-time. Without a precompile step, including the compiler increases the filesizes significantly.
Please reconsider the decision to leave out a precompiler.
Closing as stale.
Seems like we're looking for the same ends. Hadn't realized that this was based off of my peg, which makes it even easier to collab.
What are the blockers?