formatjs / intl-messageformat

[MIGRATED] Format a string with placeholders, including plural and select support to create localized messages.
http://formatjs.io/
Other
530 stars 77 forks source link

How should unmatched tokens be handled #22

Closed ericf closed 10 years ago

ericf commented 10 years ago

When processing tokens, what happens to unmatched ones? Also what happens to tokens like ${foo bar} (garbage)?

apipkin commented 10 years ago

In my opinion, legit tokens (${key} or {key}) should throw an error and halt operation. Garbage tokens (${k ey} or ${k\ney}) should be left in the message as is because there is no legit way rule out an actual ${ } use in the pattern outside of a token to identify a garbage token.

ericf commented 10 years ago

I think we should start by researching how this is handled elsewhere, in other programming languages, or if there's a ICU spec. that explains how these ought to be handled.

Once we have all the data, then I think we can form a better opinion about what to do in this package when these exceptional cases are encountered.

lwelti commented 10 years ago

on the php MessageFormat is an exact match. if you have: "Welcome to {CITY}, {STATE}"

then you need to pass: array("CITY"=>"Sunnyvale", "STATE"=>"California");

having: "Welcome to {City}, {STA TE}" will FAIL, it returns FALSE

fails because you are passing "CITY" instead of "City", and "STATE" instead of "STA TE"

apipkin commented 10 years ago

But is {STA TE} considered a valid token to be matched?

lwelti commented 10 years ago

no it is not considered, needs to be an exact match.

apipkin commented 10 years ago

No no, I mean does {STA TE} create a token to be looked for in the object. That being:

var msg = new IntlMessageFormat("${STA TE}");
var m = msg.format({ "STA TE": "California" });

does m look like "California" or "${STA TE}".

If STA TE is a valid token, it should return "California". If it is not a valid token it is, it should return "${STA TE}".

I do not feel a token with white space should be considered a valid token. That's my 2¢.

lwelti commented 10 years ago

token with spaces are not considered valid in the php implementation.

apipkin commented 10 years ago

Based on what I could find:

C/C++/Java all use an Array of items with tokens as {0, type, options} where the first option is the index in the Array. If the index isn't in the array (token is not found) the application fails, either gracefully or not.

PHP uses a similar approach but returns false.

PHP also doesn't not identify with the associative array aspect (where the token is the key for the value in the array) as we are doing with Objects in JavaScript ({token: value}). There is however a bug reported (and a fixed merged in) https://bugs.php.net/bug.php?id=61871 that allows for associative arrays to be used. This too returns false when a token is either invalid or not found.

With taking the associative array/keyed array/hash table/object approach, providing users the ability to have complex tokens would not be worth it as it would greatly diverge from indexed array approach.

That being said:

lwelti commented 10 years ago

do you mean that if we have: {STA1TE} is the biggest state of USA.

won't work? in php it works. but I don't mind limiting to [a-zA-Z] if that is a Javascript limitation.

lwelti commented 10 years ago

for the unmatched I will say return false.

apipkin commented 10 years ago

There are two different states: constructor and format

constructor

var msgA = new IntlMessageFormat("{STATE}");
// returns new object 
// msgA.pattern === [{ valueName: 'STATE'  }]

var msgB = new IntlMessageFormat("{ST ATE}"); 
// returns new object 
// msgB.pattern === ["{ST ATE}"]

var msgC = new IntlMessageFormat("{ST1ATE}"); 
// returns new object 
// msgC.pattern === [{ valueName: 'ST1ATE' }]

Given these objects...

format()

// msgA
msgA.format(); // throws error
msgA.format({ FOO: 'bar' }); // throws error
msgA.format({ STATE: 'Missouri' }); // "Missouri"
msgA.format({ "ST ATE: 'Missouri' }); // throws error
msgA.format({ ST1ATE: 'Missouri' }); // throws error

// msgB
msgB.format(); // throws error
msgB.format({ FOO: 'bar' }); // "{ST ATE}"
msgB.format({ STATE: 'Missouri' }); // "{ST ATE}"
msgB.format({ "ST ATE: 'Missouri' }); // "{ST ATE}"
msgB.format({ ST1ATE: 'Missouri' }); // "{ST ATE}"

// msgC
msgC.format(); // throws error
msgC.format({ FOO: 'bar' }); // throws error
msgC.format({ STATE: 'Missouri' }); // throws error
msgC.format({ "ST ATE: 'Missouri' }); // throws error
msgC.format({ ST1ATE: 'Missouri' }); // "Missouri"

The same would be the result regardless of the existence of the $

Does this sound right? Is there any other case that should be explored?

lwelti commented 10 years ago

msgA

lwelti commented 10 years ago

msgB and msgC invalid since the constructor. msgA is correct in the constructor and the rest of the examples on the format are throwing what is expected.

apipkin commented 10 years ago

msgB and msgC should be... false? Since the constructor should return false? Or should throw an error?

Are we saying that any non [a-zA-Z] character between { and } should fail the message format creation?

lwelti commented 10 years ago

throw error, my mistake.

we have a lot of cases where we have {0} is the capital of {1} which is the biggest state of {2}

so from my point of view we should support [a-zA-Z0-9_-] even php supports STA$TE , it does not support ! and other ones, so we can limit the support to only those.

apipkin commented 10 years ago

I'm good with that. I'll add tests and adjust the code accordingly. After the pull request is issued, I'll link to this for comments to make sure they are captured.