commerceguys / intl

A PHP internationalization library, powered by CLDR data.
MIT License

The currency symbol/code is not always properly spaced #65

Open bojanz opened 6 years ago

bojanz commented 6 years ago

We currently rely on the pattern to give us the complete layout of the formatted number. However, CLDR has additional rules that specify when a space should be inserted around a currency symbol or code. They look like this:

"currencySpacing": {
  "beforeCurrency": {
    "currencyMatch": "[:^S:]",
    "surroundingMatch": "[:digit:]",
    "insertBetween": " "
  },
  "afterCurrency": {
    "currencyMatch": "[:^S:]",
    "surroundingMatch": "[:digit:]",
    "insertBetween": " "
  }
},

Yes, that's quite confusing, which is why I missed it previously.

This looks like a good opportunity to compare our formatting logic with ICU4J's.

Relevant links:
https://github.com/angular/angular/issues/20708
https://github.com/andyearnshaw/Intl.js/issues/221

bojanz commented 4 years ago

I analyzed the dataset. All number formats have the same currencySpacing data. That means we can avoid parsing it, and just implement the relevant logic directly in the number formatter.

Note also that the beforeCurrency and afterCurrency rules are identical. What remains is:

  "currencyMatch": "[:^S:]",
  "surroundingMatch": "[:digit:]",
  "insertBetween": " "

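Translated into English, that is "By default a space is automatically added between letters in a currency symbol and adjacent numbers." Strictly speaking, [:^S:] matches any character outside the Unicode Symbol category (not only letters), and [:digit:] matches decimal digits.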

Quoting https://unicode.org/reports/tr35/tr35-numbers.html for a source:

This element controls whether additional characters are inserted on the boundary between the symbol and the pattern. For example, with the above currencySpacing, inserting the symbol "US$" into the pattern "#,##0.00¤" would result in an extra no-break space inserted before the symbol, for example, "#,##0.00 US$". The beforeCurrency element governs this case, since we are looking before the "¤" symbol. The currencyMatch is positive, since the "U" in "US$" is at the start of the currency symbol being substituted. The surroundingMatch is positive, since the character just before the "¤" will be a digit. Because these two conditions are true, the insertion is made.

Conversely, look at the pattern "¤#,##0.00" with the symbol "US$". In this case, there is no insertion; the result is simply "US$#,##0.00". The afterCurrency element governs this case, since we are looking after the "¤" symbol. The surroundingMatch is positive, since the character just after the "¤" will be a digit. However, the currencyMatch is not positive, since the "$" in "US$" is at the end of the currency symbol being substituted. So the insertion is not made.

That also gives us a good example for tests.
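
The rule and the two TR35 cases quoted above can be sketched in a few lines. This is only an illustration in Python, not the library's PHP code: the function name insert_currency, its parameters, and the choice of a no-break space as the inserted character are all assumptions made for the example. It also assumes the text contains a single "¤" placeholder, and it approximates [:digit:] and [:^S:] with Unicode general categories.

```python
import unicodedata

PLACEHOLDER = "\u00a4"  # "¤", the generic currency sign used in CLDR patterns
NBSP = "\u00a0"         # assumed value of insertBetween (no-break space)


def _is_digit(ch: str) -> bool:
    # Approximates the surroundingMatch "[:digit:]" (decimal digits, category Nd).
    return unicodedata.category(ch) == "Nd"


def _is_not_symbol(ch: str) -> bool:
    # Approximates the currencyMatch "[:^S:]" (anything outside the S* categories).
    return not unicodedata.category(ch).startswith("S")


def insert_currency(text: str, symbol: str, between: str = NBSP) -> str:
    """Hypothetical helper: replace the first "¤" in text with symbol,
    adding `between` wherever the currencySpacing rule applies."""
    index = text.find(PLACEHOLDER)
    if index == -1 or not symbol:
        return text
    before, after = text[:index], text[index + 1:]

    prefix = suffix = ""
    # beforeCurrency: a digit precedes "¤" and the symbol starts with a non-symbol.
    if before and _is_digit(before[-1]) and _is_not_symbol(symbol[0]):
        prefix = between
    # afterCurrency: a digit follows "¤" and the symbol ends with a non-symbol.
    if after and _is_digit(after[0]) and _is_not_symbol(symbol[-1]):
        suffix = between
    return before + prefix + symbol + suffix + after
```

With this sketch, "1,234.56¤" combined with "US$" gains a no-break space before the symbol, while "¤1,234.56" combined with "US$" is left unspaced because the symbol ends in "$". A pattern that already contains a space next to "¤" is untouched, since the surrounding character is then not a digit.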