d3 / d3-format

Format numbers for human consumption.
https://d3js.org/d3-format
ISC License

Support CLDR JSON #34

Closed thedavidmeister closed 7 years ago

thedavidmeister commented 7 years ago

Just ran across https://github.com/d3/d3-format/issues/21 while trying to solve the same problem (no existing support for Indian number formatting).

I've just used the google closure i18n lib to implement the en-IN locale for everything except my d3 axis so I've already got access to CLDR data in my app and would like to continue using what I already have if possible. This would be a relatively common case as CLDR is part of unicode and the JSON packages are consumed by several existing JS i18n "engines", e.g.:

Any particular reason why d3 has defined its own locale JSON format (other than historical reasons) instead of using the Unicode CLDR system?

From what I can see:

Benefits of CLDR:

Cons of CLDR:

curran commented 7 years ago

For the record, CLDR stands for "Common Locale Data Repository", and this seems to be its official home page http://cldr.unicode.org/

Are you suggesting that d3-format adds https://github.com/unicode-cldr/cldr-json as a dependency? There is no package.json there, also no JSON files that I can see. Here's an interesting segment of their README:

Because the CLDR is so large and contains so many different types of information, the JSON data here is grouped into packages by functionality. For each type of functionality, there are two available packages: The "[modern][]" packages, which contain the set of locales listed as modern coverage targets by the CLDR subcommittee, and the "[full][]" packages, which contain the complete set of locales, including those in the corresponding modern packages. The functional groups are:

  • [cldr-core][] – Basic CLDR supplemental data — only one package here, no "full" and "modern".
  • [cldr-dates][] – Data for date/time formatting, including data for Gregorian calendar. Requires that the corresponding [cldr-numbers][] package be installed as well.
  • cldr-cal-[type] – CLDR data for non-Gregorian calendars. [type] is one of the supported non-Gregorian calendar types in CLDR: [buddhist][], [chinese][], [coptic][], [dangi][], [ethiopic][], [hebrew][], [indian][], [islamic][], [japanese][], [persian][], or [roc][].
  • [cldr-localenames][] – Translated versions of locale display name elements: languages, scripts, territories, and variants.
  • [cldr-misc][] – Other CLDR data not defined elsewhere.
  • [cldr-numbers][] – Data for number formatting.
  • [cldr-rbnf][] – Rule Based Number Formatting data — only one package here, no "full" and "modern".
  • [cldr-segments][] – Line breaking data from Unicode's ULI project
  • [cldr-units][] – Data for units formatting.

Note that the links do not go anywhere.

thedavidmeister commented 7 years ago

@curran CLDR format is the JSON format for defining formatting, collation, etc. for many locales used by Unicode.

So, no I'm not really saying that CLDR is a dependency (it doesn't have to be).

I'm just suggesting that, for each of decimal, thousands, grouping, and currency, d3-format could support the equivalent CLDR config options.

So, for a concrete example, let's say we want to declare how to format numbers in en-AU (because I'm Australian 😉). We know that we want 1000000 to look like "1,000,000" as a string.

We can go to CLDR numbers modern (as opposed to numbers full) https://github.com/unicode-cldr/cldr-numbers-modern/tree/master/main and then find the JSON for en-AU https://github.com/unicode-cldr/cldr-numbers-modern/blob/master/main/en-AU/numbers.json.

It looks like this:

{
  "main": {
    "en-AU": {
      "identity": {
        "version": {
          "_number": "$Revision: 13050 $",
          "_cldrVersion": "30.0.3"
        },
        "language": "en",
        "territory": "AU"
      },
      "numbers": {
        "defaultNumberingSystem": "latn",
        "otherNumberingSystems": {
          "native": "latn"
        },
        "minimumGroupingDigits": "1",
        "symbols-numberSystem-latn": {
          "decimal": ".",
          "group": ",",
          "list": ";",
          "percentSign": "%",
          "plusSign": "+",
          "minusSign": "-",
          "exponential": "e",
          "superscriptingExponent": "×",
          "perMille": "‰",
          "infinity": "∞",
          "nan": "NaN",
          "timeSeparator": ":"
        },
        "decimalFormats-numberSystem-latn": {
          "standard": "#,##0.###",
          "long": {
            "decimalFormat": {
              "1000-count-one": "0 thousand",
              "1000-count-other": "0 thousand",
              "10000-count-one": "00 thousand",
              "10000-count-other": "00 thousand",
              "100000-count-one": "000 thousand",
              "100000-count-other": "000 thousand",
              "1000000-count-one": "0 million",
              "1000000-count-other": "0 million",
              "10000000-count-one": "00 million",
              "10000000-count-other": "00 million",
              "100000000-count-one": "000 million",
              "100000000-count-other": "000 million",
              "1000000000-count-one": "0 billion",
              "1000000000-count-other": "0 billion",
              "10000000000-count-one": "00 billion",
              "10000000000-count-other": "00 billion",
              "100000000000-count-one": "000 billion",
              "100000000000-count-other": "000 billion",
              "1000000000000-count-one": "0 trillion",
              "1000000000000-count-other": "0 trillion",
              "10000000000000-count-one": "00 trillion",
              "10000000000000-count-other": "00 trillion",
              "100000000000000-count-one": "000 trillion",
              "100000000000000-count-other": "000 trillion"
            }
          },
          "short": {
            "decimalFormat": {
              "1000-count-one": "0K",
              "1000-count-other": "0K",
              "10000-count-one": "00K",
              "10000-count-other": "00K",
              "100000-count-one": "000K",
              "100000-count-other": "000K",
              "1000000-count-one": "0M",
              "1000000-count-other": "0M",
              "10000000-count-one": "00M",
              "10000000-count-other": "00M",
              "100000000-count-one": "000M",
              "100000000-count-other": "000M",
              "1000000000-count-one": "0B",
              "1000000000-count-other": "0B",
              "10000000000-count-one": "00B",
              "10000000000-count-other": "00B",
              "100000000000-count-one": "000B",
              "100000000000-count-other": "000B",
              "1000000000000-count-one": "0T",
              "1000000000000-count-other": "0T",
              "10000000000000-count-one": "00T",
              "10000000000000-count-other": "00T",
              "100000000000000-count-one": "000T",
              "100000000000000-count-other": "000T"
            }
          }
        },
        "scientificFormats-numberSystem-latn": {
          "standard": "#E0"
        },
        "percentFormats-numberSystem-latn": {
          "standard": "#,##0%"
        },
        "currencyFormats-numberSystem-latn": {
          "currencySpacing": {
            "beforeCurrency": {
              "currencyMatch": "[:^S:]",
              "surroundingMatch": "[:digit:]",
              "insertBetween": " "
            },
            "afterCurrency": {
              "currencyMatch": "[:^S:]",
              "surroundingMatch": "[:digit:]",
              "insertBetween": " "
            }
          },
          "standard": "¤#,##0.00",
          "accounting": "¤#,##0.00;(¤#,##0.00)",
          "short": {
            "standard": {
              "1000-count-one": "¤0K",
              "1000-count-other": "¤0K",
              "10000-count-one": "¤00K",
              "10000-count-other": "¤00K",
              "100000-count-one": "¤000K",
              "100000-count-other": "¤000K",
              "1000000-count-one": "¤0M",
              "1000000-count-other": "¤0M",
              "10000000-count-one": "¤00M",
              "10000000-count-other": "¤00M",
              "100000000-count-one": "¤000M",
              "100000000-count-other": "¤000M",
              "1000000000-count-one": "¤0B",
              "1000000000-count-other": "¤0B",
              "10000000000-count-one": "¤00B",
              "10000000000-count-other": "¤00B",
              "100000000000-count-one": "¤000B",
              "100000000000-count-other": "¤000B",
              "1000000000000-count-one": "¤0T",
              "1000000000000-count-other": "¤0T",
              "10000000000000-count-one": "¤00T",
              "10000000000000-count-other": "¤00T",
              "100000000000000-count-one": "¤000T",
              "100000000000000-count-other": "¤000T"
            }
          },
          "unitPattern-count-one": "{0} {1}",
          "unitPattern-count-other": "{0} {1}"
        },
        "miscPatterns-numberSystem-latn": {
          "atLeast": "{0}+",
          "range": "{0}–{1}"
        }
      }
    }
  }
}

There isn't actually an en-AU entry in the equivalent place in d3 - https://github.com/d3/d3-format/tree/master/locale but if we refer to the en-GB data for d3 (close enough, right?) we get this https://raw.githubusercontent.com/d3/d3-format/master/locale/en-GB.json:

{
  "decimal": ".",
  "thousands": ",",
  "grouping": [3],
  "currency": ["£", ""]
}

You can see that CLDR JSON has much more info than d3 wants/needs, but that what d3 needs is a subset of the information provided by CLDR JSON.

For apps that are already using CLDR JSON to configure other i18n tools, it would be handy to also use the same data for d3.
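To make the "subset" relationship concrete, here is a minimal sketch of how the CLDR numbers JSON above could be reduced to the four fields d3-format reads. The function names (`cldrToD3Locale`, `groupingFromPattern`, `currencyAffixes`) and the `maxDigits` parameter are hypothetical and exist in neither d3-format nor any CLDR package. The currency symbol is passed in by hand because numbers.json does not contain it. The sketch also assumes d3-format cycles through its grouping array, so a pattern with a distinct secondary group size has that size repeated rather than emitted once.

```javascript
// Sketch only: every function name here is hypothetical, not a d3 or CLDR API.

// Count the digit placeholders (# or 0) in one comma-separated pattern segment.
function placeholderCount(segment) {
  return segment.replace(/[^#0]/g, "").length;
}

// Derive a d3-format style `grouping` array from a CLDR pattern like "#,##0.###".
// Assumption: d3-format cycles through the array, so a secondary group size
// is repeated until `maxDigits` integer digits are covered.
function groupingFromPattern(pattern, maxDigits = 21) {
  const integerPart = pattern.split(";")[0].split(".")[0];
  const segments = integerPart.split(",");
  if (segments.length < 2) return undefined; // pattern declares no grouping
  const primary = placeholderCount(segments[segments.length - 1]);
  const secondary = segments.length > 2
    ? placeholderCount(segments[segments.length - 2])
    : primary;
  if (secondary === primary) return [primary];
  const grouping = [primary];
  for (let covered = primary; covered < maxDigits; covered += secondary) {
    grouping.push(secondary);
  }
  return grouping;
}

// Split a CLDR currency pattern such as "¤#,##0.00" into [prefix, suffix].
function currencyAffixes(pattern, symbol) {
  const positive = pattern.split(";")[0];
  const first = positive.search(/[#0]/);
  const last = Math.max(positive.lastIndexOf("#"), positive.lastIndexOf("0"));
  return [
    positive.slice(0, first).replace("¤", symbol),
    positive.slice(last + 1).replace("¤", symbol),
  ];
}

// Reduce a CLDR numbers.json document to a d3-format locale definition.
function cldrToD3Locale(cldr, tag, currencySymbol) {
  const numbers = cldr.main[tag].numbers;
  const system = numbers.defaultNumberingSystem;
  const symbols = numbers[`symbols-numberSystem-${system}`];
  const decimalPattern = numbers[`decimalFormats-numberSystem-${system}`].standard;
  const currencyPattern = numbers[`currencyFormats-numberSystem-${system}`].standard;
  return {
    decimal: symbols.decimal,
    thousands: symbols.group,
    grouping: groupingFromPattern(decimalPattern),
    currency: currencyAffixes(currencyPattern, currencySymbol),
  };
}

// Trimmed-down copy of the en-AU data quoted above.
const enAU = {
  main: {
    "en-AU": {
      numbers: {
        defaultNumberingSystem: "latn",
        "symbols-numberSystem-latn": { decimal: ".", group: "," },
        "decimalFormats-numberSystem-latn": { standard: "#,##0.###" },
        "currencyFormats-numberSystem-latn": { standard: "¤#,##0.00" },
      },
    },
  },
};

console.log(cldrToD3Locale(enAU, "en-AU", "$"));
```

Everything else in the CLDR file (compact "short" and "long" forms, percent and scientific patterns, currency spacing rules) is simply dropped by this sketch, which is the lossy part of any such conversion.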

thedavidmeister commented 7 years ago

also, from https://github.com/unicode-cldr/cldr-json#cldr-json

Installation

Installation using NPM:

$ npm install <package-name> , where <package-name> is one of the package names mentioned above, for example:

$ npm install cldr-dates-full

Installation using bower:

$ bower install <package-name> , where <package-name> is one of the package names mentioned above, for example:

$ bower install cldr-dates-full

thedavidmeister commented 7 years ago

hypothetically, if d3-format was CLDR compatible, then the solution to #21 (just as an example) would be to simply pull https://github.com/unicode-cldr/cldr-numbers-modern/blob/master/main/en-IN/numbers.json into the d3-format repo somehow and use it as-is

curran commented 7 years ago

Ah I see what you mean. Thanks for the clarification. So this would require modification of d3-format to be an "engine" or "compiler" of sorts for the CLDR specification.

There already seem to be a number of implementations available that do exactly that:

You mentioned that you're already using CLDR for everything except D3 axes. It should be possible to adopt one of the above libraries, and pass their formatting function into axis.tickFormat. Would that solve your use case?

thedavidmeister commented 7 years ago

@curran yes, i personally ended up using tickFormat to wrap the google closure lib's number formatting.
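As a rough illustration of that tickFormat approach, the sketch below builds a tick formatter from the built-in `Intl.NumberFormat` rather than the Google Closure library used in the thread. This is a swapped-in technique chosen so the example is self-contained. The names `enIN`, `formatTick`, and `scale` are hypothetical, and the d3 call appears only in a comment because it needs d3-axis and a real scale.

```javascript
// Sketch only: Intl.NumberFormat stands in for the Closure i18n formatter.
// Output depends on the ICU locale data shipped with the JavaScript runtime.
const enIN = new Intl.NumberFormat("en-IN", { maximumFractionDigits: 2 });

// A tick formatter is just a function from a number to a string.
const formatTick = (value) => enIN.format(value);

// With d3-axis loaded and a hypothetical `scale` defined, the formatter
// would be passed to the axis like this (not executed here):
//   const axis = d3.axisLeft(scale).tickFormat(formatTick);

console.log([0, 250000, 1000000].map(formatTick));
```

Because the axis only ever calls the function, any CLDR-backed library that exposes a number-to-string formatter could be wired in the same way.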

yes, i listed existing implementations of formatting/parsing as a potential way forward in this area if it was interesting to the d3 team.

i suppose the question is, why not just use an existing CLDR implementation and migrate away from the d3 i18n code in some future release?

there would be multiple benefits to this long term, as I listed earlier 😄

mbostock commented 7 years ago

You’re welcome to use an alternative number or date formatting library instead of d3-format and d3-time-format; that’s one of the goals of D3’s module system introduced in 4.0.

I don’t think it makes sense to have this library read CLDR JSON directly. CLDR is more expressive than the limited configuration supported by this library, and adding full support for CLDR features would replicate the working of existing CLDR libraries. Wouldn’t it make more sense to just use a library intended to consume CLDR, like jQuery globalize or moment-cldr?

That said, here are two approaches that would be reasonable:

  1. Writing a script that automatically converts CLDR JSON to the subset that d3-format and d3-time-format supports, reviewing these new locale definitions, and replacing (& extending) the current locale definitions in d3-format and d3-time-format.

  2. Making a d3-format-cldr and/or d3-time-format-cldr plugin that automatically converts CLDR JSON to the respective locale definition for d3-format and d3-time-format. (This is the same as approach 1, but it’s done on-the-fly and only benefits people who use these plugins.)

With either approach I expect there will be several difficult decisions regarding how to represent the more expressive CLDR format in terms that can be understood by d3-format and d3-time-format. Thus it would be important to document explicitly what is lost in the conversion. It could also be reasonable to propose specific new features to d3-format and d3-time-format based on what CLDR supports, but we should evaluate those on a case-by-case basis rather than attempting to replicate all of CLDR’s features.

mbostock commented 7 years ago

I’ve also opened #35 to fix #21 by adding a locale definition for en-IN. Given that we don’t support decimal format for numbers greater than 1e21 anyway (see #24), I believe this is a reasonable locale definition.
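The contents of the definition in #35 are not shown in this thread, so the following is only a sketch of why the 1e21 limit matters for a grouping array. It assumes group sizes are applied from the right and cycled when the array runs out, which is how the d3-format README describes the `grouping` field. `groupDigits` is a hypothetical helper written for this illustration and is not d3-format's own code.

```javascript
// Sketch only: `groupDigits` is a hypothetical helper, not d3-format source.
// It applies group sizes from the right and cycles through the array.
function groupDigits(digits, grouping, separator) {
  const parts = [];
  let end = digits.length;
  let j = 0;
  while (end > 0) {
    const start = Math.max(0, end - grouping[j]);
    parts.push(digits.slice(start, end));
    end = start;
    j = (j + 1) % grouping.length; // wrap around when the array is exhausted
  }
  return parts.reverse().join(separator);
}

const value = "1000000000"; // ten digits

// A bare [3, 2] cycles back to 3, which is wrong for Indian grouping.
console.log(groupDigits(value, [3, 2], ",")); // "10,000,00,000"

// One 3 followed by nine 2s covers 21 digits before the cycle restarts.
const indian = [3, 2, 2, 2, 2, 2, 2, 2, 2, 2];
console.log(groupDigits(value, indian, ",")); // "1,00,00,00,000"
```

Under that cycling assumption, a finite array is enough: anything below 1e21 has at most 21 integer digits, so the cycle never wraps for values that can be shown in decimal notation.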

thedavidmeister commented 7 years ago

@mbostock

You’re welcome to use an alternative number or date formatting library instead of d3-format and d3-time-format; that’s one of the goals of D3’s module system introduced in 4.0.

Yup, it's awesome that I could provide my own formatter here. I do appreciate that 😄

I don’t think it makes sense to have this library read CLDR JSON directly.

What is the reasoning behind d3-format if not formatting (and maybe parsing #20 ) things into strings in an i18n friendly way? If that is the goal, CLDR support makes perfect sense to me as it is the world's most comprehensive and standards compliant repository of l10n formatting patterns.

en-IN isn't the only problem, it's just an example of one case where things can get tricky without a more expressive DSL. I believe that scaling from around 20-30 languages/locales to the ~800 locales that CLDR supports would quickly reveal more edge cases not currently covered by the d3 config/format system (e.g. i note there's a slot for formatting percentages and percent signs in CLDR that seems totally relevant to d3).

This isn't even really touching on other i18n issues that must affect d3 but aren't really the domain of d3-format, e.g. how to collate values to correctly provide an ordered list for an axis.

Maybe this is actually a discussion for d3 itself rather than d3-format?

as you said in #21

I suppose it’d be nice to allow more arbitrary repeating sequences but without any other examples to go on, it’s hard to generalize. Trying to keep things simple.

that's all the CLDR format is supposed to be, a set of simple formats that are generalized enough to cover all locales, with tons of examples.