formatjs / formatjs-old

The monorepo home to all of the FormatJS related libraries.
https://formatjs.io/

CLDR plural rule validation #175

Closed nbouvrette closed 5 years ago

nbouvrette commented 5 years ago

Which package? https://github.com/formatjs/formatjs/tree/master/packages/intl-messageformat-parser

Is your feature request related to a problem? Please describe. Under the CLDR rules, an ICU plural (cardinal) message can declare categories that a given language never selects.

Steps to reproduce the behavior:

Run this code:

import {parse} from 'intl-messageformat-parser';

// Sample English string
const testString = 'You have {count, plural, one {# hot dog} two {# hot dogs} many {# hot dogs} other {# hot dogs}} in your lunch bag.';
const ast = parse(testString);
// Inspect the parsed AST (the original snippet printed the input string instead).
console.dir(ast);

The message parses without error even though the two and many plural forms can never be selected: the English CLDR rules only require one and other.
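One way to confirm which categories a locale actually selects is the built-in `Intl.PluralRules` API, which relies on the plural data shipped with the runtime. The following is a minimal sketch, assuming a runtime with English locale data available; the helper name `categoriesFor` is hypothetical and not part of any FormatJS package.

```javascript
// Hypothetical helper: list the cardinal plural categories a locale can select,
// as reported by the runtime's Intl.PluralRules implementation.
function categoriesFor(locale) {
  return new Intl.PluralRules(locale).resolvedOptions().pluralCategories;
}

const english = new Intl.PluralRules('en');

// English only ever selects "one" or "other" for cardinal numbers.
console.log(categoriesFor('en'));
console.log(english.select(1)); // "one"
console.log(english.select(2)); // "other"
```

Because `select(2)` returns `other` in English, the `two` and `many` branches in the sample message above are dead code.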

Describe the solution you'd like If a message uses plural categories that are invalid for its language, an exception should be thrown.

As far as I know, the parser does not accept a language parameter. One could be added as an optional parameter to enable language-specific validation.

Describe alternatives you've considered Right now I am using this temporary workaround to validate plural rules, but it would be better to have this built into the parser:

    /**
     * Validate language specific plural forms.
     *
     * @param {Array} pluralCategories - The plural categories from a specific plural statement in a message.
     * @param {string} language         - The language of the message.
     *
     * @throws {Error} when a plural rule does not meet CLDR requirements for given language.
     */
    validatePluralForms: function (pluralCategories, language) {
        // Remove explicit categories.
        pluralCategories = pluralCategories.filter(function(category) {
            return category[0] !== '=';
        });

        // Check plural rules per language.
        if (language === 'en') {
            if (pluralCategories.length !== 2 || !pluralCategories.includes('one') ||
                !pluralCategories.includes('other')) {
                throw new Error("English must use 2 plural forms: 'one' and 'other'");
            }
        }
    }

longlho commented 5 years ago

I'm not sure I understand your request. This is not just about CLDR restrictions: some translation vendors don't support the full set of plural rules, so even though a language has, say, 3 rules, they might only support one & other.

nbouvrette commented 5 years ago

Do you have a concrete example to help me understand why a vendor would only want to translate a subset of the CLDR plural forms? Is it just for cost reasons, or is it that the CLDR has missing rules or rules that people don't agree with?

I am actually in the opposite situation: to avoid learning the intricacies of the ICU syntax, linguists proposed to explicitly set all the tags all the time, regardless of whether they are used. This is why I am planning to release a visual tool soon to help avoid this.

I was thinking that having a way to know when a plural form is missing could help them make the right decision, but I had never thought that someone might not want to use all CLDR forms.

longlho commented 5 years ago

This might fall nicely into our future CLI that I’m planning to write :)

longlho commented 5 years ago

I’m hesitant to put this in the parser because the parser rn only parses and does not enforce any CLDR. It doesn’t have access to that dataset.

nbouvrette commented 5 years ago

> This might fall nicely into our future CLI that I’m planning to write :)

You got me curious - is the project available anywhere yet?

> I’m hesitant to put this in the parser because the parser rn only parses and does not enforce any CLDR. It doesn’t have access to that dataset.

Yes, maybe this is more of a validator/tool feature, which could pull the data from the CLDR and issue warnings rather than exceptions. I did more research this morning and found at least one case where you might want to skip a CLDR category and use an explicit value instead:

Instead of:

You have {count, plural, =0 {no unread messages} one {# unread message} other {# unread messages}}.

You could use (without the need for one, since it is already covered by =1):

You have {count, plural, =0 {no unread messages} =1 {one unread message} other {# unread messages}}.

Since it's common to want special content for 0 and 1, and a lot of languages have rules for them, I think enforcing the usage of all rules would be too aggressive.

I'll see how I can fit this type of feature into a validation tool rather than the parser itself.
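As a rough illustration of what such a validation step could look like, here is a minimal sketch that returns warnings instead of throwing and ignores explicit `=N` selectors. It takes the locale's categories from the built-in `Intl.PluralRules` API rather than from a hardcoded per-language check; the function name `checkPluralCategories` is hypothetical, and the input is assumed to be the list of selectors already extracted from one plural argument.

```javascript
// Hypothetical validator: returns an array of warning strings (empty if none).
function checkPluralCategories(usedSelectors, locale) {
  const warnings = [];

  // Cardinal categories the runtime's plural data defines for this locale.
  const allowed = new Set(
    new Intl.PluralRules(locale).resolvedOptions().pluralCategories
  );

  // Explicit selectors such as "=0" or "=1" are not CLDR categories; skip them.
  const keywords = usedSelectors.filter((s) => !s.startsWith('='));

  for (const keyword of keywords) {
    if (!allowed.has(keyword)) {
      warnings.push(`"${keyword}" is never selected for locale "${locale}"`);
    }
  }

  // ICU MessageFormat requires an "other" branch as the fallback.
  if (!keywords.includes('other')) {
    warnings.push('"other" is missing');
  }

  return warnings;
}

// Two warnings: "two" and "many" are unused in English.
console.log(checkPluralCategories(['one', 'two', 'many', 'other'], 'en'));
// No warnings: "=1" stands in for "one", so omitting "one" is not flagged.
console.log(checkPluralCategories(['=0', '=1', 'other'], 'en'));
```

This sketch deliberately does not warn when a CLDR category such as `one` is omitted, since an explicit value like `=1` may cover it.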

longlho commented 5 years ago

Dropbox has a bunch of internal tooling for this and I’m pushing to open source it. Things like a linter for ICU and whatnot, since our translation vendor has a bunch of restrictions (e.g. no selectordinal, no nested plurals...)
