[ML] What to do about lang_ident for empty strings and numbers?

droberts195 commented 2 years ago

It's been pointed out that the lang_ident inference processor returns Japanese as the language for empty strings and numbers.

For example:

{
  "doc" : {
    "_index" : "_index",
    "_id" : "_id",
    "_source" : {
      "contents" : "",
      "_ml" : {
        "lang_ident" : {
          "prediction_score" : 0.7837568024575047,
          "model_id" : "lang_ident_model_1",
          "top_classes" : [
            {
              "class_name" : "ja",
              "class_probability" : 0.7837568024575047,
              "class_score" : 0.7837568024575047
            },
            {
              "class_name" : "ko",
              "class_probability" : 0.14699680203424537,
              "class_score" : 0.14699680203424537
            },
            {
              "class_name" : "sr",
              "class_probability" : 0.04528638971813643,
              "class_score" : 0.04528638971813643
            }
          ],
          "prediction_probability" : 0.7837568024575047,
          "predicted_value" : "ja"
        }
      }
    },
    "_ingest" : {
      "timestamp" : "2021-12-20T14:46:31.19367Z"
    }
  }
}

Since we have nothing to go on in these cases, a configurable default is probably the best we can do. Alternatively, we could treat the absence of any character in any alphabet as an error and use the failure handler functionality of ingest processors to let the user supply alternative processors for this case (which could just be a set processor that applies a default). Or maybe there's an even better solution. But we shouldn't just predict Japanese in this situation.
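
To make the "treat it as an error and let a failure handler decide" idea concrete, here is a minimal Python sketch. It is not the processor's implementation: `NoLinguisticContentError`, `contains_letter`, `identify_language` and `identify_with_fallback` are hypothetical names, `classify` stands in for the real model, and "a character in any alphabet" is approximated by Python's Unicode-aware `str.isalpha()`.

```python
from typing import Callable


class NoLinguisticContentError(ValueError):
    """Hypothetical error: the input has no letter in any script."""


def contains_letter(text: str) -> bool:
    # str.isalpha() is True for Unicode letters in any script, so digits,
    # punctuation, whitespace and the empty string all yield False.
    return any(ch.isalpha() for ch in text)


def identify_language(text: str, classify: Callable[[str], str]) -> str:
    # Refuse to guess when there is nothing to go on.
    if not contains_letter(text):
        raise NoLinguisticContentError(repr(text))
    return classify(text)


def identify_with_fallback(
    text: str,
    classify: Callable[[str], str],
    on_failure: Callable[[str], str],
) -> str:
    # Loosely mirrors an ingest failure handler: the caller decides what
    # happens when identification is refused.
    try:
        return identify_language(text, classify)
    except NoLinguisticContentError:
        return on_failure(text)
```

A handler that simply returns a fixed code, for example `lambda _text: "en"`, plays the role of the set processor applying a default.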

elasticmachine commented 2 years ago

Pinging @elastic/ml-core (Team:ML)

hendrikmuhs commented 2 years ago

ISO 639-1 (2-letter codes) unfortunately has no definition for "undefined". ISO 639-2/3 (3-letter codes) have a couple of "special codes" that we could apply in this case.

As we don't know the use case, I would not hard-code a default. My preference (one or the other, not both):

- return an empty string "" as class_name, with probability and score 1.0
- add a parameter for the default language and return it as the output; probability and score should probably be 0.0 in this case
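
The two options can be compared side by side with a small Python sketch that builds the result object for each. The field names are copied from the example output earlier in this issue. The function name `no_content_result` and its `default_language` parameter are hypothetical and only illustrate the proposal.

```python
from typing import Optional


def no_content_result(default_language: Optional[str] = None) -> dict:
    """Result for input that gives the model nothing to classify."""
    if default_language is None:
        # Option 1: empty class name, reported with probability and score 1.0.
        name, value = "", 1.0
    else:
        # Option 2: the configured default, reported with probability and score 0.0.
        name, value = default_language, 0.0
    return {
        "predicted_value": name,
        "prediction_probability": value,
        "prediction_score": value,
        "top_classes": [
            {"class_name": name, "class_probability": value, "class_score": value}
        ],
    }
```

With option 2, a consumer can still tell a default from a real prediction by checking for a probability of 0.0.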

Longer term, we could switch to ISO 639-3. We would probably need a parameter and a defined migration path for this switch.

stevedodson commented 2 years ago

CLD3 returns BCP-47-style language codes. From its documentation: "The model outputs BCP-47-style language codes, shown in the table below. For some languages, output is differentiated by script. Language and script names from Unicode CLDR."

An option could be to return "und": "Unknown language" which is consistent with https://github.com/unicode-cldr/cldr-localenames-modern/blob/master/main/en/languages.json

droberts195 commented 2 years ago

https://github.com/unicode-cldr/cldr-localenames-modern/blob/master/main/en/languages.json also has "zxx": "No linguistic content" as an option. That might be what was intended to be returned for empty strings and strings containing no letters at all in any script.
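
As a rough illustration of how either special code could be substituted for the model's guess, here is a Python sketch. The two code/name pairs are the ones quoted above from the CLDR file. The function `language_or_sentinel`, its parameters, and the use of `str.isalpha()` to detect "no letters at all in any script" are assumptions made for the example, not existing processor behaviour.

```python
# Special codes and English names as quoted from the CLDR languages.json file.
CLDR_SPECIAL_CODES = {
    "und": "Unknown language",
    "zxx": "No linguistic content",
}


def language_or_sentinel(text: str, predicted: str, sentinel: str = "zxx") -> str:
    """Return the model's prediction, or a special code for letterless input."""
    if sentinel not in CLDR_SPECIAL_CODES:
        raise ValueError(f"unsupported sentinel code: {sentinel!r}")
    # Empty strings and strings made only of digits, punctuation or
    # whitespace contain no character for which isalpha() is True.
    has_letter = any(ch.isalpha() for ch in text)
    return predicted if has_letter else sentinel
```

Passing `sentinel="und"` gives the "Unknown language" variant instead.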

elastic / elasticsearch #81933