akihikodaki / cld3-ruby

cld3-ruby is an interface of Compact Language Detector v3 (CLD3) for Ruby.
Apache License 2.0

Empty and nonsense strings are detected as being in a language #17

Closed. hult closed this issue 6 years ago.

hult commented 6 years ago

Hi,

It may be the underlying CLD3 library rather than your wrapper, but:

> cld3 = CLD3::NNetLanguageIdentifier.new(0)
> cld3.find_language("")
=> #<struct Struct::Result language=:ja, probability=0.7837570905685425, :reliable?=true, proportion=1.0>
> cld3.find_language("123")
=> #<struct Struct::Result language=:ja, probability=0.7837570905685425, :reliable?=true, proportion=1.0>

You can get rid of this specific error by requiring at least one byte of data, but still:

> cld3 = CLD3::NNetLanguageIdentifier.new(1)
> cld3.find_language("a")
=> #<struct Struct::Result language=:lb, probability=0.9725591540336609, :reliable?=true, proportion=1.0>

akihikodaki commented 6 years ago

tl;dr

Returning an arbitrary language for a string that is too short may be acceptable. The problem is that the result also reports a very high probability and is flagged as reliable.

I skimmed the source code of the underlying library. Here is the relevant probability calculation code:

  EmbeddingNetwork::Vector scores;
  network_.ComputeFinalScores(features, &scores);
  int prediction_id = -1;
  float max_val = -std::numeric_limits<float>::infinity();
  for (size_t i = 0; i < scores.size(); ++i) {
    if (scores[i] > max_val) {
      prediction_id = i;
      max_val = scores[i];
    }
  }

  // Compute probability.
  Result result;
  float diff_sum = 0.0;
  for (size_t i = 0; i < scores.size(); ++i) {
    diff_sum += exp(scores[i] - max_val);
  }
  const float log_sum_exp = max_val + log(diff_sum);
  result.probability = exp(max_val - log_sum_exp);

In short, the calculation does not take the length of the string into account at all. It is a softmax over the network's output scores: the probability is higher when fewer features contradict the top result, and lower when more of them do. Since a short string yields few features, the probability stays high. The reliability flag is derived from the probability, so you cannot rely on it either.
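To make the point concrete, here is a minimal Ruby sketch of the same probability step. It is a port of the quoted C++ for illustration only, not part of cld3-ruby; the method name `top_probability` and the score values are made up.

```ruby
# Illustrative port of the quoted probability step (softmax of the top score).
# `scores` stands in for the network's per-language output; nothing here
# depends on how long the input text was.
def top_probability(scores)
  max_val = scores.max
  diff_sum = scores.sum { |s| Math.exp(s - max_val) }
  log_sum_exp = max_val + Math.log(diff_sum)
  Math.exp(max_val - log_sum_exp)
end

# Made-up scores: one language clearly ahead vs. all languages tied.
peaked = [5.0, 0.0, 0.0, 0.0]
flat   = [1.0, 1.0, 1.0, 1.0]

puts top_probability(peaked) # close to 1, however little text produced the scores
puts top_probability(flat)   # 0.25, i.e. 1 / number of candidates
```

The only inputs are the scores themselves, so a handful of features that happen to favor one language is enough to produce a confident-looking value.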

So how can we get a reliable result? We need a string long enough to extract multiple features. The default minimum, which Chromium also uses, is 140 bytes. Chromium also sets the minimum to 0 (no requirement) in some cases. Choose the value according to the reliability you need.
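If you would rather enforce a minimum length at the call site, a guard along these lines might work. This is only a sketch: `detect_if_long_enough` and `build_stub_identifier` are hypothetical names, the stub stands in for a real `CLD3::NNetLanguageIdentifier` so the example runs without the gem, and the length is assumed to be measured in bytes, as in the "at least one byte" wording earlier in this thread.

```ruby
# Hypothetical helper: returns the detection result only when the input is
# long enough to be worth trusting and the result is flagged reliable.
# The 140 default mirrors the figure mentioned above; tune it to your needs.
def detect_if_long_enough(identifier, text, min_bytes: 140)
  return nil if text.bytesize < min_bytes

  result = identifier.find_language(text)
  result if result && result.reliable?
end

# Stand-in for a real identifier; it always answers :ja and "reliable",
# much like the short-input results reported in this issue.
def build_stub_identifier
  result = Struct.new(:language, :reliable).new(:ja, true)
  result.define_singleton_method(:reliable?) { reliable }
  identifier = Object.new
  identifier.define_singleton_method(:find_language) { |_text| result }
  identifier
end

stub = build_stub_identifier
p detect_if_long_enough(stub, "")    # nil: too short to trust
p detect_if_long_enough(stub, "123") # nil: too short to trust
p detect_if_long_enough(stub, "a" * 200).language
```

With the real gem you would pass your own identifier instance instead of the stub. Passing a non-zero minimum to the constructor, as in the first comment, is the built-in way to express the same requirement.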