Empty and nonsense strings are detected as being in a language

tl;dr

The underlying library provides no signal you can use to reject such unreliable results.
Choose a minimum input length based on the reliability you need.

Returning an arbitrary language for a string that is too short may be acceptable. The problem is that the library also reports a very high probability and marks the result as reliable.

I skimmed the source code of the underlying library. Here is the relevant probability calculation code:

  EmbeddingNetwork::Vector scores;
  network_.ComputeFinalScores(features, &scores);
  int prediction_id = -1;
  float max_val = -std::numeric_limits<float>::infinity();
  for (size_t i = 0; i < scores.size(); ++i) {
    if (scores[i] > max_val) {
      prediction_id = i;
      max_val = scores[i];
    }
  }

  // Compute probability.
  Result result;
  float diff_sum = 0.0;
  for (size_t i = 0; i < scores.size(); ++i) {
    diff_sum += exp(scores[i] - max_val);
  }
  const float log_sum_exp = max_val + log(diff_sum);
  result.probability = exp(max_val - log_sum_exp);

In short, the calculation does not take the length of the string into account at all. The probability is higher when fewer features contradict the result and lower when more features contradict it. Because a short string yields few features, the probability stays high. The reliability flag is derived from the probability, so you cannot rely on it either.
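
To make the point concrete, here is a minimal sketch of the same calculation as a standalone function. `TopProbability` is a name made up for illustration, not a library function, and the empty-input behavior is my own assumption. The only input is the score vector; nothing about the length of the text appears anywhere.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical helper, not part of the library: probability of the top class,
// computed the same way as the excerpt above (a numerically stable softmax).
float TopProbability(const std::vector<float>& scores) {
  if (scores.empty()) return 0.0f;  // assumption: no scores, no confidence
  const float max_val = *std::max_element(scores.begin(), scores.end());
  float diff_sum = 0.0f;
  for (float s : scores) {
    diff_sum += std::exp(s - max_val);
  }
  // exp(max_val - (max_val + log(diff_sum))) simplifies to 1 / diff_sum.
  return 1.0f / diff_sum;
}
```

With invented scores `{5, 0, 0}` this gives about 0.987, and adding the same constant to every score does not change the value. The probability only reflects how far the top score is ahead of the others, so a short input that happens to produce one dominant score looks just as confident as a long one.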

So how can we get a reliable result? We need a string long enough to extract multiple features. The default minimum, which Chromium also uses, is 140 bytes (the library counts bytes, not characters). In some cases Chromium sets the minimum to 0 (no requirement). Choose the value according to the reliability you require.
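
If you would rather enforce the minimum in your own code, a length gate could look like the sketch below. Everything here is hypothetical: `Detection`, `RejectIfTooShort` and `kMinLength` are illustrative names rather than library API, `"und"` is just a placeholder for "unknown", and measuring length with `std::string::size()` (bytes) is an assumption that differs from a character count for multi-byte UTF-8 text.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for a detector's output; not the library's type.
struct Detection {
  std::string language;
  float probability;
};

// Threshold taken from the text above; tune it to the reliability you need.
constexpr std::size_t kMinLength = 140;

// Hypothetical gate: discard a detection when the input was too short to
// trust. Length is measured in bytes via size(), which is an assumption.
Detection RejectIfTooShort(const std::string& text, Detection detected,
                           std::size_t min_length = kMinLength) {
  if (text.size() < min_length) {
    return Detection{"und", 0.0f};  // placeholder for "unknown"
  }
  return detected;
}
```

Passing `0` as `min_length` disables the check, which corresponds to the "no requirement" setting mentioned above.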

akihikodaki / cld3-ruby
