tl;dr
Returning an arbitrary language for a string that is too short may be acceptable. However, the library also reports a very high probability and marks the result as reliable, which is misleading.
I skimmed the source code of the underlying library. Here is the relevant probability calculation code:
EmbeddingNetwork::Vector scores;
network_.ComputeFinalScores(features, &scores);
int prediction_id = -1;
float max_val = -std::numeric_limits<float>::infinity();
for (size_t i = 0; i < scores.size(); ++i) {
  if (scores[i] > max_val) {
    prediction_id = i;
    max_val = scores[i];
  }
}
// Compute probability.
Result result;
float diff_sum = 0.0;
for (size_t i = 0; i < scores.size(); ++i) {
  diff_sum += exp(scores[i] - max_val);
}
const float log_sum_exp = max_val + log(diff_sum);
result.probability = exp(max_val - log_sum_exp);
In short, the calculation does not take the length of the string into account at all. The probability is higher when fewer features contradict the result, and lower when more such features are present. Because a short string yields only a few features, the probability stays high. The reliability flag is derived from the probability, so you cannot rely on that value either.
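To make this concrete, here is a minimal standalone sketch of the same arithmetic. The quoted snippet computes a softmax value for the top score, so the result depends only on the gaps between scores. `TopProbability` is a hypothetical helper name and the score values used with it are made up; it is not part of the library.

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical helper (not part of CLD3): returns the softmax probability of
// the highest score, following the same max-shifted log-sum-exp steps as the
// quoted snippet. Only the score vector is an input; the length of the text
// that produced the scores never enters the calculation.
float TopProbability(const std::vector<float>& scores) {
  assert(!scores.empty());  // An empty vector would produce NaN.

  float max_val = -std::numeric_limits<float>::infinity();
  for (float s : scores) {
    if (s > max_val) max_val = s;
  }

  float diff_sum = 0.0f;
  for (float s : scores) {
    diff_sum += std::exp(s - max_val);
  }

  const float log_sum_exp = max_val + std::log(diff_sum);
  return std::exp(max_val - log_sum_exp);
}
```

With illustrative scores such as `{5.0f, 0.0f, 0.0f}`, the function returns a value above 0.98, and it would return the same value whether those scores came from two characters or two thousand.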
So how can we get a reliable result? We need a string long enough to extract multiple features. The default minimum, which Chromium also uses, is 140 characters. Chromium also sets the minimum to 0 characters (no requirement) in some cases. You may choose the value depending on the reliability you require.
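If the wrapper is called with a low minimum, one option is to apply a length threshold on the caller's side. The sketch below is only an illustration under assumptions: `Detection` and `RequireMinLength` are hypothetical names, and the struct fields are not taken from the wrapper or from CLD3.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type; the field names are assumptions for illustration.
struct Detection {
  std::string language;
  float probability;
  bool is_reliable;
};

// Marks a detection as unreliable when the input is shorter than min_bytes.
// The language and probability are left untouched so callers can still
// inspect them. A min_bytes of 0 means "no requirement".
Detection RequireMinLength(const std::string& text, Detection raw,
                           std::size_t min_bytes) {
  if (text.size() < min_bytes) {
    raw.is_reliable = false;
  }
  return raw;
}
```

Calling `RequireMinLength(text, result, 140)` would then downgrade any result for input shorter than 140 bytes, while a threshold of 0 leaves every result as reported.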
Hi,
It may be the underlying CLD3 library rather than your wrapper, but:
You can get rid of this specific error by requiring at least one byte of data, but still: