abadojack / whatlanggo

Natural language detection library for Go
MIT License

What does a very negative confidence rate mean? #20

Open ghost opened 4 years ago

ghost commented 4 years ago

Hi,

Hope you are all well!

I am getting confidence rates such as -18.66532829205885 or -10.605926394815977. What does that mean?

Language: Yoruba  Script: Latin  Confidence:  -8.652592309409306
Language: Turkmen  Script: Latin  Confidence:  -5.528339197102301
Language: Yoruba  Script: Latin  Confidence:  -8.163311123289779
Language: Chewa  Script: Latin  Confidence:  -0.8738781333466048
Language: Yoruba  Script: Latin  Confidence:  -7.287061394685147
Language: Yoruba  Script: Latin  Confidence:  -9.46254452788719
Language: Mandarin  Script: Han  Confidence:  1
Language: English  Script: Latin  Confidence:  -18.66532829205885
Language: Yoruba  Script: Latin  Confidence:  -10.605926394815977

Cheers, X

PaulRie commented 4 years ago

Hi, I noticed this weird behaviour too. It is a general problem with how the distance between the trigrams in the text and the trigrams of a specific language is calculated. That distance is supposed to be at most 90000. The maximum occurs, for example, when no trigram in the text matches any trigram of the language: each of the 300 trigrams is then assigned a distance of 300, which sums to 300 × 300 = 90000.

Later, when the confidence is calculated, the distance is subtracted from this maximum value of 90000.

The problem is that if a text has more than 300 distinct trigrams (say 1000), the distance is calculated incorrectly. If, for example, the trigram at position 1000 in the text matches a trigram of some language, the distance for that trigram is calculated as abs(positionOfTrigramInLanguage - positionOfTrigramInText). That distance is therefore somewhere between 700 and 1000, which is more than the intended maximum of 300. As a result, the overall distance can exceed 90000, and the subtraction in the confidence calculation yields a negative value. These negative confidence values can also be less than -1.
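The overflow described above can be reproduced with a minimal sketch. This is not the library's source: the names totalDistance, maxTrigramDistance and maxTotalDistance, the placeholder profile keys, and the "every rank shifted by 700" text are all assumptions made for illustration, following only the description in this comment.

```go
package main

import "fmt"

const (
	maxTrigramDistance = 300                                     // penalty for a trigram missing from the text
	maxTotalDistance   = maxTrigramDistance * maxTrigramDistance // 300 * 300 = 90000
)

func abs(n int) int {
	if n < 0 {
		return -n
	}
	return n
}

// totalDistance sums, for every trigram in the language profile, the gap
// between its rank in the profile and its rank in the text. Trigrams that
// do not occur in the text cost maxTrigramDistance. (Hypothetical helper.)
func totalDistance(langTrigrams []string, textRanks map[string]int) int {
	total := 0
	for langRank, trigram := range langTrigrams {
		if textRank, ok := textRanks[trigram]; ok {
			// Not bounded by 300 once the text has more than 300 distinct trigrams.
			total += abs(textRank - langRank)
		} else {
			total += maxTrigramDistance
		}
	}
	return total
}

func main() {
	// Hypothetical profile of 300 placeholder keys standing in for trigrams.
	lang := make([]string, maxTrigramDistance)
	for i := range lang {
		lang[i] = fmt.Sprintf("t%03d", i)
	}

	// Hypothetical long text: every profile entry occurs, but 700 ranks later
	// than in the profile, so each per-trigram distance is 700.
	text := make(map[string]int, len(lang))
	for i, t := range lang {
		text[t] = i + 700
	}

	d := totalDistance(lang, text)
	fmt.Println(d, maxTotalDistance-d) // 210000 -120000
}
```

In this sketch every trigram matches, yet the total distance (210000) is far above the assumed maximum of 90000, so the derived score is negative. Capping each per-trigram distance at 300 would be one possible way to keep the total within bounds; whether the library does this is not stated in this thread.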

This is the line from detect.go where the confidence rate is calculated:

rate := float64((score1 - score2)) / float64(score2)

If score2 is negative because of the situation I explained above, the rate becomes negative too, and the closer score2 is to zero, the more extreme the value gets.
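To make the arithmetic concrete, here is a small sketch that wraps the quoted expression in a function. The function name confidenceRate, the way the scores are derived from distances (90000 minus the total distance, as described earlier in this comment), and the input distances are assumptions chosen for illustration, not values taken from the library.

```go
package main

import "fmt"

const maxTotalDistance = 90000 // assumed maximum distance, as described above

// confidenceRate applies the quoted expression to the total distances of the
// best (dist1) and second-best (dist2) candidate languages. (Hypothetical helper.)
func confidenceRate(dist1, dist2 int) float64 {
	score1 := maxTotalDistance - dist1
	score2 := maxTotalDistance - dist2
	return float64(score1-score2) / float64(score2)
}

func main() {
	// Both distances below 90000: scores are 30000 and 15000.
	fmt.Println(confidenceRate(60000, 75000)) // 1

	// Second distance above 90000: scores are 5000 and -250,
	// so the rate is 5250 / -250.
	fmt.Println(confidenceRate(85000, 90250)) // -21
}
```

The second call shows how a slightly negative score2 in the denominator produces a rate far below -1, which matches the kind of values reported at the top of this issue.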

ghost commented 4 years ago

@PaulRie many thanks for the explanation.

Maybe @abadojack can fix it?