aboSamoor / polyglot

Multilingual text (NLP) processing toolkit
http://polyglot-nlp.com

Language Detection: weird errors #138

Open AndRossi opened 6 years ago

AndRossi commented 6 years ago

Hello Polyglot team,

First of all thank you for your great work - polyglot is amazing!

I am opening an issue because I am seeing some quite weird results from your language detection on a few sentences.

This is one of the sentences: "Foglione j'adore share < Retour accepter la nouvelle règle participer je prends vision du reglement amis deja sortis Attention! J'accepte les termes et conditions la promotion la confidentialité amis déjà inscrits ou invités";

I know it is quite messy and nonsensical, but it should definitely be recognized as French. Instead, the Polyglot Detector returns these languages and probabilities: it:95.0; un:0.0; un:0.0

I thought that the first word, "foglione", might be the cause of the error (even though it's just one word in a sentence of more than 30 words), so I removed it, but the string starting with "j'adore..." gives me this output: en:93.0; un:0.0; un:0.0

I only get French if I remove "share" too.

Am I missing something?

Andrea

janissl commented 6 years ago

It looks like polyglot only reads your string from the beginning up to the angle bracket (the first 24 characters) and ignores the remaining part, treating it as an HTML tag.
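
A quick way to check this guess is to look at what sits in front of the first `<`. The sketch below uses only plain Python string operations and does not call polyglot at all; `sentence` is the opening of the string from the report, shortened for brevity, and the variable names are just illustrative.

```python
# Opening of the sentence from the original report (shortened for brevity).
sentence = "Foglione j'adore share < Retour accepter la nouvelle règle participer"

# 0-based position of the first opening angle bracket.
bracket_pos = sentence.find("<")

# Text before the bracket, and the same text including the bracket itself.
before_bracket = sentence[:bracket_pos]
through_bracket = sentence[:bracket_pos + 1]

print(repr(before_bracket))
print(len(through_bracket))
```

Counting the bracket itself, the prefix is 24 characters long, and the only words in it are "Foglione", "j'adore" and "share". If the rest really is discarded, that would be consistent with the Italian and English results reported above.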

AndRossi commented 6 years ago

Hi @janissl ,

Thank you for your quick answer. You are right: before sending my strings to the Polyglot Detector, I strip XML tags and unescape HTML entities; however, I did not think that a lone opening angle bracket like that might be treated as a tag by the Detector. My bad 😅 I am going to remove any remaining angle brackets before using the Detector, then.
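
For anyone landing here with the same problem, a minimal sketch of that cleanup step could look like this. The helper name `strip_angle_brackets` is hypothetical (it is not part of polyglot), and it assumes that replacing each bracket with a space is acceptable for the input; the polyglot call is left as a comment so the snippet runs without the library installed.

```python
import re


def strip_angle_brackets(text: str) -> str:
    """Replace every '<' and '>' with a space, then collapse repeated whitespace."""
    no_brackets = re.sub(r"[<>]", " ", text)
    return re.sub(r"\s+", " ", no_brackets).strip()


cleaned = strip_angle_brackets("Foglione j'adore share < Retour accepter la nouvelle règle")
print(cleaned)

# With polyglot installed, the cleaned text could then be passed on, e.g.:
#   from polyglot.detect import Detector
#   print(Detector(cleaned).language.code)
```

Replacing with a space rather than deleting keeps neighbouring words from being glued together when a bracket has no whitespace around it.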

In your experience, are there any other characters or character sequences that I should check for? (As in the sentence in my previous comment, my sentences can contain literally any character...)

Thank you again, and congratulations once more on Polyglot: it is a really incredible tool!

Andrea