facebookresearch / fastText

Library for fast text representation and classification.
https://fasttext.cc/
MIT License
25.94k stars 4.72k forks

Language detection is sensitive to punctuation #495

Closed SriniNaga closed 6 years ago

SriniNaga commented 6 years ago

test text: "blast mango blast it" detected: fr
test text: "blast, mango blast it" detected: en

Does it require any special preprocessing before language detection?

loretoparisi commented 6 years ago

@SriniNaga you should at least normalize the text, for example:

# Reads stdin: unifies quote characters, pads punctuation with spaces, drops some symbols, blanks out digits.
normalize_text() {
    sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" -e "s/'/ ' /g" -e "s/“/\"/g" -e "s/”/\"/g" \
        -e 's/"/ " /g' -e 's/\./ . /g' -e 's/<br \/>/ /g' -e 's/, / , /g' -e 's/(/ ( /g' -e 's/)/ ) /g' -e 's/!/ ! /g' \
        -e 's/?/ ? /g' -e 's/;/ /g' -e 's/:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 's/\*/ /g' -e 's/|/ /g' \
        -e 's/«/ /g' | tr 0-9 " "
}
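To see what this kind of normalization does to the two strings from the report, here is a reduced sketch. `normalize_ascii` is a hypothetical helper name, not part of fastText; it keeps only a few of the ASCII punctuation rules from the function above.

```shell
# Hypothetical helper (not part of fastText): a reduced, ASCII-only subset
# of the normalize_text rules, enough for the strings in this issue.
normalize_ascii() {
    sed -e 's/, / , /g' -e 's/\./ . /g' -e 's/!/ ! /g' -e 's/?/ ? /g' \
        -e 's/;/ /g' -e 's/:/ /g'
}

# The two inputs from the report differ only by a comma.
printf '%s\n' 'blast mango blast it'  | normalize_ascii   # blast mango blast it
printf '%s\n' 'blast, mango blast it' | normalize_ascii   # blast , mango blast it
```

Note that the comma is turned into a separate token rather than removed, so the two inputs still differ after normalization. Whether that changes the predicted label depends on the model.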

EdouardGrave commented 6 years ago

Hi @SriniNaga,

Thank you for reporting this issue.

We did not apply any pre-processing to the data before training the language identifier. The motivation is that the appropriate pre-processing for a piece of text depends on its language (e.g. tokenization is language dependent). As a result, the language identifier is sensitive to punctuation.

One of the issues here is that the text is very short, which makes the predictions of our model less reliable. We might release a new version of the model in the future that works better on very short texts.

Best, Edouard.

loretoparisi commented 6 years ago

@SriniNaga @EdouardGrave I'm just adding that we have successfully replaced both CLD2 and the Google Translate API (for detection) with this model, since the accuracy for each language pair, and even for mixed-language text, was the same or better. At inference time I apply text pre-processing like the function above, plus a few other steps such as diacritics removal and double-byte detection for Chinese/Japanese, etc. Also, in our case the text at inference time is a document, not short sentences (like tweets, etc.).
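The "double byte detection" step is not spelled out in this comment. One rough way to approximate it in shell, offered as an assumption rather than the author's actual method, is to count bytes outside the ASCII range, since Chinese and Japanese characters are encoded as multi-byte sequences in UTF-8. `count_non_ascii_bytes` is a hypothetical helper name. Accented Latin, Cyrillic and other scripts are counted too, so this only flags text that is not plain ASCII.

```shell
# Hypothetical helper: count the bytes >= 0x80 read from stdin.
# LC_ALL=C makes tr operate on raw bytes instead of decoded characters.
count_non_ascii_bytes() {
    LC_ALL=C tr -d '\000-\177' | wc -c | tr -d ' '
}

printf '%s' 'blast mango blast it' | count_non_ascii_bytes   # 0
printf '%s' '日本語' | count_non_ascii_bytes                  # 9 when saved as UTF-8
```

A non-zero count could be used to route such text to a different tokenization path before prediction; the cut-off to use is a design choice this thread does not specify.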

NOTE: To be fair, I'm aware that CLD2 has a newer, neural-network-based (word embedding) successor, CLD3, but I haven't tried it yet. Still, I think this fastText model with preprocessing is a good choice!

Hope this helps!

SriniNaga commented 6 years ago

@loretoparisi @EdouardGrave Thank you! Most of our texts are short (ranging from 20 to 2000 characters). For long texts the accuracy is good; the problem is only with short texts.

rmlopes commented 2 years ago

One of the issues here is that the text is very short, which makes the predictions of our model less reliable. We might release a new version of the model in the future that works better on very short texts.

Hi @EdouardGrave, any news on this?